InterpreterPoolExecutor - We Need PEP 734
Python programs can sometimes be compute-bound in surprising ways. Recently I tried refactoring a program that downloaded 4 JSON files, parsed them, and made them available to a larger program. When I rolled out my “improvement”, it actually made the code slower, and I had to quickly fix it. How could I have avoided this?
What We Should Expect from a Good Program
A few things would make our lives easier. Python has not traditionally made the following easy, but we are right on the cusp of having our cake and eating it too. Here’s what I would expect from a good program:
- Easy to Parallelize. If the code is slow, we should be able to split it up.
- Easy to Profile. If the code is slow, it should be easy to figure out why.
Let’s see if we can get both at the same time.
Hard to Parallelize
The original authors had used os.fork() to achieve parallelism, which has problems. I assumed this was done to avoid using threads directly, or for some other reason, but that turned out not to be the case. “Downloading some JSON and sticking it in Redis? That’s definitely IO-bound.” Wrong. The JSON parser in Python is very slow, to the point that downloading and parsing all 4 versions took more than 60 seconds, while the refresh interval for this code was only 1 minute. When I replaced the fork-based code with a ThreadPoolExecutor, the code started taking anywhere from minutes to nearly hours to finish. It seemed IO-bound, but it was actually CPU-bound.
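Here’s a minimal sketch of what the threaded refactor looked like, with hypothetical URLs and without the Redis write. The thread pool overlaps the downloads nicely, but because json.loads() is CPU-bound work that holds the GIL, the four parses still run one at a time.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [f"https://example.com/feed-{i}.json" for i in range(4)]  # hypothetical

def fetch_and_parse(url: str) -> dict:
    with urlopen(url) as resp:
        raw = resp.read()   # IO-bound: threads overlap here just fine
    return json.loads(raw)  # CPU-bound: serialized by the GIL

with ThreadPoolExecutor(max_workers=4) as pool:
    documents = list(pool.map(fetch_and_parse, URLS))
```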
Hard to Profile
A more seasoned engineer might point out that I should have profiled this code before trying to “optimize” it. However, Python only recently gained the ability to integrate with perf. Unfortunately, the implementation creates a new, PID-named file, at an unconfigurable location, each time the process starts. In a fork-based concurrency world, that’s a lot of PIDs. And because these perf files aren’t small, you run the risk of maxing out the disk of the server you are profiling on. Secondly, these forked processes flare into, and out of, existence quickly (i.e. within seconds), so it’s hard to catch them in the act of doing anything. A long-lived process would be much easier to observe.
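For reference, here’s roughly how the perf integration gets turned on (Python 3.12+, Linux only); the workload and the perf invocation below are placeholders. Each process typically writes its own /tmp/perf-&lt;pid&gt;.map file, which is the per-PID, fixed-location behavior described above.

```python
# A sketch: enable the perf trampoline from inside the program. The same
# thing can be done externally with `python -X perf` or PYTHONPERFSUPPORT=1.
import sys

if sys.platform == "linux" and hasattr(sys, "activate_stack_trampoline"):
    sys.activate_stack_trampoline("perf")  # perf can now see Python frames

# ... run the workload, then attach with e.g. `perf record -F 99 -g -p <pid>`
```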
And Still Hard to Parallelize?
When I replaced my ThreadPoolExecutor with a ProcessPoolExecutor, this problem reared its head again. Because the pool’s worker processes aren’t tied to particular tasks, it’s hard to identify which processes to profile; tracking down all the PIDs associated with my pool is just as tricky as before. Secondly, switching from ThreadPoolExecutor to ProcessPoolExecutor is not straightforward. All the functions and arguments now need to be pickle-able, meaning things like references to class methods no longer work.
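The pickling constraint looks something like this (a toy illustration, not the original code): a plain module-level function submits fine, while something like a lambda fails because it can’t be pickled.

```python
import json
from concurrent.futures import ProcessPoolExecutor

def parse(raw: str) -> dict:  # module-level, so it can be pickled
    return json.loads(raw)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(pool.submit(parse, '{"a": 1}').result())  # works: {'a': 1}
        try:
            pool.submit(lambda raw: json.loads(raw), '{"a": 1}').result()
        except Exception as exc:  # typically a PicklingError
            print(f"lambda was rejected: {exc!r}")
```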
Parallel, Profile-able Python
Python 3.14 adds a new module and APIs for creating sub-interpreters (e.g. InterpreterPoolExecutor). Significant work has gone into CPython to make the interpreter state a thread-local, meaning it’s possible to run multiple “Pythons” in the same process. This helps us a lot because it means we can get the parallelism we want without the system overhead of running multiple processes (there’s a short sketch after the list below). Specifically:
- There’s no overhead of starting up multiple processes. The workers all share one process’s page tables, signal handlers, file descriptors, and so on.
- PIDs are way more stable. Sub-interpreters run in threads, so they all share the PID of the parent process instead of each getting their own.
- Memory sharing is (or will be) easier. Rather than having to convert Python objects in one interpreter to a serialized (cough Pickle cough) form, it will be much easier to synchronize with other workers. (Also, shout out to Ray, which has done the hard work to make this kind of sharing a lot easier.)
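Here’s a minimal sketch of the same fetch-and-parse work on top of InterpreterPoolExecutor (Python 3.14+). The URLs and the fetcher module are hypothetical, and since tasks are handed to the workers in serialized form, the sketch assumes the worker function lives in an importable module rather than in __main__.

```python
from concurrent.futures import InterpreterPoolExecutor

# Hypothetical module containing the fetch_and_parse() function from the
# earlier ThreadPoolExecutor sketch.
from fetcher import fetch_and_parse

URLS = [f"https://example.com/feed-{i}.json" for i in range(4)]  # hypothetical

if __name__ == "__main__":
    # Same Executor API, but each worker is a thread running its own
    # sub-interpreter (with its own GIL), so the CPU-heavy JSON parsing can
    # run in parallel inside a single, stable-PID process.
    with InterpreterPoolExecutor(max_workers=4) as pool:
        documents = list(pool.map(fetch_and_parse, URLS))
```

Profiling also gets simpler: there is one PID to attach perf to, and the workers stick around for the life of the pool.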
The multiple-runtimes-in-one-process model is not new; the most notable example is NodeJS. But it is a very welcome addition to Python. Given the amazing improvements around GIL removal and the new JIT in Python 3.13, Python is becoming a much more workable language for server development.