InterpreterPoolExecutor - We Need PEP 734
Python programs can sometimes be compute-bound in surprising ways. Recently I tried refactoring a program that downloaded 4 JSON files, parsed them, and made them available to a larger program. When I rolled out my “improvement”, it actually made the code slower, and I had to quickly fix it. How could I have avoided this?
What We Should Expect from a Good Program
A few things would make our lives easier. Python has not traditionally made the following easy, but we are right on the cusp of having our cake and eating it too. Here’s what I would expect from a good program:
- Easy to Parallelize. If the code is slow, we should be able to split it up.
- Easy to Profile. If the code is slow, it should be easy to figure out why.
Let’s see if we can get both at the same time.
Hard to Parallelize
The original authors had used os.fork() to achieve parallelism, which has problems. I assumed this was done to avoid using threads directly, or for some other reason, but that turned out not to be the case. “Downloading some JSON and sticking it in Redis? That’s definitely IO-bound.” Wrong. The JSON parser in Python is very slow, to the point that downloading and parsing all 4 versions took more than 60 seconds, while the refresh interval for this code was only 1 minute. When I replaced the fork-based code with a ThreadPoolExecutor, the code started taking anywhere from minutes to nearly hours to finish. It seemed IO-bound, but it was actually CPU-bound.
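Here’s a minimal sketch of what the threaded refactor looked like, with hypothetical URLs and without the Redis write. The thread pool overlaps the downloads nicely, but because json.loads() is CPU-bound work that holds the GIL, the four parses still run one at a time.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [f"https://example.com/feed-{i}.json" for i in range(4)]  # hypothetical

def fetch_and_parse(url: str) -> dict:
    with urlopen(url) as resp:
        raw = resp.read()   # IO-bound: threads overlap here just fine
    return json.loads(raw)  # CPU-bound: serialized by the GIL

with ThreadPoolExecutor(max_workers=4) as pool:
    documents = list(pool.map(fetch_and_parse, URLS))
```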
Hard to Profile
A more seasoned engineer might point out that I should have profiled this code before trying to “optimize” it. However, Python only recently gained the ability to integrate with perf. Unfortunately, the implementation creates a new, PID-named file, at an unconfigurable location, each time the process starts. In a fork-based concurrency world, that’s a lot of PIDs. And because these perf files aren’t small, you run the risk of maxing out the disk of the server you are profiling on. Secondly, these forked processes flare into, and out of, existence quickly (i.e. within seconds), so it’s hard to catch them in the act of doing anything. A long-lived process would be much easier to observe.
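For reference, here’s roughly how the perf integration gets turned on (Python 3.12+, Linux only); the workload and the perf invocation below are placeholders. Each process typically writes its own /tmp/perf-&lt;pid&gt;.map file, which is the per-PID, fixed-location behavior described above.

```python
# A sketch: enable the perf trampoline from inside the program. The same
# thing can be done externally with `python -X perf` or PYTHONPERFSUPPORT=1.
import sys

if sys.platform == "linux" and hasattr(sys, "activate_stack_trampoline"):
    sys.activate_stack_trampoline("perf")  # perf can now see Python frames

# ... run the workload, then attach with e.g. `perf record -F 99 -g -p <pid>`
```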
And Still Hard to Parallelize?
When I replaced my ThreadPoolExecutor with a ProcessPoolExecutor, this problem reared its head again. Because the pool’s worker processes aren’t tied to particular tasks, it’s hard to identify which processes to profile; tracking down all the PIDs associated with my pool is just as tricky as before. Secondly, switching from ThreadPoolExecutor to ProcessPoolExecutor is not straightforward. All the functions and arguments now need to be pickle-able, meaning things like references to class methods no longer work.
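The pickling constraint looks something like this (a toy illustration, not the original code): a plain module-level function submits fine, while something like a lambda fails because it can’t be pickled.

```python
import json
from concurrent.futures import ProcessPoolExecutor

def parse(raw: str) -> dict:  # module-level, so it can be pickled
    return json.loads(raw)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(pool.submit(parse, '{"a": 1}').result())  # works: {'a': 1}
        try:
            pool.submit(lambda raw: json.loads(raw), '{"a": 1}').result()
        except Exception as exc:  # typically a PicklingError
            print(f"lambda was rejected: {exc!r}")
```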
Parallel, Profile-able Python
Python 3.14 adds a new module and APIs for creating sub-interpreters (e.g. InterpreterPoolExecutor). Significant work has gone into CPython to make the interpreter state a thread-local, meaning it’s possible to run multiple “Pythons” in the same process. This helps us a lot because it means we can get the parallelism we want without the system overhead of running multiple processes (there’s a short sketch after the list below). Specifically:
- There’s no overhead of starting up multiple processes. The workers all share one process’s page tables, signal handlers, file descriptors, and so on.
- PIDs are way more stable. Sub-interpreters run in threads, so they all share the PID of the parent process instead of each getting their own.
- Memory sharing is (or will be) easier. Rather than having to convert Python objects in one interpreter to a serialized (cough Pickle cough) form, it will be much easier to synchronize with other workers. (Also, shout out to Ray, which has done the hard work to make this kind of sharing a lot easier.)
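Here’s a minimal sketch of the same fetch-and-parse work on top of InterpreterPoolExecutor (Python 3.14+). The URLs and the fetcher module are hypothetical, and since tasks are handed to the workers in serialized form, the sketch assumes the worker function lives in an importable module rather than in __main__.

```python
from concurrent.futures import InterpreterPoolExecutor

# Hypothetical module containing the fetch_and_parse() function from the
# earlier ThreadPoolExecutor sketch.
from fetcher import fetch_and_parse

URLS = [f"https://example.com/feed-{i}.json" for i in range(4)]  # hypothetical

if __name__ == "__main__":
    # Same Executor API, but each worker is a thread running its own
    # sub-interpreter (with its own GIL), so the CPU-heavy JSON parsing can
    # run in parallel inside a single, stable-PID process.
    with InterpreterPoolExecutor(max_workers=4) as pool:
        documents = list(pool.map(fetch_and_parse, URLS))
```

Profiling also gets simpler: there is one PID to attach perf to, and the workers stick around for the life of the pool.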
The multiple-runtimes-in-one-process model is not new; the most notable example is NodeJS. But it is a very welcome addition to Python. Given the amazing improvements around GIL removal and the new JIT in Python 3.13, Python is becoming a much more workable language for server development.