We have outgrown the Process model
There are a lot of fucked up things in the state of computing, programming and the industries around them nowadays. Contrary to what everyone wants to believe, there’s no single cause to point to for almost any of them. But as I was thinking about a particular group of issues, I did some digging and realised that the Process model has contributed greatly to them, so I want to talk about that in this post. I’ll offer some perspectives on how we’ve outgrown the Process model and on what seems to be the biggest obstacle to finding something different.
When I talk about the Process model, I’m referring to how the popular operating systems of our time run user code: they take an executable file, carve out a piece of memory to load it into, and start executing its code from an entry point. A process is initially “pure” by design: the code can’t affect the environment it’s executing in on its own; it has to ask the operating system to do that (usually through a system call). The operating system manages multiple processes, enforces that their memory and resources are isolated (unless they ask to share things, and the operating system allows it), and ensures the processor(s) execute the code in each process in a fair way (for many definitions of fair).
To keep it focused, this post assumes at least some foundational knowledge of operating systems.
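To make that concrete, here’s a minimal sketch (in C, assuming a POSIX system such as Linux) of the model from the user-code side: one process asks the operating system to create another process from an executable file, the kernel loads that file into its own isolated environment, and the parent waits for it to run to completion.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // Ask the operating system to create a new process.
    pid_t child = fork();
    if (child == -1) {
        perror("fork");
        return EXIT_FAILURE;
    }

    if (child == 0) {
        // In the child: replace this process image with another executable.
        // The kernel loads the file, sets up an isolated address space and
        // jumps to its entry point.
        char *const argv[] = {"/bin/ls", "-l", NULL};
        execv("/bin/ls", argv);
        perror("execv");     // only reached if execv fails
        _exit(EXIT_FAILURE);
    }

    // In the parent: wait for the child process to run to completion.
    int status = 0;
    if (waitpid(child, &status, 0) == -1) {
        perror("waitpid");
        return EXIT_FAILURE;
    }
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return EXIT_SUCCESS;
}
```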
Why is the Process model like this?
The early digital computers of the 1940s and 1950s ran a single program, which was given to the computer in various physical forms: patch cables, switches, punched tapes and cards. Those programs were often used to do some math or equivalent work that was almost pure computation.
As computers got faster, gained more memory and acquired peripherals, the programs people wrote grew in size and complexity (no longer just math calculations), but physically feeding programs into the machine became a bottleneck.
At this point, input was usually done through punched tapes and cards. Computers essentially ran a single program at a time, and thus only had “one user” active at any point, even though they were shared among teams of researchers/programmers (they were expensive machines). Precious time was spent preparing programs to run. Some computers gained rudimentary operating systems [1] to run programs in a queue, starting the next program as soon as the current one finished, in an attempt to use the processor efficiently.
It was in the 1960s that people came up with ways of making computers run multiple programs at the same time and/or allow multiple users to use the same computer at the same time. This is where the Process model starts to show up.
Before this, programs could assume control of peripheral and system resources, since it was guaranteed that the machine would only execute one program at a time. This became a challenge when running multiple programs, so operating systems started to gain more attention and features to prevent programs from messing with the state of others.
The way those operating systems made multiple programs run at the same time was to “simulate” each program running as if it were the only program on the computer, which meant that, in theory, nobody had to rewrite existing programs. By then there was definitely already an industry around computing, and I assume the same sort of pressure to reduce costs and ship things fast.
A process was just like a batch job: something a user would start, let run to completion, and then collect results from.
Later operating systems kept the same concepts, adding a few more things to this model (e.g. virtual memory, threads), and this is pretty much what we’ve had to work with since then. We’re still programming as if we were doing batch processing; we just have way more garnishes around it.
A design flaw
The Process model has carried a flaw since its inception: it shows each program an isolated, sterile environment. This was a necessary workaround to make the existing programs of the time run on the new operating systems without requiring changes, but we’ve long since outgrown it.
Many programs we write today expect to communicate and collaborate with other programs, but our model still enforces this isolated environment, so operating systems give us workarounds by letting code request communication and sharing at runtime. The price is that every programmer has to figure out how they want to allow this communication to happen and/or how to connect to other programs.
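To make that cost concrete: even the simplest collaboration between two pieces of code has to be negotiated with the kernel at runtime. Here’s a minimal POSIX sketch of a parent and child sharing data through a pipe they explicitly asked the kernel for:

```c
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    // Explicitly ask the kernel for a communication channel;
    // nothing is shared between processes by default.
    if (pipe(fds) == -1) {
        perror("pipe");
        return 1;
    }

    if (fork() == 0) {
        // Child: write a message into its end of the pipe.
        close(fds[0]);
        const char *msg = "hello from the child";
        write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(0);
    }

    // Parent: read whatever the child sent.
    close(fds[1]);
    char buf[64] = {0};
    ssize_t n = read(fds[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        printf("received: %s\n", buf);
    }
    close(fds[0]);
    wait(NULL);
    return 0;
}
```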
Because of the sterile environment a program runs in, it needs to bring along all the functionality it requires to run. Operating systems tried to help by supporting some form of shared code as a workaround, but that comes with its own can of worms. Shared libraries definitely leave a lot to be desired, so people either give up on using them (with static compilation) or reach for workarounds like containers (ironically, a workaround for other workarounds). But none of that helps processes take advantage of extra functionality that already exists on a system, something shared libraries never managed to solve (the very early code sharing features of Multics seem much closer to this ideal than later systems). Every process has to bring whatever it needs, regardless of how it does that.
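Shared libraries are the closest thing most systems offer to reusing functionality that already lives on the system, and even then the process has to load and resolve the code itself. A rough sketch using POSIX dlopen (libm.so.6 and cos exist on typical Linux systems, but the library path and symbol names are platform-dependent):

```c
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    // Ask the dynamic loader for code that already exists on the system.
    // "libm.so.6" is the math library on most Linux systems; the name
    // varies between platforms.
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    // Resolve a symbol by name and hope the signature matches.
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    if (!cosine) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    printf("cos(0.0) = %f\n", cosine(0.0));
    dlclose(handle);
    return 0;
}
```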
We all keep making the same decisions and writing the same things over and over because of this flaw.
Threads and I/O
Processing time and I/O are two major bottlenecks that keep showing up throughout the history of digital computers. Early programs, focused on calculations, ran from minutes to hours. Later on, with programs that worked more like services, bottlenecks with peripherals and I/O started to appear. Processors became fast enough to compute things quickly, but because the Process model forces a program to run serially, some programs became bottlenecked on I/O instead.
Making processors with faster clocks is a straightforward way to fix the processing time bottleneck. However, we started to hit physical limits trying to do that, and we still wanted programs to run faster. The best solution we came up with was to have multiple processing units, allowing a program to “split” into a few parts and run each part on a different unit. This was kind of already a thing back in the 1960s, but the concept of threads as we know them today only took concrete shape decades later. Since then, in our quest for faster execution, we’ve gone through event loops, coroutines, async code and more, but all of those use threads as a foundation, since that’s the concurrency feature the operating system provides to a process.
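Here’s a rough sketch of that “split into parts” idea using POSIX threads, summing two halves of an array that the OS may schedule on different processing units:

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static long data[N];

struct slice {
    long start, end, sum;
};

// Each thread sums its own slice of the array.
static void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->sum = 0;
    for (long i = s->start; i < s->end; i++) {
        s->sum += data[i];
    }
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) {
        data[i] = i;
    }

    // Split the work into two parts; the OS may schedule each thread
    // on a different processor core.
    struct slice halves[2] = {{0, N / 2, 0}, {N / 2, N, 0}};
    pthread_t threads[2];
    for (int i = 0; i < 2; i++) {
        pthread_create(&threads[i], NULL, sum_slice, &halves[i]);
    }
    for (int i = 0; i < 2; i++) {
        pthread_join(threads[i], NULL);
    }

    printf("total = %ld\n", halves[0].sum + halves[1].sum);
    return 0;
}
```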
Threads helped alleviate the I/O bottleneck by letting some thread(s) continue computing while other(s) performed I/O, but this was less a real fix and more a workaround. Operating systems eventually introduced further workarounds to speed up I/O, letting a process offload the actual I/O work out of the program and into the OS itself (what we now call asynchronous I/O).
We’ve since been finding all sorts of ways to offload slow work to the operating system and out of a process, because that’s the only way we found to actually do more things and still follow the Process model.
From the select system call to epoll to io_uring, we’ve gone through multiple iterations of “please do this, but not inside the process because I also want to compute more things”.
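As one concrete point on that progression, here’s a minimal Linux-only sketch of the epoll flavour: the process registers file descriptors with the kernel and asks it which ones are ready, instead of blocking on each one in turn (the example watches stdin, but a server would register sockets):

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    // Ask the kernel for an epoll instance that will track readiness for us.
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        perror("epoll_create1");
        return 1;
    }

    // Register stdin; a real server would register its sockets instead.
    struct epoll_event ev = {.events = EPOLLIN, .data.fd = STDIN_FILENO};
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) == -1) {
        perror("epoll_ctl");
        return 1;
    }

    // Block until the kernel says something is readable; the process
    // does no polling work of its own in the meantime.
    struct epoll_event ready[8];
    int n = epoll_wait(epfd, ready, 8, -1);
    for (int i = 0; i < n; i++) {
        char buf[256];
        ssize_t len = read(ready[i].data.fd, buf, sizeof(buf));
        printf("fd %d is ready, read %zd bytes\n", ready[i].data.fd, len);
    }

    close(epfd);
    return 0;
}
```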
Running programs continuously
As we started to write programs to record, manipulate and query data (early databases), we broke an assumption of the Process model: that a program runs until completion. We moved from one-shot batch processing to services that were always available to receive and send data.
However, any code failure completely destroys the process, which means programmers (with a bit of help from operating systems) had to find ways to restart processes when they died, or ways to prevent code from crashing the process at the operating system level (usually both). And because a process always starts in a fresh new environment, programmers also had to ensure important state was persisted in some way (which means doing I/O, which links back to the earlier issues with I/O in processes).
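The usual answer to the restart problem is some kind of supervisor: another process whose only job is to watch the service and start it again whenever it dies. A minimal sketch of that pattern in C (./my-service is a hypothetical placeholder for the supervised program):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    for (;;) {
        pid_t child = fork();
        if (child == -1) {
            perror("fork");
            return EXIT_FAILURE;
        }

        if (child == 0) {
            // Run the actual service; "./my-service" is a placeholder path.
            execl("./my-service", "my-service", (char *)NULL);
            perror("execl");
            _exit(EXIT_FAILURE);
        }

        // Wait for the service to die (crash or exit), then restart it.
        int status = 0;
        waitpid(child, &status, 0);
        fprintf(stderr, "service exited (status %d), restarting...\n", status);
        sleep(1); // avoid a tight restart loop
    }
}
```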
As soon as we started connecting machines over a network, our programs began running on different machines and communicating with each other. With that came the need to coordinate programs (either the same program or different ones) running on multiple machines, and, once again, to figure out how to persist state when a process inevitably dies.
The Process model offers essentially no help at all with this, and operating systems have barely provided workarounds, so we had to build our own solutions, which means working around the Process model (at least) twice. Virtualization and containerization are often also constrained by the Process model.
Doing something different is hard
I hope by now you’re at least somewhat convinced that the Process model is working against a lot of the stuff we want to do with our code.
Most of the attempts to come up with different models that I’ve seen work at the level of a programming language. I think one of the most popular (which I also like) is Erlang and the BEAM. These efforts still happen on top of current operating systems, and so they’re ultimately also constrained by the Process model. The Process is an abstraction, and by definition it is leaky.
To do something really different, we need to reinvent things at the operating system layer. But reinventing the operating system is not that simple.
Throughout history, the development of processor and operating system features has often been intertwined. When new hardware features show up, the operating system usually needs new code to expose those features in some way to user code. And some things that operating systems historically tried to do entirely in software were only properly realised once hardware support for them arrived.
The problem with this is that processors now expect operating systems to behave in specific ways, which means that anyone who wants to come up with a completely different model for an operating system either won’t be able to run it on existing processors or will find it difficult to do so. The abstraction leaks again.
This is why I think building something different from the Process model is hard.
Every time I see some new OS project show up on the Internet, I get a bit excited that it’ll have a completely different design, but so far they all have roughly the same shape. The Process model’s limitations can be recognised in all of them. To reach a design that’s truly novel, we’ll also need to design our own processors.
When I reached this conclusion, I was demotivated for a while, but it isn’t as bad as it sounds. Current processors are pretty powerful, but at this point they’re mostly solving problems that they created for themselves. For example, the many issues with instruction pipelining and speculative execution exist only because of the design of current processors, with their behemoth processing units that can do everything and are used to run a lot of different code concurrently.
If you’re inventing a new design, you’re not constrained by any of these issues until you choose to make the same decisions. Imagine a new processor that is extremely parallel by design and doesn’t allow multiple programs to reuse the same processing unit. It sidesteps many of the difficult things existing processors have to do to work well. It will also inevitably find its own share of issues to address, but hopefully the design space unlocked by a novel processor architecture will outweigh any new issues it brings.
This is something that’s been inspiring me to explore different ideas, and is why I’ve been experimenting with designing a new kind of processor and operating system (almost) from scratch.
Footnotes
[1] See for example https://en.wikipedia.org/wiki/GM-NAA_I/O and https://en.wikipedia.org/wiki/BESYS