Workflow engines, Kubernetes, and WebAssembly
Table of contents
- Why workflow engines?
- Workflow engines want to be fault-tolerant and persistent
- What is really executed?
- Workflow definition as code
- The ideal workflow engine
- WebAssembly
- A tricky use case
Why workflow engines?
Because we like to automate things, duh.
We have computers to do work for us. We usually want to take advantage of that and automate as much of our work as possible. If there’s some work that always follows a specific pattern, we’ll transform that into a workflow and let the computer have at it.
We could write each workflow as its own code, its own binary that runs somewhere, but that’s not really practical, so we go one layer up and write workflow engines to run the things we want to automate. A considerable portion of software is all about automating workflows, almost always because at some point a human will benefit from it.
This is where these workflow engines, job systems, task schedulers, automation orchestrators, and others come in. It’s how we provide an abstraction to everyone who wants to automate work.
There’s a wide range of work that ends up automated: sending an email, generating a report, paying bills, charging credit cards, processing images or videos, and so on. Even though I group all of this as work to be automated, the layer at which the automation happens will depend on the nature of the work, from the highest levels of user-defined automation (using a service like Zapier) to the lowest levels of computer process automation (using an operating system). When looking through this lens, even a good part of the software that we write is just encoding a workflow in some sense, even if it’s something as small as “retry an HTTP request until it succeeds”.
I’m using the term workflow engine to refer to all of these systems for work automation, even though most people nowadays wouldn’t consider all of them to be workflow engines. I haven’t found a better term to describe this category of systems that we use to automate work, and I think that term fits well. I hope that by writing about this, I’ll manage to convince readers that we can look at these systems as belonging to the same category, even though on the surface we may see them as different.
Workflow engines want to be fault-tolerant and persistent
Simple engines provide no fault tolerance at all, so if a workflow fails while it’s being executed, the entire state is gone. If the entire engine crashes or is shut down, the queued and executing workflows are also gone, and won’t be restarted when the engine comes back up. This is usually no good, but ironically, it’s how operating system processes work: the processes themselves have to provide persistency and fault tolerance if they want them.
Almost every other workflow engine is trying to abstract away the operating system and the details of where code is actually running, which translates to a need to provide some sort of fault tolerance. Persistency is an important attribute to achieve fault tolerance — the workflows need to save their state somewhere, so it can be retrieved once the issues are gone and the engine is restarted.
However, persistency is a tricky term, and there are multiple levels of persistency offered by workflow engines. Some engines guarantee persistency at the workflow level, meaning that they guarantee a workflow will run to completion. In practical terms, if it fails at any point, the workflow will be retried from the beginning as many times as configured to do so, which means you have to be careful about operations that can’t or shouldn’t run more than once. Other engines guarantee persistency at the step level (that’s what I’m calling them due to my history with Step Functions), which is more granular than the entire workflow, allowing a workflow to be picked up from where it failed, essentially reusing the work it had already done successfully. Arguably, persistency at the step level is more efficient and provides a better experience: retries are faster, and people only need to worry about making specific steps idempotent (plenty of operations in workflows will already be idempotent by default). Because of that, it’s the more desirable property for a workflow engine to have.
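To make the difference concrete, here’s a minimal sketch of step-level persistency in Python. All the names are made up, and the in-memory dict stands in for the engine’s database:

```python
import json

# Two steps of a workflow; the stubs stand in for real work.
def charge_card(order):
    return {"charge_id": f"ch-{order['id']}"}

def send_receipt(charge):
    return {"sent": True, "charge_id": charge["charge_id"]}

class StepCheckpointer:
    """Step-level persistency: completed step results are checkpointed."""
    def __init__(self):
        self.completed = {}  # (workflow_id, step_name) -> serialized result

    def run_step(self, workflow_id, step_name, fn, *args):
        key = (workflow_id, step_name)
        if key in self.completed:
            # This step already ran to completion on a previous attempt.
            return json.loads(self.completed[key])
        result = fn(*args)
        self.completed[key] = json.dumps(result)  # persist before moving on
        return result

def run_workflow(engine, workflow_id, order):
    # With workflow-level persistency, a retry reruns this whole function
    # from the top; with step-level checkpoints, only unfinished steps run.
    charge = engine.run_step(workflow_id, "charge_card", charge_card, order)
    return engine.run_step(workflow_id, "send_receipt", send_receipt, charge)

engine = StepCheckpointer()
print(run_workflow(engine, "wf-1", {"id": 7}))
print(run_workflow(engine, "wf-1", {"id": 7}))  # a "retry" replays checkpoints
```

Workflow-level persistency is what you get if you delete the checkpoint lookup: every retry starts again from charge_card.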
Sidekiq and AWS Lambda (and equivalents in other clouds) are examples of systems that offer persistency at the workflow level. AWS Simple Workflow, AWS Step Functions (and equivalents in other clouds), Netflix’s Conductor, and Uber’s Cadence are examples of systems that offer persistency at the step level.
For most engines, persistency is implemented with a database. At the workflow level, this is somewhat straightforward: persist the input and a unique identifier to the code/workflow to run. This isn’t as clear when it comes to the step level, because there is no agreement on what exactly a step is. Each engine deals with that individually, because this step level serves more like a logical separation, which brings us to the next question…
What is really executed?
Workflow engines will execute something, which can range from a binary sitting on disk, to a piece of code written in a programming language, to higher-level descriptions of work in a domain-specific language (in these cases, the engine acts as an interpreter, and a part of the actual code being run is the engine’s, not yours).
Usually, a step is more like a logical separation, which will contain a description of work to do, including extra information like how many times to retry, what to do when all retries fail, and so on.
AWS Step Functions tried to standardise this higher-level description a bit with the Amazon States Language (ASL). The ASL works if an engine views workflows as state machines. There are engines that see workflows as something more complicated (e.g. DAGs), and others that have a more simplified view (e.g. a linear sequence), so each engine ends up with its own unique ways of describing work.
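As a taste of what these descriptions look like, here’s a minimal ASL state machine with two Task states and a retry policy (the Lambda ARNs are placeholders):

```json
{
  "StartAt": "ChargeCard",
  "States": {
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:charge-card",
      "Retry": [
        { "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3 }
      ],
      "Next": "SendReceipt"
    },
    "SendReceipt": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:send-receipt",
      "End": true
    }
  }
}
```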
The important bit is that all these engines went a level of abstraction higher — introducing a language to describe work — to solve the problem of not being able to directly run code. This introduces friction, since people have to figure out how to break their workflows into steps, describe the workflow as a sequence of those steps, and hand that description to the engine. It usually results in gigantic JSON/YAML files, adding to the friction by making things harder to read, since so much in those files is engine fluff. As far as I know, this is the majority of the experience of using workflow engines out there.
Workflow definition as code
Inspired by Infrastructure as Code tooling, in the past I worked on my own tool to bring some sanity back to workflow definition languages. Instead of learning a particular flavour of the JSON/YAML needed for a workflow engine, my goal was to let people write regular Python code, and compile that code into a workflow definition language, using the engine’s infrastructure (e.g. control flow, data manipulation) when applicable, and packaging “custom” code to run as a step when needed.
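To illustrate the authoring experience I was after, here’s a sketch. The decorator and the mapping comments are hypothetical, not the actual project’s API:

```python
# Stubs standing in for real steps.
def charge_card(order):
    return {"declined": False, "id": order["id"]}

def send_receipt(charge):
    return {"receipt_for": charge["id"]}

def notify_failure(order):
    return {"failed": order["id"]}

def compile_to_workflow(fn):
    # A real implementation would walk fn's AST, map its control flow onto
    # the engine's primitives (Choice states, Retry policies, ...), and
    # package any leftover code as task steps. Here it's just a marker.
    fn.is_workflow = True
    return fn

@compile_to_workflow
def handle_order(order):
    charge = charge_card(order)       # would become a Task state
    if charge["declined"]:            # would become a Choice state
        return notify_failure(order)  # terminal Task state
    return send_receipt(charge)       # terminal Task state

print(handle_order({"id": 7}))
```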
At the time, I thought this was great — finally we’d be able to simplify our workflows! No more slow development cycles of altering JSON/YAML, reuploading it to the workflow engine for testing, hoping that we got things right this time, and, with workflow engines running in the cloud, usually paying for every test run.
Not only was this a surprisingly hard task (I was building a compiler, after all), but I also found little support from the environment I was working in. Years later, I learned that someone else had the same idea and started work on a similar project, using the same approach and the same programming language, and they made roughly the same amount of progress that I did before abandoning the project too (when this post was written, the last commit on that project had been made roughly 3 years earlier).
Only some time later did I start questioning myself: why are we even doing this? Why add yet another layer on top of the existing workflow engines to try to make up for their flawed design? Trying to describe work in a higher-level language is a solution at the wrong level of abstraction. In fact, all these workflow engines are working at the wrong level of abstraction.
As I said before, even an operating system works as a workflow engine, albeit a simple one that makes running code persistently and fault-tolerantly far too complicated. It’s because of these flaws in the OS that we started adding all sorts of layers on top of it, so why not work on a better workflow engine instead?
The ideal workflow engine
The ideal solution would be to just write some code and execute it in a workflow engine, which would be able to reliably persist that code execution, break that code down into smaller steps, and know which steps can/should rerun in case of failures. This would eliminate the need for workflow engines that work at different levels of abstraction; it might as well just be part of the operating system layer.
Since workflow engines are usually written separately from the workflow code, and receive the code purely as an external object to be executed, it’s hard to define and separate steps from the code. However, if the engine manipulates the code/binary given to it, or is embedded into the code/binary as a library, this becomes possible. With some restructuring and helpful hints to whatever compiler/interpreter we’re using, this could allow the workflow engine to identify boundaries of interest in the code, and naturally derive steps from them. From there, we’d have to figure out which steps are idempotent and which are not, and then figure out a way to persist non-idempotent steps.
Determining steps
Around the time I was working on that workflow-definition-as-code project, I started thinking about how to figure out workflow steps from code, until it dawned on me that we already do this: we organise code into functions, so why not use functions as the boundary of interest? This would eliminate the need for hints in the code.
Obviously, this would mean the compiler/interpreter has to figure out which functions are idempotent and which are not, which is hard to do automatically, so perhaps we could instead use hints to signal whether a function is idempotent.
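A minimal sketch of what such hints could look like in Python. Nothing here is a real engine API; the point is only that the hint travels with the function, so an engine or compiler can decide which results must be persisted:

```python
import functools

def step(idempotent):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.idempotent = idempotent  # the hint, attached to the function
        return wrapper
    return decorate

@step(idempotent=True)
def render_invoice(order):
    # Safe to rerun: pure computation.
    return f"Invoice for {order}"

@step(idempotent=False)
def email_invoice(invoice):
    # Rerunning this sends a duplicate email.
    print(f"sending: {invoice}")

# The engine could now checkpoint only after non-idempotent steps:
for fn in (render_invoice, email_invoice):
    print(fn.__name__, "idempotent:", fn.idempotent)
```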
Defining a workflow as a sequence of steps is an important property for engines that see workflow code as an external artifact. We’re now thinking about the engine possibly manipulating the code it’s given, but we’re still being coerced into figuring out how to separate steps in the code. Since we’re dropping the assumption that workflow code is external, can we also rethink whether we need steps? Why even break a workflow into steps to persist each one?
Instead of using functions as boundaries of interest and requiring programmers to annotate them regarding their idempotency, can we dive into the actual operations the code is doing, and determine which operations are not idempotent and which ones are?
If we can do this, what remains is figuring out how to make non-idempotent code idempotent. This means that the persistency of the execution needs to be idempotent as well. We have to make sure those specific bits of code will produce the same effects no matter the scenario, whether it’s the first time running or the 100th.
Can we make everything idempotent?
Code starts in a pure state, which means there’s no way for it to have knowledge of anything external to it right from the beginning. This makes it idempotent! It’s only when the code requests things from the OS (e.g. opening a file, making a network request, starting another process) that non-idempotency is introduced. The things provided by the environment are non-idempotent. This is a super cool insight, and also one of those duh moments.
Almost every operation can be made idempotent by wrapping the non-idempotent bit into an idempotent operation, but in practice things get harder partly because wrapping the code changes its properties (e.g. it will run slower than the original) and introduces different failure modes, and partly because the OS can only guarantee properties about the operations requested by the code up to a certain point, which is insufficient in some cases to be sure whether an operation succeeded.
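Here’s a minimal sketch of the wrapping idea: record each non-idempotent effect’s result in a journal keyed by a deterministic sequence number, and replay recorded results on re-execution. The dict stands in for durable storage, and the caveat from above still applies: crash between performing the effect and recording it, and we can’t tell whether the effect happened.

```python
class EffectJournal:
    def __init__(self):
        self.records = {}
        self.seq = 0

    def perform(self, effect, *args):
        self.seq += 1
        if self.seq in self.records:
            return self.records[self.seq]   # replay: don't redo the effect
        result = effect(*args)
        self.records[self.seq] = result     # persist result before continuing
        return result

    def restart(self):
        self.seq = 0                        # simulate re-running from scratch

journal = EffectJournal()
print(journal.perform(lambda: "side effect happened"))
journal.restart()
print(journal.perform(lambda: "this never runs"))  # replays the first result
```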
Throughout the thinking I’ve done so far, I have identified two main external sources that require an “idempotency treatment”: time and the network.
The issue with time
It’s not uncommon for workflows to wait some time as part of an execution. There’s a lot of value in automating things that require waiting time because machines are much better at keeping track of time and remembering to do something after the specified amount of time has passed. It’s important to persist the execution after the wait is done to ensure the correctness of the workflow, and optionally before the wait for efficiency.
However, there are workflows that require the current time and manipulate it or use it for control flow or other logic. In general, it’s impossible to guarantee idempotency and correctness for these workflows. It might be possible to ensure the same time is reused for later executions, but things become especially problematic if the time is eventually sent to some other external source.
To achieve idempotency, the creator of the workflow needs to know about this problem in advance and decide what to do in all edge cases that come from using the current time. The engine might also need to provide extra infrastructure to support the workflow in those edge cases.
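One shape that extra infrastructure could take, sketched under the assumption that the engine persists every clock reading: the first execution records each reading, and re-executions replay the recorded values in order, so control flow that depends on the current time stays deterministic.

```python
import time

class ReplayClock:
    def __init__(self, recorded=None):
        self.recorded = list(recorded) if recorded else []
        self.cursor = 0

    def now(self):
        if self.cursor < len(self.recorded):
            value = self.recorded[self.cursor]  # replaying a previous run
        else:
            value = time.time()                 # first run: record it
            self.recorded.append(value)
        self.cursor += 1
        return value

clock = ReplayClock()
first = clock.now()

# A retried execution is handed the recorded readings and sees the same time:
retry = ReplayClock(recorded=clock.recorded)
assert retry.now() == first
```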
The issue with the network
What happens when the workflow sends a network request, but the machine running the workflow code becomes unavailable (shutdown, crash, hardware issue, whatever)? Or when the network is unreliable and the packets never make it to their intended destinations?
It is possible to solve these issues if both communicating parties agree on a protocol to help (e.g. use a deterministic id on every request), but just like time, in the general case it’s impossible to guarantee idempotency. We can do a best-effort attempt at idempotency, but there are still edge cases that will have to be treated directly by whoever’s creating the workflow.
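A sketch of the “deterministic id on every request” protocol: derive the id from the workflow’s identity and the request’s position in the execution, so every retry presents the same id and the other party can deduplicate. The “Idempotency-Key” header is a convention some APIs use (Stripe’s, for example), not something every service supports.

```python
import hashlib

def idempotency_key(workflow_id: str, seq: int) -> str:
    # Same workflow + same request position -> same key on every retry.
    raw = f"{workflow_id}:{seq}".encode()
    return hashlib.sha256(raw).hexdigest()

headers = {"Idempotency-Key": idempotency_key("order-workflow-42", seq=3)}
print(headers)  # identical on every retry of this request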
I mentioned that time and network are two main external sources that need special treatment, but there’s kind of a third one that some may consider another main source: disk and similar persistency devices (e.g. SSDs). I’m grouping these together with network, since their failure modes are pretty much a subset of the network’s. And considering the current trend of separating disk/persistency into its own (web) service (e.g. databases have already gone this way, but also AWS’s S3 and similar services), it makes sense to keep everything under the same category.
Persisting the execution
Once we know when to persist an execution, how do we do that?
Since we’re running any kind of code, we need to persist the entire state of the execution, which includes register values and used memory. However, there are some things that may not make sense to persist: usually file descriptors, bits of shared memory, and other handles to things provided by the operating system. The things that link the code to the non-idempotency around it.
We can figure out which things are external handles because we know those will show up whenever code requests things from the operating system, so tracking the result of the OS operation is a relatively solved problem. And since we’re making the workflow engine run at the level of the OS, it also has this information.
This information is essential when picking up an execution from the persisted data: it tells the OS/machine what resources it will need to create/retrieve and make available to the code again before continuing to run the code. This is easier said than done, but I believe it’s completely doable with the correct infrastructure.
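A sketch of what the persisted state could look like, with all names made up: registers and memory are persisted as-is, while external handles are persisted only as descriptions of how to recreate them, since the handles themselves are meaningless on another machine.

```python
from dataclasses import dataclass, field

@dataclass
class HandleDescriptor:
    kind: str   # e.g. "tcp", "volume"
    spec: dict  # enough information to recreate the resource

@dataclass
class ExecutionSnapshot:
    registers: dict
    memory: bytes
    handles: list[HandleDescriptor] = field(default_factory=list)

def restore(snapshot: ExecutionSnapshot):
    # Before resuming, recreate each external resource the code had open.
    for h in snapshot.handles:
        if h.kind == "tcp":
            print(f"reconnect to {h.spec['addr']}")  # re-dial the peer
        elif h.kind == "volume":
            print(f"remount {h.spec['claim']}")      # reattach the storage
    # ...then load registers/memory and continue execution.

snap = ExecutionSnapshot(
    registers={"pc": 0x40_0000},
    memory=b"\x00" * 64,
    handles=[HandleDescriptor("tcp", {"addr": "10.0.0.5:443"})],
)
restore(snap)
```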
To support this, I’ll briefly discuss two things from Kubernetes.
Accessing storage from Kubernetes
Kubernetes provides access to storage through volumes, and these are almost all decoupled from the actual machine running the code, which means that Kubernetes has to make some preparations to access the storage abstracted by a volume, and then provide that storage to a container.
The ideal workflow engine could do the same thing, going further and disallowing any local files at all (Kubernetes still has support for those, but it explicitly states this is potentially a very bad idea). Since the storage is decoupled from the machines running code, any machine could access the storage used by the workflow execution.
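For reference, this is roughly what the decoupling looks like in Kubernetes terms: the pod references a PersistentVolumeClaim, and Kubernetes attaches the underlying storage to whichever node runs the pod (all names here are placeholders).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workflow-execution
spec:
  containers:
    - name: workflow
      image: example.com/workflow:latest
      volumeMounts:
        - name: workflow-state
          mountPath: /state
  volumes:
    - name: workflow-state
      persistentVolumeClaim:
        claimName: workflow-state-claim
```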
For a “local file” workaround, workflow code can just keep things in memory. This is the same approach that Phantom OS takes with files.
Service routing in Kubernetes
Kubernetes allows multiple machines to route traffic between each other. It also supports a load balancer doing this job, but in case a specific machine directly receives a request for a service, that machine can route that traffic to some other machine that is a better fit for handling the request, essentially serving as an impromptu load balancer.
The same approach could be used in some catastrophic scenarios, guaranteeing that whatever new machine ends up reviving the workflow execution will still be able to use the resources that existed on the previous machine.
This only helps in rare cases, because most catastrophic scenarios involve the initial machine shutting down, which means access to its resources will be lost anyway, but it’s a neat trick to keep in mind.
Realistically, losing a network connection isn’t too big of a deal because most software is already built with assumptions that network connections are not 100% reliable, sockets can close, packets can be lost, and so on. Many protocols reflect this assumption by forcing the relevant state to be held at the application level, which means a connection might be lost and retried, and everything would still work.
WebAssembly
After describing (admittedly at a very high level) the ideal workflow engine above, I’d like to point to a technology that I think can make all of this possible in a much easier way. After all, some of what I described — tracking external handles, persisting registers and memory — sounds complex and full of numerous little edge cases from the decades of features and workarounds encrusted into any major OS at this point.
After I started playing with Kubernetes, by chance I got to read about WebAssembly again, and I realised that WebAssembly fits a lot of the needs of this workflow engine, because from a certain perspective, WebAssembly gives you the infrastructure to create your own OS.
Its virtual machine specification forces a pure (i.e. idempotent) environment by default, unless the host running the WebAssembly code defines non-idempotent functions that can be used. This means the workflow engine can be completely in control of all idempotent code, and also completely in control of any external resources that it provides to workflow execution code, which are the non-idempotent bits.
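A minimal sketch of that control, using the wasmtime Python bindings: the module below has no way to read the clock except through the "env.now" import the host chooses to provide, which is exactly the door a workflow engine could journal and replay (the recording itself is elided here).

```python
from wasmtime import Engine, Store, Module, Instance, Func, FuncType, ValType

wat = """
(module
  (import "env" "now" (func $now (result i64)))
  (func (export "run") (result i64)
    call $now))
"""

engine = Engine()
store = Store(engine)
module = Module(engine, wat)

def host_now() -> int:
    # The one non-idempotent door into this sandbox. A workflow engine would
    # record this value on the first run and replay it on retries.
    return 1_234_567_890

now_func = Func(store, FuncType([], [ValType.i64()]), host_now)
instance = Instance(store, module, [now_func])
run = instance.exports(store)["run"]
print(run(store))  # prints whatever the host decided "now" is
```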
Additionally, there are already many programming languages supporting it at some level, and a bunch of people putting effort into improving things, which means that (hopefully) the developer experience and the capabilities of the virtual machine will only get better in the future. This makes it a very compelling technology for a workflow engine/OS.
There are things which will require effort to be solved, but at least for now, it looks like this engine might be very possible. Heck, I might even start playing with this at some point and see what comes out of it.
A tricky use case
Everything that I wrote here, and many of the things I thought about, had the assumption that at some point the workflow execution ends. This assumption excludes workflows like web servers, which want to run forever.
However, as soon as a web server accepts a connection, the expectation is that the connection will end at some point. There might be an opportunity for the workflow engine itself to act as the “run forever” part of a web server, spawning new workflow executions whenever a connection is made. That said, I haven’t thought much about this yet.
There might be other use cases which don’t fit some other assumption I made while thinking about this. If I ever think of something else that might not be a good fit, I’ll try to update this part of the post.