I’ll start with a disclaimer just to be very clear: I used to work on AWS Step Functions, which is a workflow orchestration engine. I specifically interviewed for that team because I was, and still am, interested in this whole workflow thing. However, I haven’t been employed at AWS for over a year, and nothing I say here is affiliated with AWS or with the Step Functions team. It’s all my own opinion and my own research.
Throughout this post, I’ll talk about some technical things in simplified ways. In some cases, it’s because the details really don’t matter much to the point I want to make. In other cases, it’s because I really don’t know better and am oblivious to the smaller details. I hope the small details don’t change the overall point of the post, but if they do, let me know so I can do some more thinking and reevaluate what I said :)
How this started
A recent post about Sidekiq made me revisit quite a few things I’ve held in my head for many years, and now I decided to put them in writing. First of all, Sidekiq’s success seems to be very well deserved, and I congratulate Mike for what he’s accomplished. I don’t know when exactly I first read about Sidekiq, but over the years I’ve followed some updates (mostly shared on Hacker News), and both the software and the business model around it are inspiring on many levels.
The post I mentioned goes through a bit of the history of job systems in Ruby, noting that the CEOs of Shopify and GitHub both worked on their own job systems. While those are not “complete” workflow engines in the sense most people would expect nowadays, they do share some similarities. What’s funny is that if you try to search for more systems like that, you’ll come across resources like this, which only scratch the surface of the number of workflow engines out there (and those are only the public, more well-known ones!). Search ProductHunt for words like “workflow”, “orchestration”, or “automation” and you’ll find even more, although those products generally cater to a different audience (more end-user-ish). Nonetheless, they’re still worthy of inclusion in this workflow/job/task orchestration thing.
The earliest I remember developing opinions and ideas in this area was back when I was studying computer science at university. I studied operating systems quite a bit, and have always been fascinated by all the intricacies of them. If you think about it, an operating system is also a kind of job/task orchestrator. After all, it needs to schedule processes, threads, and every other kind of computation that needs to get done! Scheduling these is not too straightforward. There are too many things to consider, like processes that may be waiting on signals or some external source, or code that requested to be run on specific cores, interrupts getting in the way of already-running code, and so much more.
I think just like game developers have the urge to create their own game engine (just a recent example, no intention of shaming just to be clear), software engineers in general kind of have an urge to create their own operating systems (and some may have done so as part of a university course), or at the very least create their own workflow/job/task scheduling/orchestration system/engine. In a way, workflows exist everywhere, and it’s just too tempting to want to automate them all with our computer whispering super powers.
This is how I got started thinking about all this. What follows is almost a decade of passively thinking about these things every couple of weeks or whenever I read something in this area, and a few years of experience working on a workflow engine.
Why workflow engines?
Because we love automating things, duh.
We have computers that do work for us. We usually want to take advantage of that and automate as much of our work as possible. If there’s a workflow that always follows a specific pattern, it’s kinda obvious we’ll want to automate that workflow. That’s why you see so many automation projects on ProductHunt. Those projects are in one end of the spectrum of workflow automation, which is closer to a human.
On the other end, we have been writing all this code and it needs to run somewhere, so we also need to automate code executions. That’s what operating systems (also) do, and it’s on the other end of the spectrum.
However, somewhere in the middle, we need something that glues both of those ends. If we wanted to, we could write every human-facing workflow as its own code, its own binary to be run somewhere, but that’s not really practical. Besides, one of the things software developers are very good at (sometimes way too good, if you ask me) is identifying abstractions to be able to reuse work as much as possible.
When you look at it this way, a lot of the work that software does is just automating workflows, and almost always because at some point a human will benefit from that.
Small digression, but if you look at some reviews of the Factorio game from software engineers, you’ll see many of them saying how the game is just like software engineering. Factorio is a game about factory automation. I also agree the game is very much like software engineering, and the cool thing is you don’t write any code while playing the game. It’s just not a game about writing code. I find it fascinating how the game managed to capture the nature of software engineering so well.
This area in the middle is where these workflow engines, job systems, task schedulers, automation orchestrators (call them whatever you want, I’ll call them workflow engines) all sit. It’s where we provide an abstraction to everyone who wants to automate their workflows that are closer to humans, while also being a system automated itself (usually) by some operating system (there may be a few other layers of automation before it gets there, but I’m keeping it simple).
I say “closer to humans” above because we’ve built way too many systems to do all kinds of things for us, and sometimes we even need to automate workflows between those systems. In fact, we need to do this so often that nowadays most workflows that run in workflow engines are all about automating and integrating these different systems together.
And then maybe on top of that we have the more end-user-ish-facing services that automate things for real humans, but they too mostly automate a bunch of systems together.
It’s important to emphasize that I’m using the term “workflow engine” to refer to a lot more than what most people would consider to be a workflow engine nowadays. I’m doing this on purpose, and because I haven’t found a better term to describe this category of systems that we use to run workflows (of any kind). It may be wrong from a pedantic point of view, but I find it helpful because it helps to show similarities between many systems we think are completely different on the surface. I’m trying to provide a different perspective that ties a lot of things together, and I hope I’m successful at that.
Workflow engines want to be fault-tolerant and persistent
(From this point on, I’ll focus on this middle layer of workflow engines I’ve talked about.)
Some very simple engines provide no fault tolerance at all, which means that if a workflow fails while it’s being executed, the whole thing is gone. Additionally, if the entire engine crashes or is shut down, the queued and executing workflows are also gone, and won’t be restarted upon restarting the engine. These systems are akin to most operating system processes (although there are some projects trying to make operating systems more persistent, like Phantom OS).
However, almost every workflow engine is trying to abstract away the operating system and the details of where code is actually running, which translates to a need to provide some sort of fault tolerance. Persistency is an important attribute for achieving fault tolerance - workflows still need to complete even if a machine gets shut down, and the workflow engine needs to be able to retrieve its state from just before a failure or crash. To be fault-tolerant, engines need persistency.
Persistency, however, is a very tricky thing to accomplish, and there are multiple levels of persistency offered by workflow engines.
There are many other ways to see and categorize workflow engines, but I will care almost entirely about the persistency aspect here. For example, some engines see workflows as state machines, others see them as DAGs, others really don’t care and just execute code or binaries that you give them. None of that is very relevant to this post.
There are systems that guarantee persistency at the workflow level, meaning that they guarantee a workflow will run to completion. In practical terms, it means a workflow will either succeed, or, if it fails at any point, be retried as many times as it’s configured to. In these systems, if a workflow throws an error at some point, it will be retried from the beginning, which means you have to be very careful about operations that can’t or shouldn’t run more than once. Idempotency is a very important thing here (I’ll touch more on this later).
Sidekiq and AWS Lambda (and the equivalents in other clouds) are examples of systems like this.
“But AWS Lambda is a serverless compute service, it’s not a workflow engine!”
AWS Lambda is many things. But you can see each execution of a Lambda function as essentially a small workflow execution. Lambda makes sure the function runs to completion, and can even retry executions in some cases if you want it to. Keep in mind I’m using “workflow engine” to refer to more than just what people would consider “workflow engine” nowadays, and a job/task system is included in that.
Other systems guarantee persistency at the step level, meaning that they too guarantee a workflow will run to completion, but they also guarantee earlier steps already completed won’t be retried in case of a failure.
These systems are more interesting to us, because you can be less careful around non-idempotent operations. You know whenever a non-idempotent operation succeeds, it won’t run again even if the workflow fails at a later point and is retried (most engines actually let you control this behavior, but the fact that they allow only specific parts of the workflow to be retried is important).
AWS Simple Workflow, AWS Step Functions (and the equivalents in other clouds), Netflix’s Conductor, and Uber’s Cadence are examples of systems like this.
So within workflow engines, there are those that persist at the workflow level, and those that persist at the step level.
But how is this persistency implemented? Usually, there will be a database somewhere along the way. At the workflow level, the workflow engine will keep some data on what to execute (important!), as well as the input to that thing. It will enqueue that execution, and some worker process will pick up that data and start executing something with a given input. Once it’s done, the worker process reports success or failure, and if it failed, or the worker was shutdown, or some other catastrophic event happened, the execution data is enqueued again to be retried.
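That enqueue/execute/re-enqueue loop can be sketched in a few lines. This is a toy, not any real engine’s API: the `jobs` table, `enqueue`, `REGISTRY`, and `run_worker` are all made-up names, and a real engine would use a durable database instead of an in-memory one.

```python
import sqlite3

# Toy workflow-level persistency: jobs are stored with their input, and a
# failed job is simply re-queued to be retried from the beginning.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, func TEXT, input TEXT,"
    " attempts INTEGER DEFAULT 0, status TEXT DEFAULT 'queued')"
)

REGISTRY = {}  # function name -> callable the workers know how to run


def enqueue(func_name, payload):
    # Persist *what* to execute and its input (the important part!).
    db.execute("INSERT INTO jobs (func, input) VALUES (?, ?)", (func_name, payload))


def run_worker(max_attempts=3):
    # A worker picks up one queued job and runs it from the start.
    row = db.execute(
        "SELECT id, func, input, attempts FROM jobs WHERE status='queued' LIMIT 1"
    ).fetchone()
    if row is None:
        return
    job_id, func, payload, attempts = row
    try:
        REGISTRY[func](payload)
        db.execute("UPDATE jobs SET status='done' WHERE id=?", (job_id,))
    except Exception:
        # Failure (or a crashed worker detected some other way): re-queue
        # until the retry budget runs out.
        new_status = "queued" if attempts + 1 < max_attempts else "failed"
        db.execute(
            "UPDATE jobs SET attempts=attempts+1, status=? WHERE id=?",
            (new_status, job_id),
        )
```

Note that the *whole* job reruns on retry - which is exactly why idempotency matters so much at this persistency level.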
At the step level, things get interesting. The engine needs to persist the workflow execution data for every step along the way, not just at the beginning. On a failure, it knows exactly where the execution stopped, so it can pick up from that point forward. But wait, what is a step? We haven’t defined this, but it’s very important here! The truth though, is that a step really is whatever you want it to be.
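To make the step-level idea concrete before we try to define a step, here’s a toy checkpointing sketch. The names (`checkpoints`, `run_step`) are made up, and a real engine would persist the checkpoint store durably rather than in a dict:

```python
# Toy step-level persistency: each completed step's result is recorded, so a
# retried workflow replays past the steps that already succeeded.
checkpoints = {}  # (workflow_id, step_name) -> saved result


def run_step(workflow_id, step_name, fn, *args):
    key = (workflow_id, step_name)
    if key in checkpoints:
        # This step already completed in a previous attempt: don't rerun it,
        # just return the persisted result.
        return checkpoints[key]
    result = fn(*args)
    checkpoints[key] = result  # persist before moving on to the next step
    return result
```

If a later step throws and the whole workflow is retried, every `run_step` call before the failure point becomes a cheap lookup instead of a re-execution - which is the property that makes non-idempotent steps safe(r) in these engines.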
What is really executed?
Recall that workflow engines that provide persistency at the workflow level will execute something. That something can even be a binary sitting in your operating system. Usually, though, it’s gonna be close to that - in Sidekiq’s case, it’s a piece of Ruby code that you wrote. In AWS Lambda’s case, it’s a piece of code that you wrote. Because of the persistency level, this works pretty well with any kind of code.
At the step level, things get tricky. If we want to execute some code, what is really a step from the point of view of the engine? Is it every instruction in a binary, or some bytecode to be interpreted? If it’s that, how do you even persist that code execution after every step? Do you just save the content of all registers, the content of all memory mapped for the process running that code? What determines a failure? Is it an instruction or bytecode that throws an error? If a step is a single instruction, how do we even retry steps on failure? That’s useless, the instruction will just throw again!
Because of this and so many other tricky things, workflow engines that persist at the step level usually don’t directly run any kind of code, and they all have their own definition of steps. Those definitions are generally at least semantically similar among many of those engines, but the way the steps are represented definitely isn’t.
Some engines use a piece of JSON with a certain schema, others use YAML with a different schema, others define their own fully-fledged DSL, and others piggyback on existing languages and give you a library you use to describe the workflow, in a kind of workflow-definition-as-code (more on this later).
AWS Step Functions tried to standardize this a bit with the Amazon States Language, but all these systems have their own unique ways of working with workflows, and the ASL can’t capture everything. The ASL works if an engine sees workflows as state machines, but there are engines that see workflows as something way more complicated, and others that have a more simplified view.
What’s important to notice is that these engines went a level of abstraction higher to solve the problem of not being able to directly run any kind of code. As a result, now you have to figure out how to break your workflow into steps (whatever is the definition of the engine you’re using), and then describe the workflow as a sequence of those steps, and give that to the workflow engine.
Obviously, one of those steps has to be “execute some code” in some kind of way, otherwise the workflow engine becomes very limited on the kind of things it can do. What’s funny is that “execute some code” may mean executing a smaller workflow in another engine, like AWS Lambda. Crazy.
So now we have a workflow engine that persists stuff at the step level, using some kind of DSL to describe a workflow, which has some steps meant to execute some code, which run in a lower-level system that guarantees workflow level persistency, and that’s usually where most of the work happens.
At this point I can’t help but wonder: why are we doing all of this? At the end of the day, I just want to automate some workflows. But I had to break it down into smaller chunks of code, add an abstraction layer on top of it, all so I could guarantee I could execute a workflow step by step, making sure what was already run doesn’t run again. It’s a lot of overhead. But is it really necessary? Up until recently, I’d say it is, but I see some potential to simplify all of this madness.
All of that because we can’t just run any arbitrary piece of code in a reliably persistent manner.
Some might say that writing workflow definitions in whatever DSL is easier than writing code, so it’s more accessible to less technical people. I agree up to a point, but I’ll revisit this later when I touch on the workflow-definition-as-code thing.
The ideal workflow engine
If you ask me (and remember I’m focusing only on that middle layer of workflow engines), the ideal solution would be to just write some code and execute it in a workflow engine, which would be able to reliably persist that code execution, as well as break that code down into steps, and know which steps can/should be rerun in case of failures.
Unfortunately, it’s quite tricky to do this as I described, because it’s hard to define what a step is only from code that is meant to be executed. However, with some restructuring and helpful hints to whatever compiler/interpreter we’re using, this could be doable enough that the workflow engine will know what are the boundaries between each step.
But what if we just abandon this step thing? Why do we even want to break a workflow into steps, and persist each step? Why not just make every operation idempotent and rerun everything from the beginning in case of failure? After all, Sidekiq exists and it doesn’t care about steps, and surely there are a lot of people using it, so it should be possible to do this, right?
Can we make everything idempotent?
As it turns out, in theory almost every operation can be made idempotent, but in practice things get way harder when we apply this at the workflow level.
It’s also interesting that you really only need to care about idempotency when you need to communicate with some external sources. In practice, a bunch of code you’d want to run as part of a workflow is quite idempotent already: the control flow, math operations, string manipulation. Those are all things that will do the same thing over and over on every execution of the same workflow with the same inputs, so it’s quite alright to just let this code run again. In fact, persisting all of these steps is kind of inefficient (even though many of the step level engines do that). It’s cheaper to just let the processor calculate all of that every time.
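A trivial example of what I mean - pure computation like this is already idempotent, so persisting each line as a “step” buys nothing (the function and data here are made up for illustration):

```python
# Pure computation: no time, no network, no external resources. Rerunning it
# with the same input always produces the same output, so on a retry it's
# perfectly fine (and cheap) to just compute it all again.
def summarize(order):
    total = sum(item["price"] * item["qty"] for item in order)
    return f"{len(order)} items, total {total}"


order = [{"price": 5, "qty": 2}, {"price": 3, "qty": 1}]
```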
Throughout the thinking I’ve done so far, I have identified two “big problem” external sources that require some sort of “idempotency treatment”. One of them is time, the other is the network.
The issue with time
Making everything idempotent and just throwing all of that at a workflow level persistent engine is fine for short-lived workflow executions, but the longer you want your workflows to run, the higher chance you’ll have of that workflow encountering some kind of failure during its lifetime, especially catastrophic stuff like a machine being shut down.
It’s really not uncommon for certain workflows to have some kind of long wait in them, like days-long, or even month-long waits. And those are real workflows, needed by real humans.
Imagine retrying an entire workflow like this and having to wait all this time again! In most cases, this would even be a disaster.
In practice, what happens is that people often have to introduce their own “step level” persistency as part of their workflow. That way, if a workflow is retried for any particular reason, the code in the workflow would detect how long it has already waited for, and resume from there.
There are other cases where the current time is required (for example, for logging, compliance, or some other reason). The code may have already sent a timestamp to one service, then tried to talk to another service, failed, and upon a retry passed a completely different timestamp to that second service. If both services assumed the timestamps would be consistent, now you’re in trouble!
Overall, time issues are solvable as long as the code is aware of the problems with time, and the workflow engine provides some additional infrastructure to let the code retrieve whatever time it actually needs to, or the code just persists its own information in some way and retrieves it on every execution.
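Here’s a sketch of that do-it-yourself “step level” persistency for long waits. The names (`durable_wait`, `store`) are invented, and `store` stands in for whatever durable storage the workflow can reach:

```python
import time

# Instead of `sleep(days)`, persist an absolute deadline the first time
# through; a retried execution then only waits for the remainder, instead of
# restarting a days-long wait from zero.
store = {}  # workflow_id -> persisted deadline


def durable_wait(workflow_id, seconds, now=time.time, sleep=time.sleep):
    deadline = store.get(workflow_id)
    if deadline is None:
        deadline = now() + seconds
        store[workflow_id] = deadline  # persist *before* starting to wait
    remaining = deadline - now()
    if remaining > 0:
        sleep(remaining)
```

The `now`/`sleep` parameters are just there to make the sketch testable; the idea is the persisted deadline, not the clock plumbing.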
The issue with the network
These issues are kind of a subset of network issues with distributed systems in general. It’s almost always the case that at some point a workflow will need to talk to another service. What happens if it sends a network request, the service does some work, but the machine running the workflow code is shut down? The workflow engine will definitely rerun the code, probably on another machine, but now the request might be made once again, and depending on the behavior of the service receiving the request, there will be trouble. This is just one example of what can go wrong during network communication. There are many others.
The good news is that in almost all cases, once again, this is solvable by introducing a deterministic unique id on every request, and making sure that the service receiving the request handles idempotency by checking whether it has already processed a request with that unique id.
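The pattern looks roughly like this. Both sides here are illustrative stand-ins (`handle_payment` plays the receiving service, `charge` the workflow code); the key part is that the request id is derived deterministically, so a retried workflow produces the same id:

```python
# The idempotency-key pattern: the caller derives a deterministic id for each
# logical request, and the service deduplicates on it.
processed = {}  # request_id -> stored response (the "service" side)


def handle_payment(request_id, amount):
    if request_id in processed:
        # Duplicate delivery: replay the stored answer, don't charge again.
        return processed[request_id]
    response = {"charged": amount}  # the real side effect happens once
    processed[request_id] = response
    return response


def charge(workflow_id, step_name, amount):
    # Same workflow + same step always yields the same key, so a retried
    # workflow cannot double-charge.
    request_id = f"{workflow_id}:{step_name}"
    return handle_payment(request_id, amount)
```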
The bad news which makes this impractical is that depending on the services you want to use as part of your workflow, this unique id thing won’t be supported (and chances are high this is the case), and you may not even be able to change the service’s code and behavior.
Idempotency is just complicated
So, in theory we can make almost everything idempotent. In practice, this will either require us to implement our own step level persistency, or it will be just downright impractical anyway.
So maybe we really need that step level persistency. Maybe that DSL for the workflow definition really solves more problems than it introduces.
Workflow definition as code
Recall when I mentioned that some workflow engines let you define the workflow through code? Well, let’s talk about this.
With the coming of Infrastructure as Code, and a lot of the people using workflow engines already using some other kind of cloud service as well, the natural progression was to allow workflow definitions to be written as code.
This is a good thing in my opinion, because if you’ve ever seen one of those JSON/YAML files with a workflow definition in them, you’d notice that:
- there are lots of places where you can make an error like a typo (and some of those places are only evaluated at runtime, letting the error go unnoticed for a while)
- some of those definitions can get ridiculously big. So big that it takes a lot of time to even read and understand what the hell the workflow does.
Quite a few of the workflow engines let you visualize the workflow in some way, usually as some graph. This eases the pain of understanding the workflow, but doesn’t help with everything else.
Workflow-definition-as-code lets you take advantage of the compiler/interpreter to catch certain classes of errors that you would’ve likely made if you wrote the definition by hand, and that is a good thing.
But that argument about less technical people having an easier time with the DSL? It doesn’t hold up. These definitions can get complicated very fast. Moreover, if the ideal workflow engine existed, it would be straightforward to create a DSL that compiles down to code, or create some visual scripting thing that makes it easier to write the code. There are already multiple implementations of both methods (not specifically for workflows, as far as I know) and they are somewhat successful at lowering the barrier of writing code for less technical people. All of this means that the DSL really isn’t an advantage of step level persistent workflow engines.
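For a feel of what workflow-definition-as-code looks like, here’s a hypothetical library in this style. Nothing here is any real engine’s API; the point is that the interpreter catches a typo in a step name or a structural mistake long before anything is deployed:

```python
# A made-up workflow-definition-as-code library: registering Python functions
# builds a declarative definition instead of executing anything.
class WorkflowBuilder:
    def __init__(self, name):
        self.name = name
        self.steps = []

    def step(self, fn):
        # Decorating a function appends a step to the definition, in order.
        self.steps.append({"name": fn.__name__, "type": "task"})
        return fn

    def compile(self):
        # Structural errors are caught at compile time, not at runtime.
        if not self.steps:
            raise ValueError("workflow has no steps")
        return {"name": self.name, "steps": self.steps}


wf = WorkflowBuilder("signup")


@wf.step
def create_account(user): ...


@wf.step
def send_welcome_email(user): ...
```

Calling `wf.compile()` yields the kind of JSON-shaped definition you’d otherwise have written by hand, but misspelling `send_welcome_email` somewhere would now fail as a plain Python error.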
What about the other way?
At some point in the past, I thought to myself: “What about the other way? Why not let people write code just as they would normally, but instead of compiling to machine code or byte code, why not compile it to a workflow definition?”
That is a compelling thing. It would allow technical people to write code just as they normally would, without having to worry about the mess that is creating, maintaining, and editing a large workflow definition file. It would also allow non-technical people to write code using those tools I mentioned (visual scripting, easier DSLs), while also having the code eventually become a workflow definition.
It would also allow a much faster development cycle. I haven’t touched on this a lot, but one of the really annoying things with workflow engines, especially the cloud/3rd-party ones, is that it’s very slow to test workflows. The development cycle usually goes like this:
- First, you write the workflow definition.
- Then, you need to create a workflow object, or update an existing one in whatever engine you’re using.
- Then you trigger some executions, and inspect their results. It’s very likely there will be errors, so you’ll have to do the usual debugging procedures, and then start over.
If the developer experience around this process isn’t good, these steps could easily take minutes, if not hours. If your workflow needs to talk to tricky services, chances are you may not even want to send real requests to those services, so you’ll have to mock that service as yet another service, make sure it’s running somewhere (because the workflow engine will try to send requests to it), make sure your current workflow definition is pointing at the mock service instead of the real one, do the testing, debug stuff, and at the end remember to switch the definition back to point at the real service. Ugh.
And if you want to skip a service call, or do some testing-specific operation, your workflow definition is now deviating from the real workflow you want to run, so your testing will possibly skip over some important detail you’ll only realize when the real workflow executions start to fail.
Oh, and if you’re using a cloud service (or, more realistically, cloud services) to help with all of that, you’re paying for all of this while you’re just testing things.
Sorry for the digression. Back to where I was: converting real code into workflow definitions would allow a much faster development cycle, because then all you had to do for testing was run the code locally with your usual compiler/interpreter, and make sure everything was running fine. When you were ready, you’d use the workflow-specific compiler/interpreter, and have a workflow definition instead! Sounds easy, right?
I tried that. I worked quite a few weekends (oh, the things you do for passion) trying to get a proof of concept together, and it worked well enough that I saw some interesting value in that. Not everyone else did, though.
At the end of the day, it’s just a really hard thing to do, requiring a different set of skills (more compiler and less web service), but it’s doable.
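To give a tiny taste of the “more compiler” flavor of the work (this is nowhere near what my proof of concept did - a real version has to handle control flow, data flow between steps, and much more), here’s a toy that walks a function’s AST and turns each bare call into a step:

```python
import ast

# Toy code-to-workflow-definition "compiler": parse a function and extract
# its top-level calls as an ordered list of step names. Real compilation is
# vastly harder; this only handles bare `name()` call statements.
def extract_steps(source):
    tree = ast.parse(source)
    func = tree.body[0]  # assume the source is a single function definition
    steps = []
    for stmt in func.body:
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call):
            call = stmt.value
            if isinstance(call.func, ast.Name):
                steps.append(call.func.id)
    return steps


source = """
def onboarding():
    create_account()
    send_welcome_email()
"""
```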
It was only years later that I’d learn that someone else eventually also had the same idea, including the same approach and the same choice of language, and they got to generally the same amount of progress that I did before (apparently) abandoning the project too (when this post was written, the last commit on that project had been made roughly 3 years earlier).
Why does this all feel wrong to me?
Coming back to that question I keep asking myself: why are we even doing this? We’ve almost come full circle, writing code that becomes a workflow definition to execute the code we initially had, but can’t execute directly in a fault-tolerant, persistent way, because that’s tricky.
We can’t just make everything idempotent because that will require some step level persistency anyway, and writing real code that gets compiled to a workflow definition is a hard task, which gets even harder when you realize that there is really no standard for workflow definitions, so doing that for every workflow engine feels like writing a compiler for multiple architectures.
Is there a better way? Can we simplify all of this and gain our sanity back?
Let’s say I want to create the ideal workflow engine. People give me code to run, and I guarantee their code will run as it is, and if for some catastrophic reason the thing running the code fails, it will automatically start somewhere else from where it stopped.
What are the failure scenarios that I need to be worried about? I identified two categories:
- The thing running the code fails in a catastrophic way (e.g. the hardware crashes, the machine gets shut down).
- The thing running the code fails in a silent way (e.g. bit corruption while running, bit corruption when persisting state).
There’s also a scenario where the code itself fails in some way, but I’m not worried about that as a developer of a workflow engine. Handling those cases is just the engine’s job.
Scenarios in the second category are outside the scope of this post. Detecting silent data corruption is a very hard problem, and if you’re interested in recent efforts outside the storage space, these two papers from Meta might be interesting (they’re very easy to read).
To sum up, treating failure scenarios in this second category is something that involves more than just application-level efforts, which is why I won’t talk about it.
Scenarios in the first category can be approached purely at the workflow engine level, which is a good thing. To do that, we’ll have to persist workflow executions at the step level. And because of this, we need to define what a step is.
Defining a step
While playing with that real Python-to-workflow-definition idea, something occurred to me. Naturally, I was converting functions in the Python code to bits of code which would eventually be sent to AWS Lambda. I didn’t know which functions had “side effects” (i.e. which functions were not idempotent), so I had to treat all of them as individual steps and wrap some retries around each one (in case they failed due to some assumed dependency which could throw an error). This is some overhead, but it was the only way to be “safe”.
From this exercise, I arrived at a natural definition of a step: it’s just a function call! This doesn’t help too much, however. First of all, all code is essentially just a bunch of function calls, so this doesn’t get us too far. And without knowing whether a function call is idempotent, we’re stuck with the need to be “safe” and treat everything as non-idempotent. It’s also possible for things to go wrong within a non-idempotent function call itself in a way that’s impossible to recover from.
That’s not too encouraging, but let’s keep going. What if we knew which functions were idempotent, and which were not? It might help us somehow, but is this practical?
The easiest way to do this would be to have whoever created the code annotate each function. It would work, but would also be a huge burden on every person writing code to run on this workflow engine. And because of that, it wouldn’t work very well.
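For completeness, the annotation approach would look something like this (the `idempotent` decorator is a made-up name; the burden is exactly that every author has to remember to apply it correctly):

```python
# Sketch of author-supplied idempotency annotations: unmarked functions are
# conservatively assumed to have side effects.
def idempotent(fn):
    fn._idempotent = True
    return fn


def is_idempotent(fn):
    return getattr(fn, "_idempotent", False)


@idempotent
def normalize_email(addr):
    # Pure transformation: safe to rerun any number of times.
    return addr.strip().lower()


def send_email(addr):  # unmarked: the engine must treat this as a real step
    ...
```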
Knowing which functions are idempotent would help us focus on the non-idempotent work, so this exercise of identifying what group a function belongs to is still interesting. Let’s keep this in mind and continue thinking about what we’d do once we have this information.
We’ve defined steps as function calls, but how do we persist them? Assuming we know a function call is idempotent, we can simply choose to avoid persisting anything - in the worst case, we’ll just call it again and we’ll have the same results.
In practice, it gets a bit more complicated than that. Depending on how costly it is to go through each step, we may want to persist results just to avoid going through all the trouble of running the same code again. This is, however, mostly an implementation detail.
The more important thing is how do we persist the workflow execution?
Since we’re running any kind of code, we need to persist the entire state of the execution, which includes register values and used memory. However, there are some things in memory which may not make sense to persist: those are usually file descriptors, bits from shared memory, and other handles to things provided (usually) by an operating system.
I’m ignoring bare metal cases because those by definition already rule out the possibility of running code on top of a workflow engine.
An interesting thing to note is that all the things we wouldn’t want to persist are not specific to the code we’re running - they’re all related to the environment where the code is running. This is a super cool insight, and also one of those “duh” moments. The code starts pure, which means there’s no way for it to have knowledge of any external resources from the beginning.
Let’s ignore things like shared libraries, because those are really mostly an operating system abstraction. We can always just have statically-linked code.
This means that everything in the workflow code given to us is already idempotent! What’s non-idempotent are all the things around the resources provided by the environment. And the code accesses those through… function calls! So we can safely say that functions within the workflow code are idempotent, but all functions provided by the environment are not.
But wait a minute, it can’t really be that simple, right? Of course not.
If we look at this as a whole, from the moment the code makes its first non-idempotent function call to access some external resource, we can’t provide idempotency guarantees anymore. An example to illustrate this: let’s say I run some code which opens a file. Usually, the function call to open that file will return a file descriptor, and I can then operate on the file using that descriptor as a handle. But if a catastrophic scenario happens and we try to continue running the code after reviving it somewhere else, it will definitely not work - the machine running the revived code isn’t the machine that opened the file in the first place, so there’s no way that file descriptor will make any sense to it!
This is a problem because (as far as I know) there was never a system built to avoid a scenario like this.
But since the workflow engine is the one providing all this environment to the code, there’s nothing stopping it from tracking all the external resources the code needs to run! It could persist all of this information along with the code’s registers and memory, and the next machine that revives the code can just prepare the environment to make it exactly the same as before the catastrophic scenario.
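As a rough sketch of what that tracking could look like, using files as the external resource (plain Python; `ResourceTable` and the logical-handle scheme are hypothetical illustrations): instead of handing raw file descriptors to the workflow code, the engine hands out logical handles and remembers how each one was obtained, so a new machine can rebuild the same environment from that record alone.

```python
import os
import tempfile

# Hypothetical sketch: the engine mediates every file open, keeping a
# record (path, mode, offset) that is enough to re-create the resource
# on a different machine after a catastrophic failure.
class ResourceTable:
    def __init__(self):
        self.meta = {}   # logical handle -> (path, mode)
        self.files = {}  # logical handle -> live file object

    def open(self, handle, path, mode="r"):
        self.meta[handle] = (path, mode)
        self.files[handle] = open(path, mode)

    def snapshot(self):
        # Persist how to re-create each resource, plus current offsets.
        # A real engine would store this alongside registers and memory.
        return {h: (path, mode, self.files[h].tell())
                for h, (path, mode) in self.meta.items()}

    @classmethod
    def revive(cls, snapshot):
        # The "new machine" rebuilds every handle from the snapshot.
        table = cls()
        for handle, (path, mode, offset) in snapshot.items():
            table.open(handle, path, mode)
            table.files[handle].seek(offset)
        return table

# Set up a file the workflow will read (standing in for shared storage).
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as f:
    f.write("hello world")
    path = f.name

old = ResourceTable()
old.open("log", path)
assert old.files["log"].read(5) == "hello"
state = old.snapshot()               # persisted with the code's state

fresh = ResourceTable.revive(state)  # revived on the "new machine"
assert fresh.files["log"].read() == " world"
os.unlink(path)
```

This only works because the storage behind the handle is reachable from both machines, which is exactly where the Kubernetes ideas below come in.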
This is easier said than done, but I believe it’s completely doable with the correct infrastructure. To support this argument, I’ll briefly discuss two things from Kubernetes’s infrastructure.
Learning about Kubernetes was actually a catalyst that helped me connect all these dots.
Accessing storage from Kubernetes
This one’s quite straightforward - Kubernetes provides access to storage through volumes, and these are almost all decoupled from the actual machine running the code. Which means that Kubernetes has to make some preparations (usually some API calls) to access the storage abstracted by a volume, and then provide that storage to a container.
The ideal workflow engine could do the same thing, going further and disallowing any local files at all (Kubernetes still has support for those, but it explicitly states this is potentially a very bad idea). Since the storage is decoupled from the machines running code, any machine could access the storage (after some proper auth).
For “local file” support, workflow code can just keep things in memory. This is the same approach that Phantom OS takes with files.
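In most languages this is trivial; a hypothetical Python sketch of an in-memory scratch “file” that travels with the persisted memory image instead of living on one machine’s disk:

```python
import io

# An in-memory buffer standing in for a local scratch file: since it
# lives in the process's memory, it gets persisted and revived along
# with everything else, with no machine-local file descriptor involved.
scratch = io.StringIO()
scratch.write("intermediate result\n")
scratch.seek(0)
assert scratch.read() == "intermediate result\n"
```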
Service routing in Kubernetes
This one is a cool concept. Kubernetes allows multiple machines to route traffic between each other. It also supports a load balancer doing this job, but in case a specific machine directly receives a request for a service, that machine can route that traffic to some other machine that is a better fit for handling the request, essentially serving as an impromptu load balancer.
The same approach could be used in certain catastrophic scenarios, guaranteeing that whatever new machine ends up reviving the workflow execution will still be able to use the resources that existed on the previous machine.
This is a rare case because most of these catastrophic scenarios involve the initial machine shutting down, which means access to the resources will be lost anyway, but it’s a neat trick to keep in mind.
Realistically, losing a network connection isn’t too big of a deal because most software is already built with assumptions that network connections are not 100% reliable. Many protocols reflect this assumption by forcing the relevant state to be held at the application level, which means a connection might be lost and retried, and everything would still work. But it’s nice to know that in some cases we can avoid losing a connection.
Running in a virtual machine
Since this workflow engine is also some code running somewhere, chances are that it’ll run on top of an operating system (maybe! I think it might be possible to create this engine as an operating system itself too!). To persist code state (registers, memory), this engine would need to run inside the OS kernel to have access to all of the necessary information.
Running inside the kernel makes this a very high-effort and delicate task, but there’s a trick we can use to work around most of these problems: we can just force the workflow code to run inside a virtual machine. Doing this lets the workflow engine run just like any other code from the operating system’s perspective, while still giving it full control over the workflow execution code.
A big drawback is that this requires a virtual machine to be designed and created, and it’s almost certain that only one or two programming languages would support it.
Some time ago I started playing with Kubernetes, and by chance this led me to read about WebAssembly again, and I started connecting some dots and realizing that WebAssembly fits a lot of the needs of this workflow engine.
It defines a virtual machine, tons of languages already support it at some level, and I can only see this support improving.
Aside from that, the virtual machine specification forces a pure environment by default, which means there is no way for WebAssembly code to be non-idempotent, unless the host running the WebAssembly code defines non-idempotent functions that can be used. This means the workflow engine can be completely in control of all idempotent code, and also completely in control of any external resources that it provides to workflow execution code.
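A rough illustration of that host/guest split (plain Python standing in for a real WebAssembly host; the `Host` class, the import table, and the `now` function are all hypothetical): the guest can only reach the outside world through functions the host explicitly exposes, so the engine can journal every non-idempotent call it serves.

```python
# Hypothetical sketch of the host/guest split that WebAssembly enforces:
# the guest is pure except where it calls through the import table, and
# the host records every impure answer so a replay can see the same one.
class Host:
    def __init__(self):
        self.journal = []              # record of every impure call
        self.clock = iter(range(100))  # stand-in for a real clock

    def now(self):
        # A non-idempotent import: the host journals the value it
        # returned, so a revived execution can be fed the same answer.
        value = next(self.clock)
        self.journal.append(("now", value))
        return value

def guest(host_imports):
    # "Guest" code: pure computation except for the import-table call.
    t = host_imports["now"]()
    return t * 2

host = Host()
imports = {"now": host.now}   # the only door out of the sandbox
result = guest(imports)
assert host.journal == [("now", 0)]
assert result == 0
```

In real WebAssembly this boundary isn’t a convention but a property of the module format: the guest physically cannot name a function the host didn’t import for it.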
On top of that, there are already a lot of people putting effort into improving WebAssembly, which means that (hopefully) the developer experience and the capabilities of the virtual machine will only get better in the future. This makes it a very compelling technology for this ideal workflow engine.
I don’t have a lot of experience in WebAssembly right now, but from what I’ve read, I can’t see any issues that absolutely prevent this ideal workflow engine from existing. Yes, there are things which will require effort to be solved, but at least for now, it looks like this engine might be very possible. Heck, I might even start playing with this at some point and see what comes out of it.
A tricky use case
Everything that I wrote here, and many of the things I thought about, had the assumption that at some point the workflow execution ends. This assumption excludes workflows like web servers, which want to run forever.
However, as soon as a web server accepts a connection, the expectation is that the connection will end at some point. There might be an opportunity for the workflow engine itself to act as the “run forever” part of a web server, spawning a new workflow execution whenever a connection is made. However, I haven’t thought much about this yet.
There might be other use cases which don’t fit some other assumption I made while thinking about this. If I ever think of something else that might not be a good fit, I’ll try to update this part of the post.