Optimizing the eviction of code containers from disk

I’m working on a system to run code on demand (starting thoughts), so I’m always thinking about the system’s current and future architecture: things that may happen down the line and how I could react to them, possible ways to expand the system, how I can further simplify what I already have, and so on.

At the end of August, I came across Juncheng Yang’s post about his work on S3-FIFO, a new cache eviction algorithm, and it made me think about one of the parts of the system I’m building (I’ll refer to it simply as the system throughout this post).

In the system, machines receive some input to work on, along with a reference to the code that needs to run. To do the work, they need to fetch the code from a repository and then run it with the given input; for faster access on subsequent invocations, that code stays on the machine’s local disk. Since the system will receive many different pieces of code to run over its lifetime, a machine will eventually run out of disk space if old code isn’t removed. This resembles cache eviction, and I wondered whether I could adapt S3-FIFO to work in this scenario.

Since the system functions similarly to every other Function-as-a-Service system out there, I did some light searches for publications on this eviction mechanism, but couldn’t find anything interesting, even after getting some good pointers from Juncheng. Almost every result talked about the optimization of cold-start latencies (in most cases by keeping the code running), which is related to the scenario I had in mind, but still not the exact thing: the system still has to fetch code in certain cases, and optimising this particular step is what I’m interested in.

Dynamics of evicting code

When a machine receives a signal to run some code, it first needs to fetch that code. The machine stores the code on disk and keeps it there while the code runs. It could also keep the code in memory and remove it from disk (or never write it there to begin with), but in practice this behaviour would only help in rare cases while making all the normal cases less efficient.

Given that the system knows which machines already have a piece of code cached on disk, it can route extra invocations of the same code to those machines, because they won’t need to fetch the code again. This “optimisation” (more like common sense tbh) improves the performance of the system: code starts up faster, and the system minimises the number of code pulls from a repository, leaving more of the network for other data.
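
As a rough sketch of that routing preference (this is not the system’s actual scheduler, just an illustration of the idea):

```python
import random


def pick_machine(machines, code_id):
    """Prefer a machine that already has this code cached on disk.

    `machines` maps a machine id to the set of code ids cached on its disk.
    This only captures the cache-affinity part; a real scheduler would also
    weigh load, free disk space, and so on.
    """
    with_code = [m for m, cached in machines.items() if code_id in cached]
    return random.choice(with_code or list(machines))


# Example: "m2" already has "fn-a" cached, so it gets picked.
print(pick_machine({"m1": set(), "m2": {"fn-a"}}, "fn-a"))
```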

The need for eviction arises because at some point we need to remove older code that hasn’t run in a while to make space for newer code on the machine’s local disk. We also need to be careful not to remove code that is currently running on the machine, even if it hasn’t run for a while. This could happen if the machine is running a ton of quick invocations with different code, or if there’s a long-running piece of code that ran only once. It means that, for some period of time, there are entries in our cache that we simply can’t remove at all.

I haven’t found eviction algorithms that work with this constraint, so I set out to experiment with adaptations that introduce this constraint in existing algorithms, including S3-FIFO.

While working on these adaptations, I also realised I could approach the problem in a different way: since we don’t want to evict running code, we can wait until it has finished running before adding it to the eviction queue (it’s already in the cache, just not eligible for eviction yet). The size of the eviction queue then becomes dynamic: if the queue is at 10GB and a signal comes in to run some code weighing 1GB, the queue shrinks to 9GB. Similarly, when a piece of code weighing 1GB finishes running, the queue grows by 1GB and is instantly filled with the code that just finished. This allows the use of existing eviction algorithms without introducing a new constraint. Dynamically resizing the queue seems easier than making existing items unremovable: shrinking just means evicting items until the queue is at or below the target size, and growing is almost instant, with very little extra work.

Given that we can’t remove code from the local disk until it has finished running, it’s possible for a machine’s disk to overfill: enough pieces of code can be running at the same time that we can’t add any more without exceeding the disk’s capacity. This isn’t something a machine can solve on its own; the part of the system doing the work distribution must avoid the issue.

The way the system distributes work between machines affects how well code is reused. Because of that, it would be interesting to also test the system as a whole when comparing cache eviction algorithms, but such a test would measure a blend of things (the work distribution algorithm and the cache eviction). Since I wanted to scratch this itch specifically about cache eviction (and because I don’t really have a good work distribution system yet), I assumed no work distribution and experimented with cache eviction on a single machine (i.e. one machine running all invocations). Because of this, for each experiment I had to calculate the minimum cache size that would let the experiment finish without overfilling the machine’s local disk: the largest sum of disk space used by invocations running at the same time.
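
Computing that minimum is a simple sweep over the start and end events of the trace; roughly something like this (the tuple layout is just for illustration):

```python
def minimum_cache_size(invocations):
    """Largest sum of disk space used by invocations running at the same time.

    `invocations` is an iterable of (start_time, end_time, code_size) tuples.
    Any cache at least this big never overfills when a single machine runs
    the whole trace.
    """
    events = []
    for start, end, size in invocations:
        events.append((start, size))   # the code must be on disk from here...
        events.append((end, -size))    # ...and only becomes evictable here
    events.sort()                      # ends sort before starts on ties
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Two invocations overlap between t=5 and t=10, so the peak is 3 + 2 = 5.
print(minimum_cache_size([(0, 10, 3), (5, 20, 2), (30, 40, 4)]))
```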

Finding some data for experiments

The system I’m working on is still not in production (which means no real-world data to experiment with), but I wanted to get as close to production-like traffic as possible. After digging through the Internet, I found zero datasets that had everything I needed to run some experiments.

I needed a dataset with an identifier for the code/function to run, a timestamp indicating when it started, a duration (or a timestamp indicating when it finished), and the size of the code. This would let me simulate all invocations with each cache eviction adaptation I wrote (when I talk about the results from my experiments in a later section, this is essentially what an experiment consists of). Most datasets provided the first three pieces of data, plus some extra information such as CPU and memory usage, which didn’t help much with code size. The best “open” dataset I found was the Azure public dataset, which is far from open.
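
In other words, each simulated invocation boils down to a small record like this (the field names are mine, not the dataset’s column names):

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    function_id: str    # identifier of the code/function to run
    start_time: float   # when the invocation started
    duration: float     # how long it ran
    code_size: int      # size of the code on disk, in bytes

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration
```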

That dataset contains two collections that interested me: a trace of function invocations from 2021, and a larger, summarized collection of function invocations from 2019.

The data from 2021 turned out to be not interesting at all. After initial runs with a simple FIFO adaptation, I decided to dig more into the data and discovered that it contained only 119 applications (“application” is Azure’s term for a deployable piece of code), which were often running all the time, so I achieved a 99.8% code reuse rate with the simplest adaptation I came up with.

This left me with the 2019 dataset, which was harder to process because the data was all summarized, and I had to do something about that.

Hydrating the 2019 dataset

The 2019 dataset doesn’t contain per-invocation data. Instead, it contains percentiles of execution times for each day of the data capture period. And those percentiles are actually percentiles of 30-second averages of the per-invocation metrics, which seems insane to me (percentiles of averages, really?). I don’t know how Azure isn’t ashamed of releasing a dataset like this, but I guess I should at least be thankful that they gave me something I could use to try to mimic production-like traffic. Still, I had to find a way to convert those percentiles into usable data.

A couple of years ago, after a chat with a former coworker, I was writing a distributed systems simulator and ended up needing to simulate individual metrics given specific percentile profiles. Here’s how it would work: you’d input the p50, p90, p95, and other such metrics for every component in the system, and the simulator would simulate individual requests back and forth. You could then play with the parameters to better understand the failure conditions of the system as a whole and find its weakest links, which make interesting targets for optimisation. I ended up abandoning the simulator (perhaps I’ll pick it up again in the future!), but the code I wrote back then became useful now: I had percentiles from Azure’s dataset and needed a way to simulate individual invocations so I could throw them at the different cache eviction algorithms I wanted to experiment with.

The code uses metalog distributions to go from percentile values to a distribution that generates values following the original percentile data. Since I have the minimum and maximum metrics from the Azure dataset, I used the bounded metalog equations to get to a quantile function.
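
To give an idea of what that looks like, here’s a small sketch of fitting a bounded metalog to a handful of percentiles and sampling from it. This isn’t the simulator’s code, the numbers are made up, and the fit is a plain least-squares solve on the metalog basis:

```python
import numpy as np


def metalog_basis(y, n_terms):
    """Metalog basis functions evaluated at quantile levels y (0 < y < 1)."""
    y = np.asarray(y, dtype=float)
    logit = np.log(y / (1 - y))
    cols = [np.ones_like(y), logit, (y - 0.5) * logit, (y - 0.5)]
    power = 2
    while len(cols) < n_terms:
        cols.append((y - 0.5) ** power)              # next odd term
        if len(cols) < n_terms:
            cols.append((y - 0.5) ** power * logit)  # next even term
        power += 1
    return np.column_stack(cols[:n_terms])


def fit_bounded_metalog(percentiles, values, lower, upper, n_terms=4):
    """Fit a bounded metalog quantile function to (percentile, value) pairs.

    The bounded form maps the values into logit space using the known lower
    and upper bounds, fits an unbounded metalog there with least squares, and
    maps samples back through the inverse transform.
    """
    y = np.asarray(percentiles, dtype=float)
    x = np.asarray(values, dtype=float)
    z = np.log((x - lower) / (upper - x))   # values must lie strictly inside the bounds
    coeffs, *_ = np.linalg.lstsq(metalog_basis(y, n_terms), z, rcond=None)

    def quantile(p):
        m = metalog_basis(np.atleast_1d(p), n_terms) @ coeffs
        return (lower + upper * np.exp(m)) / (1 + np.exp(m))

    return quantile


# Made-up per-day duration percentiles (in ms) for a single function.
quantile = fit_bounded_metalog(
    percentiles=[0.25, 0.50, 0.75, 0.99],
    values=[120.0, 300.0, 900.0, 4000.0],
    lower=50.0,     # dataset's reported minimum
    upper=6000.0,   # dataset's reported maximum
)
rng = np.random.default_rng(42)
simulated_durations = quantile(rng.uniform(size=10))
print(simulated_durations)
```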

The Azure dataset captured averages rather than the raw data, so the values I get from the distribution should also be interpreted as averages rather than actual execution times. I decided to skip a step and interpret the averages as if they were actual invocation times, because there’s only so much I can do with the values in the dataset without introducing even more anomalies (by e.g. trying to go from averages to individual times with yet another model).

The dataset doesn’t include the disk size of the applications it ran, so I had to improvise here too: I used the maximum memory usage as a silly and naive heuristic for the size of the code on disk. After inspecting the values I got from this, I decided they were good enough for these experiments, because they followed a distribution I’d also expect from the sizes of code pieces in the system I’m developing.

After all of this, hydrating the values for every invocation required more time and disk than I expected: the dataset summarises over 12 billion invocations, and the final file (with one invocation per line) weighed over 1.2TB once the hydration code finished running.

The adaptations

I had planned to implement five or more different algorithms with adaptations. I started with plain FIFO adaptations, which got 95%+ cache hit rates in initial results. After implementing the S3-FIFO adaptations, this number went up further (see later sections for actual numbers). It took me a few weekends of free time to get this far, and by then I had started to lose interest in adding more algorithms, since my itch was already scratched: I had a pretty good idea that, with what I already had, the system would work sufficiently well (at least from the cache eviction perspective). I could go back to focusing on finishing the system and think about optimising it further later, when I’d have data from the system itself to experiment with. So I ended up with 2 adaptations for each of 2 algorithms.

The original algorithms treat each item in the cache as having the same size: if a cache has size X, it can hold X items. In practice, each item has a different size, so one of my adaptations was to give items sizes within the algorithm and make the cache size dynamic, with items only added to the cache after they finish executing. I came up with the term size-aware to identify these adaptations.

Recall that the other approach is to add items to the cache but prevent them from being evicted while the machine is using them. This constraint became the second adaptation I made, which I named size- and time-aware, because items additionally can’t be evicted from the cache for some duration. Here, items are added to the cache before execution starts.

Size-aware FIFO adaptation

This adaptation is straightforward: items carry their size, the queue has a capacity in bytes, and when adding an item we remove items from the front of the queue until there’s enough free space for the new one.

As a consequence of this adaptation, the queue may remove more than one item before adding another. The queue size is also dynamic, so if the queue needs to be shrunk, we remove items from it until its total size is below the target size.
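
A minimal sketch of this adaptation in code (the names and structure are mine, not the actual implementation’s):

```python
from collections import OrderedDict


class SizeAwareFIFO:
    """FIFO cache where items have byte sizes and the capacity is in bytes.

    Running code isn't in the queue at all; it only reserves part of the
    byte budget, which is what makes the queue's usable size dynamic.
    """

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.reserved = 0            # bytes taken by code that is running
        self.queue = OrderedDict()   # code_id -> size, oldest entry first

    def _evict_until_fits(self):
        budget = self.capacity - self.reserved
        while self.queue and sum(self.queue.values()) > budget:
            self.queue.popitem(last=False)   # also delete the file from disk

    def start(self, code_id, size):
        """Called when an invocation starts. Returns True on a cache hit."""
        hit = self.queue.pop(code_id, None) is not None
        self.reserved += size                # shrink the queue's budget
        self._evict_until_fits()             # make room for the fetched code
        return hit

    def finish(self, code_id, size):
        """Called when the invocation ends; the code becomes evictable."""
        self.reserved -= size                # grow the budget back...
        self.queue[code_id] = size           # ...and fill it with this code
```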

Size- and time-aware FIFO adaptation

Items have a size and a flag that indicates whether they can be removed from the queue. In my implementation, the flag is evaluated by comparing the timestamp at which the item expires with the current timestamp inside the simulation, since we know the start and end timestamps of every invocation.

When we want to add an item to the queue:

  1. If the queue has space to hold the item, add it and stop; otherwise keep going.
  2. Remove the next item from the queue that indicates it can be removed. If we find no item like this, panic.
  3. Go back to step 1.
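
Here’s a minimal sketch of that loop, where expires_at is the invocation’s end timestamp from the simulation (again, names and structure are mine):

```python
from collections import deque


class SizeAndTimeAwareFIFO:
    """FIFO cache whose entries can't be evicted before they expire."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.queue = deque()   # (code_id, size, expires_at), oldest first
        self.used = 0

    def add(self, code_id, size, expires_at, now):
        # Steps 1-3 above: evict removable entries until the item fits.
        while self.used + size > self.capacity:
            _, evicted_size, _ = self._evict_one(now)
            self.used -= evicted_size
        self.queue.append((code_id, size, expires_at))
        self.used += size

    def _evict_one(self, now):
        # The next item that "indicates it can be removed" is the oldest one
        # whose invocation has already finished at the current simulation time.
        for i, entry in enumerate(self.queue):
            if entry[2] <= now:
                del self.queue[i]
                return entry
        # The precalculated minimum queue size guarantees this never happens.
        raise RuntimeError("no evictable entry in the queue")
```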

As I mentioned in an earlier section, the queue could overfill because of the time constraint, which is why I precalculated the minimum queue size that would avoid this.

Size-aware S3-FIFO adaptation

In this adaptation, when we need to add an item to S3-FIFO, we ensure there is enough space by evicting as many items as needed to free up space. Everything else works the same way.

Shrinking the queue works similarly: we just evict items until its total size is below the target size.
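
Here’s a rough sketch of what this can look like. This isn’t my actual implementation: the 10% split between S and M, the frequency cap, and the ghost queue bound follow my reading of the S3-FIFO paper, and the promotion details may differ.

```python
from collections import OrderedDict


class SizeAwareS3FIFO:
    """S3-FIFO with byte-sized items: small queue S, main queue M, ghost G."""

    def __init__(self, capacity_bytes, small_ratio=0.1):
        self.capacity = capacity_bytes
        self.small_target = int(capacity_bytes * small_ratio)
        self.small = OrderedDict()   # code_id -> size, oldest entry first
        self.main = OrderedDict()    # code_id -> size
        self.ghost = OrderedDict()   # code_id -> None, recently evicted ids
        self.freq = {}               # code_id -> access count, capped at 3

    def access(self, code_id, size):
        """Returns True on a cache hit."""
        if code_id in self.small or code_id in self.main:
            self.freq[code_id] = min(self.freq.get(code_id, 0) + 1, 3)
            return True
        # The size-aware change: evict as many items as needed for `size` bytes.
        while (self._bytes(self.small) + self._bytes(self.main) + size > self.capacity
               and (self.small or self.main)):
            self._evict_one()
        self.freq[code_id] = 0
        if code_id in self.ghost:            # seen recently: straight into M
            del self.ghost[code_id]
            self.main[code_id] = size
        else:                                # otherwise into S
            self.small[code_id] = size
        return False

    def _bytes(self, queue):
        return sum(queue.values())

    def _evict_one(self):
        # Same choice as plain S3-FIFO: evict from S while it holds more than
        # its share of the cache, otherwise evict from M.
        if self.small and (self._bytes(self.small) >= self.small_target or not self.main):
            self._evict_from_small()
        else:
            self._evict_from_main()

    def _evict_from_small(self):
        code_id, size = self.small.popitem(last=False)
        if self.freq.get(code_id, 0) > 0:
            self.main[code_id] = size        # accessed while in S: promote to M
        else:
            self.ghost[code_id] = None       # gone from disk, remembered in G
            while len(self.ghost) > max(len(self.main), 1):
                self.ghost.popitem(last=False)
            self.freq.pop(code_id, None)

    def _evict_from_main(self):
        code_id, size = self.main.popitem(last=False)
        if self.freq.get(code_id, 0) > 0:
            self.freq[code_id] -= 1          # second chance: reinsert as newest
            self.main[code_id] = size
        else:
            self.freq.pop(code_id, None)     # evicted from disk for good
```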

Size- and time-aware S3-FIFO adaptation

When evicting items from either S or M, we follow the usual S3-FIFO steps, but skip over any item whose time constraint flag says it can’t be removed yet.
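
In the same spirit as the size- and time-aware FIFO adaptation, here’s a sketch of what the two eviction routines can look like. They would replace the eviction helpers from the size-aware sketch above, with queue entries now carrying an expiry timestamp:

```python
def evict_from_small(small, main, ghost, freq, now):
    """Evict one entry from S, skipping entries that are still running."""
    for code_id in list(small):                  # oldest entries first
        size, expires_at = small[code_id]
        if expires_at > now:
            continue                             # time constraint: can't remove
        del small[code_id]
        if freq.get(code_id, 0) > 0:
            main[code_id] = (size, expires_at)   # promote to M instead
        else:
            ghost[code_id] = None                # actually gone from disk
            freq.pop(code_id, None)
        return
    raise RuntimeError("nothing evictable in S")


def evict_from_main(main, freq, now):
    """Evict one entry from M, skipping entries that are still running."""
    for code_id in list(main):
        size, expires_at = main[code_id]
        if expires_at > now:
            continue
        del main[code_id]
        if freq.get(code_id, 0) > 0:
            freq[code_id] -= 1
            main[code_id] = (size, expires_at)   # second chance: reinsert as newest
        else:
            freq.pop(code_id, None)              # evicted from disk for good
        return
    raise RuntimeError("nothing evictable in M")
```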

Due to how S3-FIFO partitions its queue size into S and M, and eagerly moves things from S to M, it’s possible to run into cases where we need to insert an item into M, but M is already full and none of its items can be evicted due to the time constraint flag. Thus, the minimum queue size (for experiment purposes) needs to apply to M’s size rather than to S3-FIFO’s total size. Because of this, S3-FIFO with this adaptation needs to be bigger than its FIFO counterpart, so to keep the evaluation fair, I increased the FIFO queue to match S3-FIFO’s total size.

Evaluation

The following table shows the cache hit rates of all adaptations after running on the full 1.2TB of invocation data. I used different names in the code, so to link them to the adaptations in this post: fifo_sized and s3fifo_naive_sized are the size-aware adaptations, while fifo_timed and s3fifo_naive_timed are the size- and time-aware ones.

I used “naive” in the names purely because that was my first idea for an adaptation of S3-FIFO.

| Adaptation         | Cache hit rate |
|--------------------|----------------|
| fifo_timed         | 95.1858%       |
| s3fifo_naive_timed | 96.8998%       |
| fifo_sized         | 96.5972%       |
| s3fifo_naive_sized | 96.0786%       |

The following graph shows the cache hit rates for different amounts of data used in the experiment. Measuring the amount of invocation data by the disk size of its text representation (instead of by number of invocations) is a bit odd, but it was the quickest way I found to get smaller partitions of invocation data to test with.

Cache hit rates per amount of invocation data used. The hit rates trend up on all adaptations.

The size- and time-aware S3-FIFO adaptation ends up outperforming every other one. The size-aware FIFO adaptation is the second best approach, and given its simplicity compared to size- and time-aware S3-FIFO, I’d say it performs surprisingly well. I think that other changes to size- and time-aware S3-FIFO could improve its performance, but that’s something for the future.

To understand how much the size of the cache influences the performance of the adaptations, I ran a series of experiments with varying cache sizes. I started with 1.1x the minimum cache size required for the experiment (to account for S3-FIFO’s higher cache size requirements), and ended up at 2x the minimum cache size required.

Cache hit rates based on cache size. The hit rates trend up on all adaptations.

The size-aware S3-FIFO adaptation, which didn’t perform as well as the size- and time-aware one, gets a big improvement with bigger cache sizes (still not enough to beat size- and time-aware). I believe this is due to how the core S3-FIFO algorithm works. It just manages to use its queues really well.

Things for the future

For now, I’ve had fun playing with this problem and am satisfied with the solutions I found, so I’ll go back to focusing on building the system as a whole. At some point I’d like to add more algorithms to this, but only after I have a system running some real-life traffic.

I’d also like to experiment with different adaptations of size- and time-aware S3-FIFO. Some ideas I have:

If you found this interesting and would like to chat about it (or about anything else, really), feel free to drop me an email. I’m also interested in running more experiments, so if you come across more data that fits the scenario I described in this post, let me know!