What I Learned at Work this Week: OOMKilled Kubernetes

Photo by Pixabay: https://www.pexels.com/photo/green-motherboard-163140/

Last week, I learned about how to write a DAG in Airflow. That knowledge came in very handy this week, as about five different Airflow-related bugs were sent my way. One particularly sticky problem was a script that failed five days in a row because of a lack of memory. Fortunately, I got a lot of guidance from a teammate as we worked through it together.

Discovering the error

Ideally, we would learn about a process error because of a log or alert that notifies us when something goes wrong. But I was brand new to the process, so I learned about the error the hard way: the person who needs the CSV that this script generates didn’t get it and asked me why. As we traced the pipeline, I was shown logs that would have informed me of the failure earlier:

Airflow has a pretty user-friendly UI, so after a bit of practice I was able to find these logs on my own. We can search for a DAG from the homepage, click into it, and see what’s called a tree view, which splits our job into nodes and displays the status of each, usually success or failed:

At the top, we’ve got the full DAG, so the first row shows whether it was successful or not (usually not, hence my problem). Below the DAG, we see the start of our processes, the running of my script, and the “end” state. End is usually in yellow, which means “up for retry”: my job failed, so it was re-attempted. After we hit the retry limit, the DAG failed. If we hover over one of the squares, we can see more detail about the process:

If we click on the square, we’ll see a modal with more options, one of which is Log. Click through there and we’ve got access to the full log. The log for this job was long, so patience was key in parsing through it and not getting overwhelmed by things I didn’t understand. At least now I know I can search a failure log for certain keywords, like terminated or reason. When we check reason, we see OOMKilled.
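As a rough illustration of that keyword search (the log text and helper function here are made up, not actual Airflow output), filtering a long log down to the interesting lines might look like:

```python
# Hypothetical example: scan a failure log for the keywords that
# pointed us to the root cause. The log text below is invented.
log_text = """\
... many lines of task output ...
Pod airflow-task-abc123 terminated
    reason: OOMKilled
    exit code: 137
"""

KEYWORDS = ("terminated", "reason", "OOMKilled")

def find_keyword_lines(text, keywords):
    """Return the log lines that contain any of the given keywords."""
    return [line.strip() for line in text.splitlines()
            if any(k.lower() in line.lower() for k in keywords)]

for line in find_keyword_lines(log_text, KEYWORDS):
    print(line)
```

In practice the Airflow log viewer (or your browser’s find-in-page) does this job, but the idea is the same: don’t read everything, hunt for the failure vocabulary.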

Out of Memory

Once our error is named, the next step is usually to search it on Google. The coworker who helped me found a few resources that explained that OOM stands for Out of Memory, which meant that we’d have to change a setting on our DAG.

Loyal readers might remember that the DAG we created last week used a KubernetesExecutorConfig to determine CPU and memory usage. In that post, we established that Kubernetes is the platform that manages our containerized workflows, which in this case means the various DAGs we’ve got on Airflow. To Kubernetes, the jobs we’ve been calling DAGs run as containers, which are grouped into pods. Kubernetes uses pod settings to strategically place pods onto the nodes that make up a cluster. So a cluster might have 16GB of RAM, broken up into four nodes which each have 4GB. The same logic applies to CPU, which is measured in cores: if three pods each request 1 core of a 4-core node, they might all be placed on the same node, but if they each request 1.5 cores, at least one will have to move to a second node. Here’s a really helpful graphic from Sysdig:
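To make that placement idea concrete, here’s a toy first-fit sketch. This is my own simplification, not how the real Kubernetes scheduler works, but it shows why three 1-core pods can share a 4-core node while three 1.5-core pods can’t:

```python
# Toy first-fit placement, a simplification of Kubernetes scheduling:
# each node has 4 cores; every pod lands on the first node with enough
# spare capacity for its CPU request.
NODE_CPU = 4.0

def place_pods(cpu_requests, node_count=4):
    """Assign each pod (by CPU request) to the first node that fits it."""
    free = [NODE_CPU] * node_count
    placement = []
    for request in cpu_requests:
        for i, capacity in enumerate(free):
            if capacity >= request:
                free[i] -= request
                placement.append(i)
                break
        else:
            placement.append(None)  # nothing fits: pod is unschedulable
    return placement

# Three pods requesting 1 core each all fit on node 0...
print(place_pods([1.0, 1.0, 1.0]))
# ...but at 1.5 cores each, the third pod spills onto node 1.
print(place_pods([1.5, 1.5, 1.5]))
```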

In this post from Google, I learned that there are two settings we could alter to allocate memory for our containers: requests and limits. The memory request determines how much space must be reserved for the pod on a Kubernetes node. While CPU is parsed out in cores, memory is measured in mebibytes (MiB). A mebibyte is similar in size to the more commonly referenced megabyte, but it comes from a binary system of measurement rather than a decimal one: binary units increase by powers of 2, so one mebibyte is 2²⁰, or 1,048,576, bytes instead of 1,000,000. In the example above, the containers are requesting 300 MiB and 100 MiB of memory, respectively.
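The unit difference is small but easy to verify in a couple of lines:

```python
# Mebibytes (binary, base-2) vs. megabytes (decimal, base-10).
MIB = 2 ** 20   # 1 MiB = 1,048,576 bytes
MB = 10 ** 6    # 1 MB  = 1,000,000 bytes

print(MIB)         # bytes in one mebibyte
print(MIB / MB)    # a MiB is roughly 5% larger than a MB

# So a 4,000 MiB memory request is about 4.19 GB in decimal terms.
print(4000 * MIB / 10 ** 9)
```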

Remember that Kubernetes is going to try to distribute our pods so that node space is used most efficiently. If we have a node with 1000 MiB of space, Kubernetes might place one 600 MiB pod and one 300 MiB pod in it. If we’re not running a third pod with a 100 MiB (or smaller) memory request, the remaining memory in the node starts out unreserved. That means that if one of our jobs uses more than its requested amount of memory, it may be able to borrow some of that extra in order to finish running. That’s where the memory limit comes in.

We set both a memory request and a memory limit: the request gives Kubernetes a minimum for allocated memory, and the limit is our maximum. If the job uses more memory than the limit, it’ll be killed, even if there’s more space on the node. With this understanding, we could see two potential reasons for our OOMKilled error: the request was too low, or the limit was too low (or both).
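In Airflow 2, these two settings can be attached to a task through executor_config with a pod_override when running under the KubernetesExecutor. The sketch below is illustrative, not our exact production config: the task name and the 4,000/8,000 MiB values are assumptions chosen to match the numbers discussed later.

```python
# Sketch: setting a memory request and limit on an Airflow task that
# runs under the KubernetesExecutor. Task name and values are illustrative.
from kubernetes.client import models as k8s
from airflow.decorators import task

@task(
    executor_config={
        "pod_override": k8s.V1Pod(
            spec=k8s.V1PodSpec(
                containers=[
                    k8s.V1Container(
                        name="base",  # Airflow runs the task in a container named "base"
                        resources=k8s.V1ResourceRequirements(
                            requests={"memory": "4000Mi"},  # minimum reserved on the node
                            limits={"memory": "8000Mi"},    # hard ceiling: exceed it and the pod is OOMKilled
                        ),
                    )
                ]
            )
        )
    }
)
def generate_csv():
    ...
```

The request is what the scheduler guarantees; the limit is what the kubelet enforces. Everything between the two is memory the pod may get, but can’t count on.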

Thanks to the ease of Airflow integration, we already had a ready-made Datadog dashboard that tracked CPU and memory usage:

We could see that our job peaked around 6 GiB but crashed after half of that (where the line ends). Our memory limit was set around 8,000 MiB and our request at around 4,000. In hindsight, this performance makes perfect sense: our memory usage had some flexibility, but because our request was so much lower than our limit, we couldn’t reliably count on Kubernetes allocating that limit to our pod. As memory climbed beyond the request, our job failed, likely because a different job was using the memory we had previously “borrowed” to get up to 6.2 GiB.

The sweet spot

To get this job to run reliably, we had to come close to doubling our memory request and memory limit, and we’re still tweaking the settings based on additional data. We know the risk of being too conservative with these allocations is that certain tasks might simply fail to complete. But if we’re too aggressive, we could be using our resources inefficiently and inadvertently costing ourselves more in computing than we should be. My biggest takeaway is that the difference between our request and our limit was too large. These numbers don’t necessarily need to be identical, but setting them very far apart allows for too much variance: it likely means we can never count on the memory near our limit actually being available. Thanks to persistence and robust logging, we’ll be able to find the values for optimum performance.

