Managing Pod Memory and Preventing those OOMKills
Tim Nichols
CEO/Founder
tl;dr - Engineers spend too much time managing the CPU and memory of their Kubernetes pods. Here's a guide to make this common task easier, with some suggestions for shortcuts and alternatives.
Platform engineering pushes engineers to manage and monitor their Pod Memory
It’s 2024 and the platform engineering community has aligned on a few premises:
- Kubernetes is the standard orchestrator for an enterprise’s diverse workloads
- Central platform teams make Kubernetes accessible by building golden paths and standardizing cloud resources (node pools, load balancers, etc)
- Code owners should be able to manage their SLAs, availability and cloud costs
Does your organization agree?
If so, you and your team need to manage the resources your workloads are using on Kubernetes. Specifically, you need to actively manage the resource limits and requests of the pods used by your workloads.
- If your workload is CPU starved, you’ll hit performance and scheduling issues.
- If your pod doesn’t have enough Memory, Kubernetes will terminate the container and restart the pod (the dreaded OOMKilled).
- Spare CPU and/or spare Memory? You’re inflating your cloud bill and there’s a chance that your greedy pods are causing problems for someone else on your team.
In short, code owners need to manage pod resource consumption in order to hit their SLAs and avoid wasteful spending.
At Flightcrew we want to delegate this sort of task to AI, but if you insist on doing it manually here’s a guide for how to find and optimize how much Memory is used by your pods.
We’ll use sock-shop as an example microservice app on Kubernetes.
Step 0: Brush up on Kubectl
Kubectl is the native CLI for communicating with Kubernetes. It provides basic commands for accessing metadata, metrics, and logs so that you can check on your pods without bugging an SRE (or having an SRE bug you).
To be self-sufficient, read the kubectl docs and make sure kubectl config current-context points at the correct Kubernetes cluster.
- Your main tools are kubectl get, kubectl describe and kubectl top commands.
- You can use '-A' to search across all namespaces when you don't know where something lives. Use this to your advantage to get familiar with your cluster!
- There are several ways to reference the resource you want to get ('resource-type/resource-name', e.g. 'deployment/front-end'), and many resource types have a shorthand ('deploy/front-end'); 'kubectl api-resources' lists the available types and their short names.
- You can find resources by the labels you put on your workloads and pods: 'kubectl get pods -l name=front-end -n sock-shop' (a few worked examples follow this list)
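To make those commands concrete, here's a quick warm-up against the sock-shop example (substitute your own namespaces and labels):
❯ kubectl config current-context                    # confirm you're pointed at the right cluster
❯ kubectl api-resources | head                      # resource types and their short names
❯ kubectl get pods -A                               # every pod, across all namespaces
❯ kubectl get deploy -n sock-shop                   # 'deploy' is shorthand for 'deployment'
❯ kubectl get pods -l name=front-end -n sock-shop   # filter by label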
Step 1: Find the Pods & Nodes where your code is running
Kubernetes is designed so that large enterprises can run diverse workloads with a high degree of redundancy and customization. This flexibility means it can be difficult to know where your code is running and how to maintain it.
As a refresher, you’ll need to work with the following objects
- Pods are groups of containers with shared resources. They're the smallest deployable unit in Kubernetes.
- Nodes are the physical or virtual machines where pods run.
- Workloads are higher-level abstractions that manage pods at scale. You'll probably manage your pods with a Deployment.
- Namespaces allow you to isolate objects within your cluster based on team, environment, etc
Your first step is to find the pods where your container is running, and identify any workload objects that are controlling said pods.
If you used recommended labels, then it's trivial to find your pods with kubectl get:
❯ kubectl get pods -l name=front-end -A
NAMESPACE NAME READY STATUS RESTARTS AGE
sock-shop front-end-57f45b79fc-h5xzj 1/1 Running 0 12d
Or if you know what your pod is generally going to be named …
❯ kubectl get pod --all-namespaces | grep front-end
sock-shop front-end-57f45b79fc-h5xzj 1/1 Running 0 12d
However, if you’ve neglected labels then you’ll have to do some brute-force checking:
❯ kubectl get pod --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
argocd argocd-application-controller-0 1/1 Running 0 12d
argocd argocd-applicationset-controller-79d54f5d64-rgtf9 1/1 Running 0 12d
argocd argocd-dex-server-7f4b99d696-9ln8s 1/1 Running 0 12d
argocd argocd-notifications-controller-cdd87f4f5-tc65x 1/1 Running 0 12d
argocd argocd-redis-74d77964b-l7w2f 1/1 Running 0 12d
…
At Flightcrew we love the kubectl tree plugin (installed via krew), which draws a tree from the .metadata.ownerReferences that trace a pod back to the object managing it.
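If you don't have the plugin yet, it's distributed through the krew plugin manager:
❯ kubectl krew install tree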
❯ kubectl tree deployment/front-end -n sock-shop
NAMESPACE NAME READY REASON AGE
sock-shop Deployment/front-end - 476d
sock-shop ├─ReplicaSet/front-end-57f45b79fc - 333d
sock-shop │ └─Pod/front-end-57f45b79fc-h5xzj True 5d16h
sock-shop ├─ReplicaSet/front-end-74c6cb7766 - 390d
sock-shop ├─ReplicaSet/front-end-7b66ff8446 - 452d
sock-shop ├─ReplicaSet/front-end-7d89d49d6b - 476d
sock-shop └─ReplicaSet/front-end-9b64b6d49 - 452d
And then to find the node
❯ kubectl get pod -n sock-shop front-end-57f45b79fc-h5xzj -ojson | jq '.spec.nodeName'
"gke-sandbox-dev-default-pool-fed3013f-awph"
So kubectl can tell you the pods and nodes where your workload is running, and whether those pods are being controlled by a deployment or other intermediate object.
Step 2: Check Pod health
Now it’s time to use kubectl to check how our pods are doing. Once more this is easy with kubectl get
❯ kubectl get pods -l name=front-end -A
NAMESPACE NAME READY STATUS RESTARTS AGE
sock-shop front-end-57f45b79fc-h5xzj 1/1 Running 0 12d
or
❯ kubectl get pod -n sock-shop carts-645b945d94-qlvfp
NAME READY STATUS RESTARTS AGE
carts-645b945d94-qlvfp 1/1 Running 3044 (5m48s ago) 12d
What are you looking for?
'Running' in the status column is good. 100% Ready is also good. Restarts? Bad!
If you're having issues, the status and restarts will give you clues. In this case, the carts pod has 3044 restarts, and the most recent one was 5m48s ago.
To dig in, you can use kubectl describe:
❯ kubectl describe pod -n sock-shop carts-645b945d94-qlvfp
Name: carts-645b945d94-qlvfp
Namespace: sock-shop
Priority: 0
Service Account: default
Node: gke-sandbox-dev-default-pool-fed3013f-hk56/10.128.0.92
Start Time: Thu, 05 Sep 2024 01:40:23 -0700
Labels: name=carts
pod-template-hash=645b945d94
Annotations: <none>
Status: Running
....
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Tue, 17 Sep 2024 12:21:37 -0700
Finished: Tue, 17 Sep 2024 12:22:26 -0700
Ready: True
Restart Count: 3044
....
If you see OOMKilled, then you've probably solved your mystery. It means the OOM killer is terminating your memory-starved container (hence exit code 137). Very bad!
If you see any other error, the Events section at the bottom of kubectl describe can give hints (ImagePullBackOff, etc.). You can use kubectl logs to check whether an application error caused the container to crash.
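Because the crashed container has already been restarted, add '--previous' to read the logs from the last terminated instance:
❯ kubectl logs -n sock-shop carts-645b945d94-qlvfp --previous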
Step 3: Find How Much Memory each Pod is Using
You now know whether your pods are healthy - next, let’s check usage patterns to confirm whether resource utilization is causing you to crash or overpay.
The quick way to find current memory usage for pods and nodes is to run 'kubectl top pod name -n namespace' or 'kubectl top node name' (both rely on the metrics-server add-on). Note that kubectl top only works for pods in State: Running. Keep reading for a few tips on managing crashing pods.
❯ kubectl top pod carts-645b945d94-qlvfp -n sock-shop
NAME CPU(cores) MEMORY(bytes)
carts-645b945d94-qlvfp 15m 15Mi
or
❯ kubectl top node gke-sandbox-dev-default-pool-fed3013f-12j8
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-sandbox-dev-default-pool-fed3013f-12j8 149m 15% 1529Mi 54%
However, just getting the current Memory and CPU usage isn't very helpful. What if there are spikes? Or a sudden increase in usage? You want to set resource limits and requests for longer, representative usage patterns.
Your observability tool (Datadog, Prometheus) should have resource utilization metrics, but you can check CPU and memory utilization yourself by using 'watch' in the CLI to refresh the usage periodically.
For example, to check every 5 seconds:
watch -n5 "kubectl top pod carts-645b945d94-qlvfp -n sock-shop"
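If the pod runs more than one container, '--containers' breaks the usage down per container:
❯ kubectl top pod carts-645b945d94-qlvfp -n sock-shop --containers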
What are you looking for?
Well after making sure nothing is obviously wrong (ex: a memory leak), try to find the median and peak utilization for your workload. You want to build a mental model of your past resource consumption so that you can correctly estimate future allocation.
Step 4: Kubernetes Requests, Limits and Resource Contention
You now understand the usage patterns for your workload and have some sense of ‘correct’ resource allocation.
But before you change the size of your pod, it’s a courtesy (if not a requirement) to see if there’s space on the Node. If Pods are equivalent to containers, then you can think of Nodes as Hosts.
Kubernetes has many tools to help manage resource contention. For now you’ll want to make sure there’s enough space for your desired larger pod. If not, you could OOM again, and cause problems for other workloads on the node.
❯ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
gke-sandbox-dev-default-pool-fed3013f-12j8 204m 21% 1502Mi 53%
gke-sandbox-dev-default-pool-fed3013f-171i 502m 53% 1394Mi 49%
gke-sandbox-dev-default-pool-fed3013f-awph 251m 26% 1246Mi 44%
gke-sandbox-dev-default-pool-fed3013f-hk56 205m 21% 1550Mi 55%
gke-sandbox-dev-default-pool-fed3013f-o7dh 148m 15% 1208Mi 43%
gke-sandbox-dev-default-pool-fed3013f-qh11 621m 66% 928Mi 33%
gke-sandbox-dev-default-pool-fed3013f-we02 605m 64% 1197Mi 42%
gke-sandbox-dev-default-pool-fed3013f-xbuv 130m 13% 1075Mi 38%
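kubectl top shows live usage; to see how much of a node's capacity is already reserved by other pods' requests and limits, describe the node and look for the 'Allocated resources' section near the bottom:
❯ kubectl describe node gke-sandbox-dev-default-pool-fed3013f-hk56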
If there's not enough space, find out whether your cluster runs the cluster autoscaler. If not, use your cloud provider to provision an additional node.
Expressing your needs
Kubernetes has a flexible system for managing pod resource allocation.
- Memory and CPU are your Kubernetes Resources
- Resource Requests are your Pod’s steady-state allocation of CPU and Memory
- Resource Limits are the ‘hard cap’ enforced by the kubelet & container runtime
There are various debates online about the true best practice for how (and whether) you should assign these values. Right now your workload is getting killed, so let’s give it more Memory - specifically the limit it keeps hitting.
How much should you ask for?
In this case, our ‘carts’ container was using about 15Mi before it crashed, so we’ll raise the Memory limit above that peak and keep increasing it if the pod still gets OOMKilled. Once the pod is healthy, you can always go back to Step 3 to find true resource consumption and trim any spare CPU and Memory.
Ask and you shall receive
Using GitOps and IaC to manage Kubernetes resources is the best practice, but if you want to manually change your configs through kubectl, you can run kubectl edit. Keep in mind from Step 1 that if your pod is managed by another object, you have to edit that top-level object.
In this case, our carts pod is owned by a deployment
❯ kubectl tree deployment/carts -n sock-shop
NAMESPACE NAME READY REASON AGE
sock-shop Deployment/carts - 483d
sock-shop └─ReplicaSet/carts-645b945d94 - 483d
sock-shop └─Pod/carts-645b945d94-qlvfp False ContainersNotReady 12d
Let's first check how much CPU and Memory our deployment allocates to our crashing pod.
❯ kubectl get deployment/carts -n sock-shop -oyaml | yq '.spec.template.spec.containers[0].resources' -o yaml
limits:
  cpu: 3m
  memory: 5Mi # Memory Limit is HERE!
requests:
  cpu: 1m
  memory: 2Mi
So we currently have 5Mi as our limit, but earlier we saw a maximum usage of 15Mi. Let's increase the memory limit!
So, we’ll run
❯ kubectl edit deployment carts -n sock-shop
to open the config in your editor of choice (set via the 'KUBE_EDITOR' or 'EDITOR' environment variables).
…
spec:
  replicas: 1
  selector:
    matchLabels:
      name: carts
  template:
    metadata:
      labels:
        name: carts
    spec:
      containers:
      - image: weaveworksdemos/carts:0.4.8
        name: carts
        resources: # Edit the numbers HERE!!
          limits:
            cpu: 3m
            memory: 20Mi # Make this higher than 15Mi, keep increasing if we need to.
          requests:
            cpu: 1m
            memory: 2Mi
…
There you have it: saving the edit triggers a rolling update, and the replacement pod should have the Memory it needs to stay online.
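To confirm the change, watch the rollout finish and double-check that the new pod picked up the larger limit:
❯ kubectl rollout status deployment/carts -n sock-shop
❯ kubectl get pods -l name=carts -n sock-shop
❯ kubectl get deployment/carts -n sock-shop -oyaml | yq '.spec.template.spec.containers[0].resources'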
How long does it take to update Pod Memory & CPU?
To recap, we ….
- Connected to your cluster and brushed up on kubectl docs
- Found the pods where our code is running, and checked to see if another object (ex: a deployment) was managing these pods
- Checked if our pods were starved on Memory or CPU
- Checked trends in resource utilization to estimate ‘correct’ resource capacity for our workload
- Made sure our Node has enough ‘room’ - and then updated our pod to the correct size
This ‘Pod sizing’ workflow is a routine task that should only take 1-2 hours. That said, it could take more time if:
- You’re unfamiliar with the workload, its labels and traffic patterns
- You’re unfamiliar with Pods, Nodes and Kubernetes resource usage
- You’re not used to thinking like an SRE: capacity, buffers, etc
- You’re reducing resources, so there’s a risk of OOMKills, CPU throttling, etc.
- You’re working on a critical workload and need to be careful
When a team moves onto Kubernetes, each engineer loses 2.2 hours a week of ‘coding time’ to manual toil. Kubernetes resource management is one of these random tasks which can interrupt your day.
Shortcuts & Alternatives for Managing Pod CPU & Memory
Resource management is a ubiquitous problem on Kubernetes, so there are many tools that can help you monitor and manage resource usage:
- The AWS, GCP and Azure Kubernetes dashboards offer basic insights into resource health.
- Install Kube-State-Metrics (KSM) and Prometheus for basic metrics and visibility
- Vertical Pod Autoscaling (VPA) will automatically update Kubernetes memory requests and limits based on usage patterns … but this can be disruptive for many types of workloads.
- You can run VPA in ‘Recommender’ mode (see the sketch after this list), or use Goldilocks for basic resource request estimates.
- Have your platform team build and deploy autoscaling policies for each workload.
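For reference, here's a minimal sketch of a VPA object in recommendation-only mode. It assumes the VPA components (and their CRDs) are installed in your cluster, and the object name is illustrative:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: carts-vpa
  namespace: sock-shop
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: carts
  updatePolicy:
    updateMode: "Off" # recommend only; never evict or resize pods
With updateMode set to "Off", the recommender publishes suggested requests (visible via kubectl describe vpa carts-vpa -n sock-shop) but never touches your running pods.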
Let Flightcrew worry about Pod CPU and Memory
Flightcrew is an AI tool that helps engineers automate infrastructure and IaC.
One of our first applications is using Flightcrew to rightsize pods and nodes, like Dependabot.
Flightcrew continuously analyzes resource utilization of your infrastructure, generates Pull Requests to fix OOMs or waste, and warns you when a configuration change could break something important.
Send us a note to try it out :)
Tim Nichols
CEO/Founder
Tim was a Product Manager on Google Kubernetes Engine and led Machine Learning teams at Spotify before starting Flightcrew. He graduated from Stanford University and lives in New York City. Follow on Bluesky