Demo: Preemption failure due to pod affinity

Note: Due to time constraints, my demo at KubeCon Salt Lake City was a little abbreviated -- it showed the mechanics of the problem, but not very realistically. This one is slightly more elaborate (and a little more robust to deploy) and shows it better.

Setup

To run the demo you want a cluster with a reasonably-sized node that has at least 1.5 to 2 GiB of free memory. You'll need the following files:

base-workload.yaml
high-priority.yaml

Depending on which solution you want to try, you'll want one or more additional files, which will be linked in each solution section.

Deploy the low-priority base workload pods

The first manifest deploys a single replica of a small, labeled "marker" pod that will land on a random node, along with a scalable deployment that has affinity for the marker pod:

$ kubectl apply -f base-workload.yaml
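
For reference, this is roughly what base-workload.yaml contains. Treat it as a sketch, not the real file: the container image, the exact base-workload labels, and the affinity details are my assumptions, so defer to the linked manifest.

# Sketch of base-workload.yaml -- image, base-workload label value, and
# affinity details are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: marker-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubecon-demo
      component: marker-pod
  template:
    metadata:
      labels:
        app: kubecon-demo
        component: marker-pod
    spec:
      containers:
      - name: marker
        image: registry.k8s.io/pause:3.9   # assumed image
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: base-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubecon-demo
      component: base-workload             # assumed label value
  template:
    metadata:
      labels:
        app: kubecon-demo
        component: base-workload
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                component: marker-pod      # co-locate with the marker pod
            topologyKey: kubernetes.io/hostname
      containers:
      - name: base
        image: registry.k8s.io/pause:3.9   # assumed image
        resources:
          requests:
            memory: 256Mi                  # 256 MiB per replica, as described below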

After applying this manifest you'll want to scale the base-workload deployment. The number of replicas depends on your node size -- each replica requests 256 MiB of memory, and you want to scale the deployment so that just under 1 GiB (and at least 512 MiB) of allocatable memory is left unrequested on the node running these pods. To find the node they're running on:

$ kubectl get pods -o jsonpath='{$.items[*].spec.nodeName}{"\n"}' \
    -l app=kubecon-demo,component=marker-pod

Then to find out the amount of free memory on the node:

$ kubectl describe node [your node name] | grep -A 7 "Allocated resources"

And finally, to scale the deployment:

$ kubectl scale deployment base-workload --replicas [number of replicas]
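
For example (illustrative numbers only): if the node shows about 3800 MiB allocatable and the "Allocated resources" section shows roughly 900 MiB of memory already requested by the marker pod and system pods, about 2900 MiB is still free. Scaling base-workload to 8 replicas requests a further 8 × 256 MiB = 2048 MiB, leaving roughly 850 MiB -- under 1 GiB but above 512 MiB, which is what we're aiming for.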

Deploy the high-priority workload with affinity

The second deployment is the high-priority pod that will initially fail to schedule:

$ kubectl apply -f high-priority.yaml
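
Again, a sketch rather than the real file: where the kubecon-demo PriorityClass is defined, the affinity target, the image, and the 1 GiB request are assumptions on my part (the class name and priority value come from the describe output below).

# Sketch of high-priority.yaml -- PriorityClass placement, affinity target,
# image, and the 1Gi request are assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: kubecon-demo                       # deleted in the cleanup step
value: 20241113                            # matches the Priority in the describe output
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: high-priority
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubecon-demo
      component: high-priority
  template:
    metadata:
      labels:
        app: kubecon-demo
        component: high-priority
    spec:
      priorityClassName: kubecon-demo
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                component: base-workload   # assumed: affinity to the low-priority pods
            topologyKey: kubernetes.io/hostname
      containers:
      - name: high-priority
        image: registry.k8s.io/pause:3.9   # assumed image
        resources:
          requests:
            memory: 1Gi                    # assumed: larger than the gap left on the node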

Observe the preemption failure

At this point, list the pods:

$ kubectl get pods -l app=kubecon-demo

You should notice that everything but the high-priority workload has scheduled; for example:
NAME                             READY   STATUS    RESTARTS   AGE
base-workload-9c4b4bf97-5mqvx    1/1     Running   0          5m20s
base-workload-9c4b4bf97-5rjf5    1/1     Running   0          5m20s
base-workload-9c4b4bf97-g7l86    1/1     Running   0          5m20s
base-workload-9c4b4bf97-rlqgj    1/1     Running   0          5m20s
base-workload-9c4b4bf97-t27kk    1/1     Running   0          5m2s
base-workload-9c4b4bf97-v8mw8    1/1     Running   0          8m40s
high-priority-569db858cc-q6gxd   0/1     Pending   0          7s   
marker-pod-56d89b6d94-rw9zw      1/1     Running   0          8m41s
	
Describing the high-priority pod should tell you that preemption failed:

$ kubectl describe pods -l app=kubecon-demo,component=high-priority
Name:                 high-priority-569db858cc-q6gxd
Namespace:            default
Priority:             20241113
Priority Class Name:  kubecon-demo
Service Account:      default
[...]
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  6m18s  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 node(s) didn't match pod affinity rules.
  Warning  FailedScheduling  60s    default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 node(s) didn't match pod affinity rules.
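
What's happening: before evicting anything, the scheduler simulates removing the lower-priority pods from the node and checks whether the pending pod would then schedule. Because the high-priority pod's affinity rule points at those same lower-priority pods, the node no longer satisfies the affinity once they're (hypothetically) gone, so the scheduler concludes preemption wouldn't help and evicts nothing.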

Resolve the issue

Resolution 1: Resize the high-priority workload

The easiest way to resolve this is to downsize the high-priority workload to fit in the available space. The file high-priority-resized.yaml contains a version of the deployment that only requests 512 MiB of memory:

$ kubectl delete deployment -l app=kubecon-demo,component=high-priority
$ kubectl apply -f high-priority-resized.yaml

(We could also downsize the base workload, but the math is easier for resizing just one pod.)
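
Presumably the only material change from high-priority.yaml is the memory request on the container, along these lines:

        resources:
          requests:
            memory: 512Mi                  # down from the (assumed) 1Gi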

Resolution 2: Remove the pod affinity from the high-priority workload

If there's no affinity tying the high-priority pod to the lower-priority ones, some of them can now be preempted because preempting all of them no longer makes the high-priority pod unschedulable on the node -- or, if there are multiple nodes, the high-priority pod can now land on another node. The file high-priority-no-affinity.yaml contains a version of the deployment with no pod affinity:

$ kubectl delete deployment -l app=kubecon-demo,component=high-priority
$ kubectl apply -f high-priority-no-affinity.yaml

Non-resolution: Redeploy the low-priority app with higher priority

In my KubeCon talk I mentioned in passing the idea of raising the priority of the base workload. At the time I was thinking off the cuff that if the base workload has the same priority as the pod we're trying to deploy, it no longer falls into the "all lower-priority pods" bucket, and some of those pods could then be preempted to make room for our high-priority workload -- but of course pods of equal priority won't be preempted! (Note that even if this worked, it would put the remaining lower-priority pods on the node at risk of preemption.) The file base-workload-high-priority.yaml contains a version of the deployment with the same priority as the high-priority workload:

$ kubectl delete deployment -l app=kubecon-demo,component=high-priority
$ kubectl delete deployment -l app=kubecon-demo,component=base-workload
$ kubectl apply -f base-workload-high-priority.yaml
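
The interesting part of base-workload-high-priority.yaml should just be the priority class on the base-workload pod template, something like this fragment (whether the marker pod gets the same treatment I'll leave as an assumption):

  template:
    spec:
      priorityClassName: kubecon-demo      # same class the high-priority pod uses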

Scale the deployment as you did before, then deploy the high-priority pod:

$ kubectl apply -f high-priority.yaml

Note that nothing is preempted, though the events on the pending pod are slightly different:

$ kubectl describe po -l app=kubecon-demo,component=high-priority
[...]
  Warning  FailedScheduling  6m19s  default-scheduler  0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 Insufficient memory.
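
The reason for the different message: the base-workload pods now have the same priority as the incoming pod, so the preemption simulation can't consider them as victims at all, and evicting whatever is left at lower priority doesn't free enough memory -- so the simulated preemption now fails on memory rather than on the affinity rules.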

Verify things are working

After implementing your chosen solution, repeat the pod listing:

$ kubectl get pods -l app=kubecon-demo

You should notice that the high-priority pod is now scheduled. (Note that one or more of the base-workload pods may be Pending depending on the solution you implemented and how many nodes you have.)

Clean up

Remove all the resources deployed in the demo:

$ kubectl delete deployment -l app=kubecon-demo
$ kubectl delete priorityclass kubecon-demo