The seemingly simple behavior of the kube-scheduler has some surprises in store
It's not, however, without its own logic -- you just have to understand what that logic is
Shipping Freight: The Very Basics of Resources
Requests and limits
Requests are a floor for resource allocation -- the minimums that must be satisfied to schedule the pod
Limits are a ceiling of what the pod's containers will be allowed to consume
Requested resources that go unused still count against the node's allocatable capacity -- they're wasted
Resources are compressible or incompressible
Exceeding limits on incompressible resources like RAM triggers container termination
Containers that try to exceed compressible resource limits like CPU are throttled
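A minimal sketch of how this looks in a pod spec (the names, image, and numbers are placeholders, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: freight-app                    # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/freight-app:1.0   # placeholder image
    resources:
      requests:                        # floor: the scheduler only places this pod on a node
        cpu: "250m"                    # with at least this much unreserved CPU and memory
        memory: "256Mi"
      limits:                          # ceiling: CPU above 500m is throttled (compressible);
        cpu: "500m"                    # going above 512Mi gets the container OOM-killed
        memory: "512Mi"                # (incompressible)
```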
Quality of service
Quality of service (QoS) is set for you based on requests and limits
In normal operation, QoS has no effect* on how the pod runs
Pods can be BestEffort, Burstable, or Guaranteed*
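For reference, the classes are derived roughly like this (the fragments below are illustrative, not a complete spec):

```yaml
# Guaranteed: every container sets limits equal to requests for both CPU and memory
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits:   { cpu: "500m", memory: "512Mi" }

# Burstable: at least one container sets some requests or limits,
# but not the full Guaranteed shape
resources:
  requests: { cpu: "100m" }

# BestEffort: no container sets any requests or limits at all
```

You can see what class a running pod was assigned with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.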
Priority
Priority is user-controllable and can be positive, negative or zero
Normally referred to by PriorityClass name rather than by raw number
By default there are two very-high-priority classes, system-node-critical and system-cluster-critical, reserved for critical system services
Pods get a priority of zero by default
Like QoS, priority doesn't affect normal pod operation
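A sketch of a custom class and a pod that uses it (the class name, value, and image are invented for illustration):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low                          # hypothetical class name
value: -100                                # below the default of 0
globalDefault: false
preemptionPolicy: PreemptLowerPriority     # the default; Never stops this class from preempting others
description: "Low-priority batch work that can wait"
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  priorityClassName: batch-low             # referenced by name, as noted above
  containers:
  - name: worker
    image: registry.example.com/batch:1.0  # placeholder image
```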
Sounds easy, right?
Well, first it's going to get complicated
Then it's going to get weird
Too Much Freight: Clusters Under Pressure
Things get complicated: Eviction and Preemption
Eviction is triggered by the kubelet on its own when the node comes under pressure*
* There is also an API-initiated eviction mechanism (it's what kubectl drain uses)
Preemption is triggered by the scheduler when a high-priority pod can't be scheduled normally
Things get complicated: Scheduling is not a promise
Sometimes a pod that triggered preemption still doesn't get immediately scheduled if a higher-priority pod hits the queue first
All we are promised about a pod that is evicted or preempted is that it will go back in the scheduling queue
It may stay in the queue arbitrarily long, depending on what else is going on in the cluster
You can often buy your way out of this with some form of autoscaling
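One concrete way to watch this happen: when the scheduler preempts on behalf of a pending pod, it records its intent in that pod's `status.nominatedNodeName` field, but that's a hint, not a reservation -- a higher-priority arrival can still take the spot. Checking a stuck pod with `kubectl get pod <name> -o jsonpath='{.status.nominatedNodeName}'` (pod name hypothetical) at least tells you whether preemption was attempted on its behalf.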
Things get weird: "Better" requests may be worse
Being more frugal with your requests can get your pods evicted more often
Pods that exceed their requests are first out the door when eviction is triggered
Setting your requests to the absolute floor can backfire
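A hedged illustration of the trap (the numbers are invented):

```yaml
# Suppose the container routinely uses about 400Mi of memory.
resources:
  requests:
    memory: "128Mi"    # lowball request: actual usage exceeds it, so under node
                       # pressure this pod sorts toward the front of the eviction line
  limits:
    memory: "1Gi"
# Requesting something close to real usage (say 512Mi) looks less "efficient" on paper,
# but it keeps the pod out of the first wave of evictions.
```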
Things get weird: Preemption's Don Draper moment
Eviction's selection logic looks at both priority and usage versus requests (which tracks QoS), but preemption considers only priority -- QoS never enters into it
A high-priority BestEffort pod may survive preemption while a low-priority Guaranteed one is booted
Things get weird: Preemption math
Before choosing pods to preempt on a candidate node, the scheduler asks "If all lower-priority pods were terminated, could the queued pod be scheduled?"
If constraints like pod affinity or anti-affinity mean the pending pod still couldn't be placed even with the lower-priority pods gone, the scheduler won't preempt anything on that node
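A rough worked example (all numbers invented): a node has 4 CPU allocatable; pods of equal or higher priority already request 2.5 CPU; lower-priority pods request 1 CPU; the pending pod requests 2 CPU. Even with every lower-priority pod gone, only 4 - 2.5 = 1.5 CPU would be free, so the pending pod still wouldn't fit and the scheduler preempts nothing on that node.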
Things get weird: Eviction's Don Draper moment
Preemption will try to respect PDBs (though it will go ahead and violate them if it has to)
Node-pressure eviction doesn't care about PDBs or even terminationGracePeriodSeconds
API-initiated eviction does respect PDBs
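That API-initiated path is an Eviction object posted to the pod's eviction subresource -- the mechanism kubectl drain uses. A sketch (pod name hypothetical):

```yaml
apiVersion: policy/v1
kind: Eviction
metadata:
  name: freight-app        # the pod to evict
  namespace: default
```

If granting the eviction would violate a PDB, the API refuses it (HTTP 429) instead of killing the pod -- a courtesy node-pressure eviction does not extend.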
Freight Expediting: How To Manage the Mess
Keep it simple
Make sure you can reason about the pods in your cluster and how they will get scheduled
Don't go wild creating new PriorityClasses or overly complex affinity rules
Consider whether your resource management needs dictate separate node groups or even separate clusters
Use PDBs, but don't count on them
We've already seen instances in this talk where PDBs are not respected
PDBs won't save you from involuntary disruptions
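A minimal PDB sketch (the label and count are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: freight-app-pdb
spec:
  minAvailable: 2            # or use maxUnavailable instead -- pick one, not both
  selector:
    matchLabels:
      app: freight-app       # hypothetical label
```

This guards against voluntary disruptions like drains and API-initiated evictions; node-pressure eviction, preemption with no other option, and hardware failures will not consult it.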
Tune eviction thresholds on your nodes
BE VERY CAREFUL DOING THIS
Eviction thresholds are set conservatively by default, and if you tune them poorly you can hurt node availability
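If you do go down this road, the knobs live in the kubelet configuration; the values here are illustrative only, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"       # hard thresholds: evict immediately, no grace period
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"       # soft thresholds: evict only after the grace period below
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 60     # seconds pods get to shut down during a soft eviction
```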
Test your clever ideas thoroughly
Everybody has a test environment
If you don't have one, yes you do -- where is it?
If all else fails, throw money at the problem
Remember, in normal operation, none* of this matters and everything has enough resources to schedule
The Karpenter autoscaler will actually try to choose nodes based on your workloads' needs
Bear in mind that "cluster autoscaling is connected directly to your credit card"
A New Era Arrives: What's New and What's Coming
Some things that matter for this are changing
In-place pod resizing (sketch below)
Dynamic resource allocation
Live pod migration?!
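Of those, in-place pod resizing already has visible API surface: a per-container resizePolicy that says whether changing a resource restarts the container. A sketch, with the caveat that this is still feature-gated (InPlacePodVerticalScaling) and details may shift:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resizable-app                    # hypothetical
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0  # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired         # CPU can change in place
    - resourceName: memory
      restartPolicy: RestartContainer    # memory changes restart this container
    resources:
      requests: { cpu: "250m", memory: "256Mi" }
      limits:   { cpu: "500m", memory: "512Mi" }
```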
Model Railroading: A Quick Demo
Sign For Delivery, Please: Wrapping up
Final thoughts
Don't Panic
When you run into stuff that seems weird, it's probably because it is weird
Begin with the assumption that you're going to figure it out (although that might not mean fixing it)