The seemingly simple behavior of the kube-scheduler has some surprises in store
It's not, however, without its own logic -- you just have to understand what that logic is
Shipping Freight: The Very Basics of Resources
Requests and limits
Requests are a floor for resource allocation -- the minimums that must be satisfied to schedule the pod
Limits are a ceiling of what the pod's containers will be allowed to consume
Requested resources that go unused still count against the node's allocatable capacity -- they're wasted
Resources are compressible or incompressible
Exceeding limits on incompressible resources like RAM triggers container termination
Containers that try to exceed compressible resource limits like CPU are throttled
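A minimal sketch of how this looks in a pod spec (the names, image, and numbers are placeholders, not recommendations):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: freight-app                    # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/freight-app:1.0   # placeholder image
    resources:
      requests:                        # floor: the scheduler only places this pod on a node
        cpu: "250m"                    # with at least this much unreserved CPU and memory
        memory: "256Mi"
      limits:                          # ceiling: CPU above 500m is throttled (compressible);
        cpu: "500m"                    # going above 512Mi gets the container OOM-killed
        memory: "512Mi"                # (incompressible)
```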
Quality of service
Quality of service (QoS) is set for you based on requests and limits
In normal operation, QoS has no effect* on how the pod runs
Pods can be BestEffort, Burstable, or Guaranteed*
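For reference, the classes are derived roughly like this (the fragments below are illustrative, not a complete spec):

```yaml
# Guaranteed: every container sets limits equal to requests for both CPU and memory
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits:   { cpu: "500m", memory: "512Mi" }

# Burstable: at least one container sets some requests or limits,
# but not the full Guaranteed shape
resources:
  requests: { cpu: "100m" }

# BestEffort: no container sets any requests or limits at all
```

You can see what class a running pod was assigned with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.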
Priority
Priority is user-controllable and can be positive, negative or zero
Normally referred to by PriorityClass name rather than by raw number
By default there are two very-high-priority classes, system-node-critical and system-cluster-critical, reserved for critical system services
Pods get a priority of zero by default
Like QoS, priority doesn't affect normal pod operation
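A sketch of a custom class and a pod that uses it (the class name, value, and image are invented for illustration):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low                          # hypothetical class name
value: -100                                # below the default of 0
globalDefault: false
preemptionPolicy: PreemptLowerPriority     # the default; Never stops this class from preempting others
description: "Low-priority batch work that can wait"
---
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  priorityClassName: batch-low             # referenced by name, as noted above
  containers:
  - name: worker
    image: registry.example.com/batch:1.0  # placeholder image
```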
Sounds easy, right?
Well, first it's going to get complicated
Then it's going to get weird
Too Much Freight: Clusters Under Pressure
Things get complicated: Eviction and Preemption
Eviction is triggered by the kubelet on its own when the node comes under pressure*
* There is also an API-initiated eviction mechanism (it's what kubectl drain uses)
Preemption is triggered by the scheduler when a high-priority pod can't be scheduled normally
Things get complicated: Scheduling is not a promise
Sometimes a pod that triggered preemption still doesn't get immediately scheduled if a higher-priority pod hits the queue first
All we are promised about a pod that is evicted or preempted is that it will go back in the scheduling queue
It may stay in the queue arbitrarily long, depending on what else is going on in the cluster
You can often buy your way out of this with some form of autoscaling
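One concrete way to watch this happen: when the scheduler preempts on behalf of a pending pod, it records its intent in that pod's `status.nominatedNodeName` field, but that's a hint, not a reservation -- a higher-priority arrival can still take the spot. Checking a stuck pod with `kubectl get pod <name> -o jsonpath='{.status.nominatedNodeName}'` (pod name hypothetical) at least tells you whether preemption was attempted on its behalf.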
Things get weird: "Better" requests may be worse
Being more frugal with your requests can get your pods evicted more often
Pods that exceed their requests are first out the door when eviction is triggered
Setting your requests to the absolute floor can backfire
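A hedged illustration of the trap (the numbers are invented):

```yaml
# Suppose the container routinely uses about 400Mi of memory.
resources:
  requests:
    memory: "128Mi"    # lowball request: actual usage exceeds it, so under node
                       # pressure this pod sorts toward the front of the eviction line
  limits:
    memory: "1Gi"
# Requesting something close to real usage (say 512Mi) looks less "efficient" on paper,
# but it keeps the pod out of the first wave of evictions.
```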
Things get weird: Preemption's Don Draper moment
Eviction's selection logic looks at both priority and usage versus requests (which tracks QoS), but preemption considers only priority -- QoS never enters into it
A high-priority BestEffort pod may survive preemption while a low-priority Guaranteed one is booted
Things get weird: Preemption math
Before choosing pods to preempt on a candidate node, the scheduler asks "If all lower-priority pods were terminated, could the queued pod be scheduled?"
If constraints like pod affinity or anti-affinity mean the pending pod still couldn't be placed even with the lower-priority pods gone, the scheduler won't preempt anything on that node
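A rough worked example (all numbers invented): a node has 4 CPU allocatable; pods of equal or higher priority already request 2.5 CPU; lower-priority pods request 1 CPU; the pending pod requests 2 CPU. Even with every lower-priority pod gone, only 4 - 2.5 = 1.5 CPU would be free, so the pending pod still wouldn't fit and the scheduler preempts nothing on that node.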
Things get weird: Eviction's Don Draper moment
Preemption will try to respect PDBs (though it will go ahead and violate them if it has to)
Node-pressure eviction doesn't care about PDBs or even terminationGracePeriodSeconds
API-initiated eviction does respect PDBs
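That API-initiated path is an Eviction object posted to the pod's eviction subresource -- the mechanism kubectl drain uses. A sketch (pod name hypothetical):

```yaml
apiVersion: policy/v1
kind: Eviction
metadata:
  name: freight-app        # the pod to evict
  namespace: default
```

If granting the eviction would violate a PDB, the API refuses it (HTTP 429) instead of killing the pod -- a courtesy node-pressure eviction does not extend.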
Freight Expediting: How To Manage the Mess
Keep it simple
Make sure you can reason about the pods in your cluster and how they will get scheduled
Don't go wild creating new PriorityClasses or overly complex affinity rules
Consider whether your resource management needs dictate separate node groups or even separate clusters
Use PDBs, but don't count on them
We've already seen instances in this talk where PDBs are not respected
PDBs won't save you from involuntary disruptions
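A minimal PDB sketch (the label and count are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: freight-app-pdb
spec:
  minAvailable: 2            # or use maxUnavailable instead -- pick one, not both
  selector:
    matchLabels:
      app: freight-app       # hypothetical label
```

This guards against voluntary disruptions like drains and API-initiated evictions; node-pressure eviction, preemption with no other option, and hardware failures will not consult it.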
Tune eviction thresholds on your nodes
BE VERY CAREFUL DOING THIS
Eviction thresholds are set conservatively by default, and if you tune them poorly you can hurt node availability
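If you do go down this road, the knobs live in the kubelet configuration; the values here are illustrative only, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"       # hard thresholds: evict immediately, no grace period
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi"       # soft thresholds: evict only after the grace period below
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionMaxPodGracePeriod: 60     # seconds pods get to shut down during a soft eviction
```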
Test your clever ideas thoroughly
Everybody has a test environment
If you don't have one, yes you do -- where is it?
If all else fails, throw money at the problem
Remember, in normal operation, none* of this matters and everything has enough resources to schedule
The Karpenter autoscaler will actually try to choose nodes based on your workloads' needs
Bear in mind that "cluster autoscaling is connected directly to your credit card"
A New Era Arrives: What's New and What's Coming
Some things that matter for this are changing
In-place pod resizing (sketch below)
Dynamic resource allocation
Live pod migration?!
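Of those, in-place pod resizing already has visible API surface: a per-container resizePolicy that says whether changing a resource restarts the container. A sketch, with the caveat that this is still feature-gated (InPlacePodVerticalScaling) and details may shift:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resizable-app                    # hypothetical
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0  # placeholder image
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired         # CPU can change in place
    - resourceName: memory
      restartPolicy: RestartContainer    # memory changes restart this container
    resources:
      requests: { cpu: "250m", memory: "256Mi" }
      limits:   { cpu: "500m", memory: "512Mi" }
```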
Model Railroading: A Quick Demo
Sign For Delivery, Please: Wrapping up
Final thoughts
Don't Panic
When you run into stuff that seems weird, it's probably because it is weird
Begin with the assumption that you're going to figure it out (although that might not mean fixing it)