Behind Schedule: Pod Resource Configuration from Beginning to... Huh?

Joe Thompson, Clarity Business Solutions

Getting Into the Railroad Business: Intro and Goals

Before we start, a PSA: it's dry out there

Utah is one of the driest areas of the US on average

SLC is also 4300 feet above sea level

HYDRATE like your life depends on it!

There are usually water bottles available as booth swag or at the CNCF store

Who's that at the podium?

Almost 30 years in IT (Kubernetes-enabled since 2015)

Past employers: HashiCorp, Mesosphere, CoreOS, Red Hat, among others; currently a Consulting Engineer for Clarity Business Solutions

Pronouns: he/him
Blood type: Caffeine-positive
Pop-culture references center of gravity: around 1989

How to get in touch:

What are we here to talk about?
  • The seemingly-simple behavior of the kube-scheduler has some surprises in store
  • It's not, however, without its own logic -- you just have to understand what that logic is

Shipping Freight: The Very Basics of Resources

Requests and limits
  • Requests are a floor for resource allocation -- the minimums that must be satisfied to schedule the pod (example manifest below)
  • Limits are a ceiling on what the pod's containers will be allowed to consume
  • Requested resources that are unused are wasted
  • Resources are compressible or incompressible
    • Exceeding limits on incompressible resources like RAM triggers container termination
    • Containers that try to exceed compressible resource limits like CPU are throttled
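  • A minimal illustration of requests and limits on a container (the name, image, and values here are made up):
      apiVersion: v1
      kind: Pod
      metadata:
        name: freight-car
      spec:
        containers:
        - name: app
          image: registry.example.com/app:1.0
          resources:
            requests:            # floor: the scheduler only places the pod where this much is free
              cpu: "250m"
              memory: "256Mi"
            limits:              # ceiling: CPU over this is throttled, memory over this kills the container
              cpu: "500m"
              memory: "512Mi"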
Quality of service
  • Quality of service (QoS) is set for you based on requests and limits
  • In normal operation, it has no effect* on the operation of the pod
  • Pods can be BestEffort, Burstable, or Guaranteed* (examples below)
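  • How the class falls out of requests and limits -- container-level resources fragments with made-up values (the class is computed across all containers in the pod):
      # Guaranteed: limits set, and requests equal limits, for every container
      resources:
        requests: { cpu: "500m", memory: "512Mi" }
        limits:   { cpu: "500m", memory: "512Mi" }

      # Burstable: at least one request or limit somewhere, but not Guaranteed
      resources:
        requests: { cpu: "100m" }
        limits:   { cpu: "500m" }

      # BestEffort: no requests or limits on any container in the pod
      resources: {}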
Priority
  • Priority is user-controllable and can be positive, negative or zero
  • Normally referred to by name rather than number
  • By default there are two built-in very-high-priority classes, system-node-critical and system-cluster-critical, for node-critical and cluster-critical services
  • Pods get a priority of zero by default
  • Like QoS, priority doesn't affect normal pod operation (sample PriorityClass below)
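  • A PriorityClass is a small cluster-scoped object that pods reference by name; the name and value below are hypothetical:
      apiVersion: scheduling.k8s.io/v1
      kind: PriorityClass
      metadata:
        name: batch-low
      value: -100                 # negative values are allowed; this sorts behind the default of 0
      globalDefault: false
      description: "Batch work that can wait"
      ---
      # Pods opt in via priorityClassName:
      apiVersion: v1
      kind: Pod
      metadata:
        name: batch-job
      spec:
        priorityClassName: batch-low
        containers:
        - name: worker
          image: registry.example.com/batch:1.0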
Sounds easy, right?
  • Well, first it's going to get complicated
  • Then it's going to get weird

Too Much Freight: Clusters Under Pressure

Things get complicated: Eviction and Preemption
  • Eviction is triggered by the kubelet on its own when the node comes under pressure*
  • * There is also API-initiated eviction -- the Eviction API used by kubectl drain (sketched below)
  • Preemption is triggered by the scheduler when a high-priority pod can't be scheduled normally
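  • The API-initiated flavor works by POSTing an Eviction object at the pod's eviction subresource; a minimal sketch with made-up names:
      apiVersion: policy/v1
      kind: Eviction
      metadata:
        name: freight-car        # the pod to evict
        namespace: default
      # POST to /api/v1/namespaces/default/pods/freight-car/eviction;
      # the request is rejected if it would violate a PodDisruptionBudget.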
Things get complicated: Scheduling is not a promise
  • Sometimes a pod that triggered preemption still doesn't get immediately scheduled if a higher-priority pod hits the queue first
  • All we are promised about a pod that is evicted or preempted is that it will go back in the scheduling queue
  • It may stay in the queue arbitrarily long, depending on what else is going on in the cluster
  • You can often buy your way out of this with some form of autoscaling
Things get weird: "Better" requests may be worse
  • Being stingier with your requests -- asking for less than you'll actually use -- can get your pods evicted more often
  • Pods that exceed their requests are first out the door when eviction is triggered
  • Setting your requests to the absolute floor can backfire (example below)
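  • A hypothetical pair to make the point -- both pods actually use about 300Mi of memory at runtime:
      apiVersion: v1
      kind: Pod
      metadata:
        name: stingy
      spec:
        containers:
        - name: app
          image: registry.example.com/app:1.0
          resources:
            requests:
              memory: "64Mi"     # absolute-floor request; real usage blows past it
            limits:
              memory: "512Mi"
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: realistic
      spec:
        containers:
        - name: app
          image: registry.example.com/app:1.0
          resources:
            requests:
              memory: "384Mi"    # close to observed usage
            limits:
              memory: "512Mi"
      # Under memory pressure the kubelet ranks pods running over their requests first,
      # so "stingy" goes out the door before "realistic" despite identical usage.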
Things get weird: Preemption's Don Draper moment
  • Eviction uses priority as part of its selection logic, but preemption doesn't consider QoS at all -- it looks only at priority
  • A high-priority BestEffort pod may survive preemption while a low-priority Guaranteed one is booted
Things get weird: preemption math
  • Before choosing pods to preempt on a candidate node, the scheduler asks "If all lower-priority pods were terminated, could the queued pod be scheduled?" (worked example below)
  • When other scheduling mechanisms like pod affinity or anti-affinity are in play, the scheduler may still not preempt pods
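  • A worked example with made-up numbers, on a node with 4 CPU allocatable:
      running:  pod-a   priority 1000   requests 2 CPU
                pod-b   priority    0   requests 1 CPU
                pod-c   priority    0   requests 1 CPU
      queued:   pod-q   priority  500   requests 2 CPU

      "If every pod below priority 500 went away, would pod-q fit?"
      Removing pod-b and pod-c frees 2 CPU, so yes -- the node is a candidate,
      and the scheduler then preempts only the victims it actually needs.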
Things get weird: Eviction's Don Draper moment
  • Preemption will try to respect PDBs (it will go ahead and violate them if it has to though)
  • Node-pressure eviction doesn't care about PDBs or even terminationGracePeriodSeconds
    • API-initiated eviction does respect PDBs

Freight Expediting: How To Manage the Mess

Keep it simple
  • Make sure you can reason about the pods in your cluster and how they will get scheduled
  • Don't go wild creating new PriorityClasses or overly-complex affinity rules
  • Consider whether your resource management needs dictate separate node groups or even separate clusters
Use PDBs, but don't count on them
  • We've already seen instances in this talk where PDBs are not respected
  • PDBs won't save you from involuntary disruptions (minimal example below)
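  • A minimal PDB for illustration (name, label, and count are made up) -- remember it only constrains voluntary disruptions like drains and API-initiated evictions:
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: freight-app-pdb
      spec:
        minAvailable: 2          # alternatively maxUnavailable, but not both
        selector:
          matchLabels:
            app: freight-app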
Tune eviction thresholds on your nodes
  • BE VERY CAREFUL DOING THIS
  • Eviction thresholds are set conservatively by default; tune them poorly and you can jeopardize the availability of the node itself (sample KubeletConfiguration below)
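  • The thresholds live in the kubelet's configuration; this is a sketch only, and the values are illustrative rather than recommendations:
      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      evictionHard:
        memory.available: "200Mi"      # evict immediately below this
        nodefs.available: "10%"
      evictionSoft:
        memory.available: "500Mi"      # evict once the grace period below expires
      evictionSoftGracePeriod:
        memory.available: "1m30s"
      evictionMaxPodGracePeriod: 30    # cap (in seconds) on pod grace periods for soft evictions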
Test your clever ideas thoroughly
  • Everybody has a test environment
  • If you don't have one, yes you do -- where is it?
If all else fails, throw money at the problem
  • Remember, in normal operation, none* of this matters and everything has enough resources to schedule
  • The Karpenter autoscaler will actually try to choose nodes based on your workloads' needs (rough NodePool sketch below)
  • Bear in mind that "cluster autoscaling is connected directly to your credit card"
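  • For illustration only, a rough sketch of a Karpenter NodePool with a spending guardrail -- field names vary between Karpenter versions, so treat this as a shape rather than a recipe:
      apiVersion: karpenter.sh/v1
      kind: NodePool
      metadata:
        name: general
      spec:
        template:
          spec:
            requirements:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["on-demand"]
            nodeClassRef:              # cloud-specific node class, e.g. an AWS EC2NodeClass
              group: karpenter.k8s.aws
              kind: EC2NodeClass
              name: default
        limits:
          cpu: "64"                    # the credit-card guardrail: cap total CPU Karpenter will provision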

A New Era Arrives: What's New and What's Coming

Some things that matter for this are changing
  • In-place pod resizing (sketch below)
  • Dynamic resource allocation
  • Live pod migration?!
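  • In-place resize adds a per-container resizePolicy and lets you patch a running pod's resources (still behind the InPlacePodVerticalScaling feature gate in many releases); a sketch with made-up values:
      apiVersion: v1
      kind: Pod
      metadata:
        name: resizable
      spec:
        containers:
        - name: app
          image: registry.example.com/app:1.0
          resizePolicy:
          - resourceName: cpu
            restartPolicy: NotRequired        # CPU changes apply without a restart
          - resourceName: memory
            restartPolicy: RestartContainer   # memory changes restart this container
          resources:
            requests: { cpu: "250m", memory: "256Mi" }
            limits:   { cpu: "500m", memory: "512Mi" }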

Model Railroading: A Quick Demo

Sign For Delivery, Please: Wrapping up

Final thoughts
  • Don't Panic
  • When you run into stuff that seems weird, it's probably because it is weird
  • Begin with the assumption that you're going to figure it out (although that might not mean fixing it)
Further reading and resources
Thank you!
Feedback: