Killer Coda is a well-known platform that hosts interactive environments for studying cloud native technologies. While doing their CKA scenarios, I found an intriguing one called Scheduling Priority.
Since I was not familiar with Pod Priority or
PriorityClass concepts at the time, I did the usual - searched for them in the Kubernetes docs.
At the top of the Pod Priority and Preemption page, we can see a red warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priority, causing other Pods to be evicted/not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities.
See limit Priority Class consumption by default for details.
This message got my attention, and I wanted to see it in action.
The feature’s name says it all: Pod Priority is a way to give more importance to some
Pods than others.
PriorityClasses manage the different levels of priority.
PriorityClass definition looks like this:
apiVersion: scheduling.k8s.io/v1 kind: PriorityClass metadata: name: high-priority preemptionPolicy: PreemptLowerPriority value: 1000 globalDefault: false description: This priority class should be used for X pods only.
To assign priority to a
spec of the
Pod must contain the field
priorityClassName with the correspondent
PriorityClass, just as below:
apiVersion: v1 kind: Pod metadata: labels: run: test name: test spec: containers: - image: busybox name: test restartPolicy: Always priorityClassName: high-priority
It’s also possible to configure one
globalDefault: true. After that, all new
Pods without an explicit
priorityClassName will be mutated1 and receive the default priority of the cluster.
After this brief introduction, we are ready to move to the following hands-on sections! 🧪
All commands and configuration files are available in my GitHub repository.
Spin up a cluster, explore and create a deployment⌗
Let’s start by spinning up a fresh Kubernetes cluster locally with minikube:
With the output of the following command, we can conclude that our cluster has one node:
kubectl get nodes
NAME STATUS ROLES AGE VERSION minikube Ready control-plane 5s v1.24.3
That node has 4 CPU cores allocatable with 18% already requested:
kubectl describe node minikube
# truncated Allocatable: cpu: 4 ephemeral-storage: 61255492Ki hugepages-1Gi: 0 hugepages-2Mi: 0 hugepages-32Mi: 0 hugepages-64Ki: 0 memory: 8039920Ki pods: 110 # truncated Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 750m (18%) 0 (0%) memory 170Mi (2%) 170Mi (2%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) hugepages-32Mi 0 (0%) 0 (0%) hugepages-64Ki 0 (0%) 0 (0%) # truncated
For this scenario, we will create a virtuous deployment with 2 CPU cores as requests to simulate a well-behaved 👼 application. Notice that we don’t assign a
PriorityClass to it:
kubectl apply -f virtuous-deploy.yaml
apiVersion: apps/v1 kind: Deployment metadata: labels: app: virtuous name: virtuous spec: replicas: 1 selector: matchLabels: app: virtuous template: metadata: labels: app: virtuous spec: containers: - command: - sh - -c - sleep 1d image: busybox name: virtuous resources: requests: cpu: 2
Speaking of which, do we have
PriorityClasses in the cluster?
kubectl get priorityclass
NAME VALUE GLOBAL-DEFAULT AGE system-cluster-critical 2000000000 false 32s system-node-critical 2000001000 false 32s
Yes, we do. Two PriorityClasses:
system-node-critical, with the latter being the one with the highest priority. Let’s see if we have
PriorityClasses specified in our cluster:
kubectl get pods --all-namespaces -o custom-columns=POD:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PRIORITY:.spec.priority,CPU_REQUESTS:'.spec.containers[*].resources.requests.cpu'
POD PRIORITY_CLASS PRIORITY CPU_REQUESTS virtuous-7bc68f75fb-xpjz2 <none> 0 2 coredns-6d4b75cb6d-htbdx system-cluster-critical 2000000000 100m etcd-minikube system-node-critical 2000001000 100m kube-apiserver-minikube system-node-critical 2000001000 250m kube-controller-manager-minikube system-node-critical 2000001000 200m kube-proxy-2zgz7 system-node-critical 2000001000 <none> kube-scheduler-minikube system-node-critical 2000001000 100m storage-provisioner <none> 0 <none>
We have two
PriorityClass and two
Pods without CPU requests. We also have one
Pod with the lowest
PriorityClass defined (
system-cluster-critical). Since we didn’t specify a
PriorityClass for the virtuous
Pod, its priority is zero.
Time for the malicious user to get some action. 😈
If we describe our node again:
kubectl describe node minikube
#truncated Allocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 2750m (68%) 0 (0%) memory 170Mi (2%) 170Mi (2%) ephemeral-storage 0 (0%) 0 (0%) hugepages-1Gi 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%) hugepages-32Mi 0 (0%) 0 (0%) hugepages-64Ki 0 (0%) 0 (0%) #truncated
Making the math (4-2.75), we only have 1.25 cores of CPU available to be requested. What happens if we request 3.3 cores of CPU and use the highest
PriorityClass in the cluster? 🐒
kubectl apply -f evil-deploy.yaml
apiVersion: apps/v1 kind: Deployment metadata: labels: app: evil name: evil spec: replicas: 1 selector: matchLabels: app: evil template: metadata: labels: app: evil spec: containers: - command: - sh - -c - sleep 1d image: busybox name: evil resources: requests: cpu: 3.3 priorityClassName: system-node-critical
The result is:
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE default evil-7568b8bbf5-4fq5d 1/1 Running 0 37s default virtuous-7bc68f75fb-pf95j 0/1 Pending 0 37s kube-system coredns-6d4b75cb6d-85x7r 0/1 Pending 0 37s kube-system etcd-minikube 1/1 Running 0 114s kube-system kube-apiserver-minikube 1/1 Running 0 114s kube-system kube-controller-manager-minikube 1/1 Running 0 114s kube-system kube-proxy-2zgz7 1/1 Running 0 102s kube-system kube-scheduler-minikube 1/1 Running 0 114s kube-system storage-provisioner 1/1 Running 1 (71s ago) 113s
Both virtuous and coredns
Pods terminated, and new ones are now pending! Evil Pod is running.
If you take a closer look, we can understand why. Our evil
Pod requests 3.3 cores of CPU and has the highest Pod Priority specified (
system-node-critical). Since the only node in the cluster has 1.25 cores of CPU available to be requested, there is a need to kill
Pods with lower priority. In this case, since virtuous
Pod and coredns were the ones with the lowest priority and with CPU requests specified, the Kubernetes scheduler preempted them.
If the evil
Pod didn’t specify a
PriorityClass, it would be pending due to a failed schedule.
We didn’t highlight Preemption in the introduction section for drama purposes.
PriorityClasses can state their
preemptionPolicy, which by default is
PreemptLowerPriority, but can also be
PriorityClasses shipped with Kubernetes clusters (
system-node-critical) have the policy
Pod Priority can be useful for some use cases such as prioritizing critical applications, but definitely can catch us off guard if we don’t have the right guardrails in place. This post illustrates potential consequences of not having them.