Abusing Pod Priority
Killer Coda is a well-known platform that hosts interactive environments for studying cloud native technologies. While doing their CKA scenarios, I found an intriguing one called Scheduling Priority.
Since I was not familiar with Pod Priority or PriorityClass concepts at the time, I did the usual - searched for them in the Kubernetes docs.
At the top of the Pod Priority and Preemption page, we can see a red warning:
Warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priority, causing other Pods to be evicted/not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities.
See limit Priority Class consumption by default for details.
This message got my attention, and I wanted to see it in action.
Pod Priority
The feature’s name says it all: Pod Priority is a way to give more importance to some Pods than others. PriorityClasses manage the different levels of priority.
A PriorityClass definition looks like this:
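As a minimal sketch (the name `high-priority`, the value and the description are illustrative assumptions, not taken from a real cluster), such a manifest could be:

```yaml
# Hypothetical PriorityClass; name, value and description are made up for illustration.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
# Higher value means higher priority. User-defined classes can go up to 1000000000;
# larger values are reserved for the built-in system classes.
value: 1000000
globalDefault: false
description: "For workloads that should be scheduled ahead of regular Pods."
```

PriorityClass is a cluster-scoped object, so it has no namespace.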
To assign a priority to a Pod, the Pod spec must set the priorityClassName field to the name of the corresponding PriorityClass, just as below:
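A sketch of such a Pod, assuming a PriorityClass named `high-priority` already exists in the cluster (the Pod name and image are placeholders too):

```yaml
# Hypothetical Pod that references an existing PriorityClass by name.
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  # Must match the metadata.name of an existing PriorityClass,
  # otherwise the Pod is rejected at admission time.
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx
```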
It’s also possible to configure one PriorityClass with globalDefault: true. After that, all new Pods without an explicit priorityClassName will be mutated [1] and receive the default priority of the cluster.
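For illustration, a cluster-wide default could be declared roughly like this (the name and value are assumptions):

```yaml
# Hypothetical default PriorityClass; only one class in the cluster
# may have globalDefault set to true.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 1000
globalDefault: true
description: "Priority given to new Pods that do not set priorityClassName."
```

Pods that already exist keep the priority they were created with; only Pods created after this object is in place pick up the default.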
After this brief introduction, we are ready to move to the following hands-on sections! 🧪
All commands and configuration files are available in my GitHub repository.
Spin up a cluster, explore and create a deployment
Let’s start by spinning up a fresh Kubernetes cluster locally with minikube:
With the output of the following command, we can conclude that our cluster has one node:
That node has 4 CPU cores allocatable with 18% already requested:
For this scenario, we will create a virtuous deployment that requests 2 CPU cores to simulate a well-behaved 👼 application. Notice that we don’t assign a PriorityClass to it:
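A sketch of what that deployment could look like. Only the name and the 2-core CPU request come from the scenario; the image, labels and single replica are assumptions:

```yaml
# Well-behaved deployment: requests 2 CPU cores and sets no priorityClassName.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: virtuous
spec:
  replicas: 1
  selector:
    matchLabels:
      app: virtuous
  template:
    metadata:
      labels:
        app: virtuous
    spec:
      containers:
      - name: virtuous
        image: nginx        # placeholder image (assumption)
        resources:
          requests:
            cpu: "2"        # 2 full cores reserved on the node
```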
Speaking of which, do we have PriorityClasses in the cluster?
Yes, we do. Two PriorityClasses: system-cluster-critical and system-node-critical, with the latter being the one with the highest priority. Let’s see if we have Pods with PriorityClasses specified in our cluster:
We have two Pods without a PriorityClass and two Pods without CPU requests. We also have one Pod that uses the lower of the two PriorityClasses (system-cluster-critical). Since we didn’t specify a PriorityClass for the virtuous Pod, its priority is zero.
Attack
Time for the malicious user to get some action. 😈
If we describe our node again:
Doing the math (4 - 2.75), we only have 1.25 CPU cores left to be requested: the 2 cores of the virtuous Pod came on top of what the system Pods had already requested. What happens if we request 3.3 CPU cores and use the highest PriorityClass in the cluster? 🐒
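A sketch of such a Pod. The 3.3-core request and the system-node-critical class are the ones from this scenario; the Pod name and image are assumptions:

```yaml
# Malicious Pod: asks for more CPU than is free and claims the highest built-in priority.
apiVersion: v1
kind: Pod
metadata:
  name: evil
spec:
  priorityClassName: system-node-critical
  containers:
  - name: evil
    image: nginx            # placeholder image (assumption)
    resources:
      requests:
        cpu: "3300m"        # 3.3 cores, more than the 1.25 still unrequested
```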
The result is:
Both the virtuous and coredns Pods were terminated, and their replacements are now pending! The evil Pod is running.
If we take a closer look, we can understand why. Our evil Pod requests 3.3 CPU cores and has the highest Pod Priority specified (system-node-critical). Since the only node in the cluster has just 1.25 CPU cores left to be requested, the scheduler has to remove lower-priority Pods to make room. Preempting the virtuous Pod alone frees 2 cores, for a total of 3.25, which is still short of 3.3, so one more Pod has to go. Since the virtuous Pod and coredns were the ones with the lowest priority and with CPU requests specified, the Kubernetes scheduler preempted them.
If the evil Pod didn’t specify a PriorityClass, it would stay pending, because the scheduler could not fit it on the node and would have no lower-priority Pods to preempt.
We didn’t highlight Preemption in the introduction section for dramatic effect. PriorityClasses can state their preemptionPolicy, which by default is PreemptLowerPriority, but can also be Never. Both PriorityClasses shipped with Kubernetes clusters (system-cluster-critical and system-node-critical) have the policy PreemptLowerPriority.
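For comparison, a non-preempting class could be sketched like this (the name and value are illustrative assumptions):

```yaml
# Hypothetical PriorityClass whose Pods never evict other Pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
# Pods are placed ahead of lower-priority Pods in the scheduling queue,
# but they wait for free resources instead of preempting running Pods.
preemptionPolicy: Never
globalDefault: false
description: "High priority without preemption."
```

A Pod using a class like this would have stayed pending in our attack instead of pushing the virtuous and coredns Pods out.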
Clean up
Conclusion
Pod Priority can be useful for use cases such as prioritizing critical applications, but it can definitely catch us off guard if we don’t have the right guardrails in place. This post illustrates the potential consequences of not having them.
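One such guardrail is the ResourceQuota mentioned in the warning at the top of this post. As a hedged sketch (the quota name and target namespace are assumptions, and this was not part of the experiment above), a quota like the following should stop Pods in a namespace from using the built-in system classes:

```yaml
# Hypothetical quota: allow zero Pods that use the system PriorityClasses
# in the "default" namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: no-system-priority
  namespace: default
spec:
  hard:
    pods: "0"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["system-node-critical", "system-cluster-critical"]
```

A quota like this only covers the namespace it lives in, so it has to be created in every namespace that untrusted users can reach. The section of the Kubernetes docs linked in the warning describes how to restrict these classes by default instead.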
[1] You can read more about mutating admission controllers in a previous post.