Abusing Pod Priority
Killer Coda is a well-known platform that hosts interactive environments for studying cloud native technologies. While doing their CKA scenarios, I found an intriguing one called Scheduling Priority.
Since I was not familiar with the Pod Priority or PriorityClass concepts at the time, I did the usual: searched for them in the Kubernetes docs.
At the top of the Pod Priority and Preemption page, we can see a red warning:

> **Warning:** In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priority, causing other Pods to be evicted/not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities. See limit Priority Class consumption by default for details.
This message got my attention, and I wanted to see it in action.
Pod Priority⌗
The feature’s name says it all: Pod Priority is a way to give some Pods more importance than others. PriorityClasses manage the different levels of priority. A PriorityClass definition looks like this:
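Something along these lines (the class name, value, and description here are illustrative, not from the original post):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority   # illustrative name
value: 1000000          # higher value means higher priority
globalDefault: false
description: "Use for important workloads only."
```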
To assign a priority to a Pod, the spec of the Pod must contain the field `priorityClassName` with the corresponding PriorityClass, just as below:
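A sketch of such a Pod, assuming a hypothetical `high-priority` class exists in the cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  priorityClassName: high-priority   # must reference an existing PriorityClass
  containers:
  - name: nginx
    image: nginx
```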
It’s also possible to configure one PriorityClass with `globalDefault: true`. After that, all new Pods without an explicit `priorityClassName` will be mutated[^1] and receive the default priority of the cluster.
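A sketch of such a class, with illustrative name and value:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority   # illustrative name
value: 1000
globalDefault: true        # Pods without a priorityClassName receive this priority
```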
After this brief introduction, we are ready to move to the following hands-on sections! 🧪
All commands and configuration files are available in my GitHub repository.
Spin up a cluster, explore and create a deployment⌗
Let’s start by spinning up a fresh Kubernetes cluster locally with minikube:
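The exact invocation is in the linked repository; a minimal version could be:

```shell
minikube start
```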
With the output of the following command, we can conclude that our cluster has one node:
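For example:

```shell
kubectl get nodes
```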
That node has 4 CPU cores allocatable with 18% already requested:
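One way to check this (assuming the default node name `minikube`):

```shell
kubectl describe node minikube
# Look at the "Allocatable" and "Allocated resources" sections
```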
For this scenario, we will create a virtuous deployment requesting 2 CPU cores to simulate a well-behaved 👼 application. Notice that we don’t assign a PriorityClass to it:
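A sketch of such a Deployment (the name and image are assumptions, not from the original post):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: virtuous
spec:
  replicas: 1
  selector:
    matchLabels:
      app: virtuous
  template:
    metadata:
      labels:
        app: virtuous
    spec:
      containers:
      - name: virtuous
        image: nginx        # any well-behaved image works here
        resources:
          requests:
            cpu: "2"        # request 2 CPU cores
```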
Speaking of which, do we have PriorityClasses in the cluster?
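We can list them with:

```shell
kubectl get priorityclasses
```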
Yes, we do. Two PriorityClasses: `system-cluster-critical` and `system-node-critical`, with the latter being the one with the highest priority. Let’s see if we have Pods with PriorityClasses specified in our cluster:
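One way to list this, using custom columns:

```shell
kubectl get pods -A -o custom-columns='NAME:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,CPU_REQUESTS:.spec.containers[*].resources.requests.cpu'
```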
We have two Pods without a PriorityClass and two Pods without CPU requests. We also have one Pod with the lowest PriorityClass defined (`system-cluster-critical`). Since we didn’t specify a PriorityClass for the virtuous Pod, its priority is zero.
Attack⌗
Time for the malicious user to get some action. 😈
If we describe our node again:
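```shell
kubectl describe node minikube
```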
Doing the math (4 − 2.75), we only have 1.25 CPU cores available to be requested. What happens if we request 3.3 CPU cores and use the highest PriorityClass in the cluster? 🐒
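A sketch of the evil Pod (the name and image are assumptions; the CPU request and PriorityClass follow the text):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: evil
spec:
  priorityClassName: system-node-critical   # highest built-in priority
  containers:
  - name: evil
    image: nginx
    resources:
      requests:
        cpu: "3.3"   # more CPU than the node has left
```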
The result is:
Both the virtuous and coredns Pods were terminated, and new ones are now Pending! The evil Pod is running.
Taking a closer look, we can understand why. Our evil Pod requests 3.3 CPU cores and has the highest Pod Priority specified (`system-node-critical`). Since the only node in the cluster has just 1.25 CPU cores available to be requested, Pods with lower priority must be killed to make room. In this case, since the virtuous Pod and coredns were the ones with the lowest priority and with CPU requests specified, the Kubernetes scheduler preempted them.
If the evil Pod didn’t specify a PriorityClass, it would stay Pending due to a failed scheduling attempt.
We didn’t highlight Preemption in the introduction section for drama purposes. PriorityClasses can state their `preemptionPolicy`, which by default is `PreemptLowerPriority` but can also be `Never`. Both PriorityClasses shipped with Kubernetes clusters (`system-cluster-critical` and `system-node-critical`) have the policy `PreemptLowerPriority`.
Clean up⌗
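The repository has the exact cleanup commands; at a minimum, deleting the minikube cluster removes everything we created:

```shell
minikube delete
```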
Conclusion⌗
Pod Priority can be useful for some use cases such as prioritizing critical applications, but definitely can catch us off guard if we don’t have the right guardrails in place. This post illustrates potential consequences of not having them.
[^1]: You can read more about mutating admission controllers in a previous post.