Abusing Pod Priority
Killer Coda is a well-known platform that hosts interactive environments for studying cloud native technologies. While doing their CKA scenarios, I found an intriguing one called Scheduling Priority. Since I was not familiar with the Pod Priority or PriorityClass concepts at the time, I did the usual and searched for them in the Kubernetes docs.
At the top of the Pod Priority and Preemption page, we can see a red warning:
Warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the highest possible priority, causing other Pods to be evicted/not get scheduled. An administrator can use ResourceQuota to prevent users from creating pods at high priorities.
See limit Priority Class consumption by default for details.
This message got my attention, and I wanted to see it in action.
Pod Priority⌗
The feature's name says it all: Pod Priority is a way to give more importance to some Pods than others. PriorityClasses manage the different levels of priority.
A PriorityClass definition looks like this:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
preemptionPolicy: PreemptLowerPriority
value: 1000
globalDefault: false
description: This priority class should be used for X pods only.
To assign priority to a Pod, its spec must contain the field priorityClassName with the corresponding PriorityClass, just as below:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: test
  name: test
spec:
  containers:
  - image: busybox
    name: test
  restartPolicy: Always
  priorityClassName: high-priority
It's also possible to configure one PriorityClass with globalDefault: true. After that, all new Pods without an explicit priorityClassName will be mutated1 and receive the default priority of the cluster.
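For reference, a minimal sketch of such a default class could look like the one below; the name default-priority and the value 100 are arbitrary choices for illustration, not something shipped with the cluster:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority   # illustrative name
value: 100                  # illustrative value, far below the built-in critical classes
globalDefault: true         # Pods without priorityClassName receive this priority
description: Hypothetical cluster-wide default priority.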
After this brief introduction, we are ready to move to the following hands-on sections! 🧪
All commands and configuration files are available in my GitHub repository.
Spin up a cluster, explore and create a deployment⌗
Let’s start by spinning up a fresh Kubernetes cluster locally with minikube:
minikube start
With the output of the following command, we can conclude that our cluster has one node:
kubectl get nodes
NAME       STATUS   ROLES           AGE   VERSION
minikube   Ready    control-plane   5s    v1.24.3
That node has 4 CPU cores allocatable with 18% already requested:
kubectl describe node minikube
# truncated
Allocatable:
  cpu:                4
  ephemeral-storage:  61255492Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  hugepages-32Mi:     0
  hugepages-64Ki:     0
  memory:             8039920Ki
  pods:               110
# truncated
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                750m (18%)  0 (0%)
  memory             170Mi (2%)  170Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  hugepages-32Mi     0 (0%)      0 (0%)
  hugepages-64Ki     0 (0%)      0 (0%)
# truncated
For this scenario, we will create a virtuous deployment with 2 CPU cores as requests to simulate a well-behaved 👼 application. Notice that we don't assign a PriorityClass to it:
kubectl apply -f virtuous-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: virtuous
  name: virtuous
spec:
  replicas: 1
  selector:
    matchLabels:
      app: virtuous
  template:
    metadata:
      labels:
        app: virtuous
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep 1d
        image: busybox
        name: virtuous
        resources:
          requests:
            cpu: 2
Speaking of which, do we have PriorityClasses in the cluster?
kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            32s
system-node-critical      2000001000   false            32s
Yes, we do. Two PriorityClasses: system-cluster-critical and system-node-critical, with the latter being the one with the highest priority. Let's see if we have Pods with PriorityClasses specified in our cluster:
kubectl get pods --all-namespaces -o custom-columns=POD:.metadata.name,PRIORITY_CLASS:.spec.priorityClassName,PRIORITY:.spec.priority,CPU_REQUESTS:'.spec.containers[*].resources.requests.cpu'
POD                                PRIORITY_CLASS            PRIORITY     CPU_REQUESTS
virtuous-7bc68f75fb-xpjz2          <none>                    0            2
coredns-6d4b75cb6d-htbdx           system-cluster-critical   2000000000   100m
etcd-minikube                      system-node-critical      2000001000   100m
kube-apiserver-minikube            system-node-critical      2000001000   250m
kube-controller-manager-minikube   system-node-critical      2000001000   200m
kube-proxy-2zgz7                   system-node-critical      2000001000   <none>
kube-scheduler-minikube            system-node-critical      2000001000   100m
storage-provisioner                <none>                    0            <none>
We have two Pods without a PriorityClass and two Pods without CPU requests. We also have one Pod with the lowest PriorityClass defined (system-cluster-critical). Since we didn't specify a PriorityClass for the virtuous Pod, its priority is zero.
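To double-check that for a single Pod, we can read the resolved priority field straight from its spec (the app=virtuous label comes from the Deployment we created above); given the table above, this should print 0:
kubectl get pods -l app=virtuous -o jsonpath='{.items[0].spec.priority}'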
Attack⌗
Time for the malicious user to get some action. 😈
If we describe our node again:
kubectl describe node minikube
# truncated
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                2750m (68%)  0 (0%)
  memory             170Mi (2%)   170Mi (2%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  hugepages-32Mi     0 (0%)       0 (0%)
  hugepages-64Ki     0 (0%)       0 (0%)
# truncated
Doing the math (4 - 2.75), we only have 1.25 CPU cores available to be requested. What happens if we request 3.3 CPU cores and use the highest PriorityClass in the cluster? 🐒
kubectl apply -f evil-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: evil
  name: evil
spec:
  replicas: 1
  selector:
    matchLabels:
      app: evil
  template:
    metadata:
      labels:
        app: evil
    spec:
      containers:
      - command:
        - sh
        - -c
        - sleep 1d
        image: busybox
        name: evil
        resources:
          requests:
            cpu: 3.3
      priorityClassName: system-node-critical
The result is:
kubectl get pods --all-namespaces
NAMESPACE     NAME                               READY   STATUS    RESTARTS      AGE
default       evil-7568b8bbf5-4fq5d              1/1     Running   0             37s
default       virtuous-7bc68f75fb-pf95j          0/1     Pending   0             37s
kube-system   coredns-6d4b75cb6d-85x7r           0/1     Pending   0             37s
kube-system   etcd-minikube                      1/1     Running   0             114s
kube-system   kube-apiserver-minikube            1/1     Running   0             114s
kube-system   kube-controller-manager-minikube   1/1     Running   0             114s
kube-system   kube-proxy-2zgz7                   1/1     Running   0             102s
kube-system   kube-scheduler-minikube            1/1     Running   0             114s
kube-system   storage-provisioner                1/1     Running   1 (71s ago)   113s
Both the virtuous and coredns Pods were terminated, and new ones are now pending! The evil Pod is running.
If we take a closer look, we can understand why. Our evil Pod requests 3.3 CPU cores and has the highest Pod Priority specified (system-node-critical). Since the only node in the cluster has just 1.25 CPU cores available to be requested, Pods with lower priority need to be evicted to make room. In this case, since the virtuous and coredns Pods were the ones with the lowest priority and with CPU requests specified, the Kubernetes scheduler preempted them.
If the evil Pod didn't specify a PriorityClass, it would simply stay Pending because it could not be scheduled.
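If you want to see the scheduler's decision for yourself, the victims' events record the preemption. The exact event reasons and messages vary between Kubernetes versions, so a simple (if blunt) way to look is to grep the recent events:
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -i preempt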
We didn't highlight Preemption in the introduction section for drama purposes. PriorityClasses can state their preemptionPolicy, which by default is PreemptLowerPriority, but can also be Never. Both PriorityClasses shipped with Kubernetes clusters (system-cluster-critical and system-node-critical) have the policy PreemptLowerPriority.
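For completeness, a non-preempting class would look like the sketch below (the name and value are illustrative). Pods using it are still placed ahead of lower-priority Pods in the scheduling queue, but they never evict running Pods to make room:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # illustrative name
value: 1000
preemptionPolicy: Never               # jump the queue, but never evict running Pods
globalDefault: false
description: Hypothetical high priority class that does not preempt other Pods.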
Clean up⌗
minikube delete
Conclusion⌗
Pod Priority can be useful for some use cases, such as prioritizing critical applications, but it can definitely catch us off guard if we don't have the right guardrails in place. This post illustrates the potential consequences of not having them.
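Following the docs' suggestion quoted at the beginning of this post, one such guardrail is a ResourceQuota that stops Pods in untrusted namespaces from consuming the critical PriorityClasses. A minimal sketch (the quota name and the team-a namespace are placeholders) could be:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: restrict-critical-priority   # placeholder name
  namespace: team-a                  # placeholder namespace for untrusted users
spec:
  hard:
    pods: "0"                        # zero Pods allowed within the selected scope
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical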
1. You can read more about mutating admission controllers in a previous post. ↩︎