K8s Scheduling: Pod Priority

Some Pods in a cluster are more important than others. If the cluster ever faces a resource crunch (for example, CPU is reaching its compute limits or memory is getting full), some remediation has to occur to ensure that the cluster as a whole does not crash and burn. In times like these, the kube-scheduler will, along with other remedies, pay attention to a Pod's priority.

What is Pod Priority?

Pod Priority allows you to indicate the importance of a Pod relative to other Pods, which affects the order in which Pods are scheduled.

Pod Priority is expressed through a native K8s object: the PriorityClass.

We can create a PriorityClass just like we create Secrets, ConfigMaps, and Deployments, i.e., through imperative commands and/or YAML manifests.
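
As an illustration, a PriorityClass manifest might look like the sketch below. The name ultra-high-priority is the one used later in this article; the value and the description are assumptions picked purely for the example.

```yaml
# Minimal PriorityClass sketch.
# The value and description are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ultra-high-priority
value: 1000000            # higher value = higher priority
globalDefault: false      # only applied to Pods that ask for it by name
description: "For the most critical workloads in the cluster."
```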

A PriorityClass manifest carries the following fields:

  • name: The name that will be used to refer to this PriorityClass.
  • value: An integer used for assigning priority. The higher the value, the more important the mapped Pods.
  • globalDefault: An optional boolean value (true, false). Only one PriorityClass in the cluster can have it set to true.
    • When set to true, any Pod that does not have a PriorityClass explicitly mapped to it will take on the priority assigned to this PriorityClass.
    • When set to false, this PriorityClass will not be mapped to any Pod unless the Pod explicitly asks for it.
  • description: Another optional attribute that holds a description for the PriorityClass object.

This PriorityClass object can be created by using

$ kubectl create -f <name of manifest>.yaml
💡
Addition of a PriorityClass with globalDefault set to true does not change the priorities of existing Pods. The value of such a PriorityClass is used only for Pods created after the PriorityClass is added.
(from Kubernetes online documentation)

Using a PriorityClass in a Deployment manifest

After one or more PriorityClasses have been created, you can create Pods that specify one of those PriorityClass names in their specifications.

Building upon the PriorityClass created earlier, a Pod that wants to inherit the prioritization assigned to ultra-high-priority declares it in its manifest through the priorityClassName field.
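
A minimal sketch of such a manifest is shown here. The Pod name, container name, and image are hypothetical placeholders; only priorityClassName and the ultra-high-priority name come from this article.

```yaml
# Pod that opts into the ultra-high-priority PriorityClass.
# Pod name, container name, and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: ultra-high-priority   # must match an existing PriorityClass
  containers:
    - name: app
      image: nginx
```

In a Deployment, the same priorityClassName field sits in the Pod template, i.e., under spec.template.spec.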

When a Pod that declares a PriorityClass is ready to be deployed, the kube-scheduler places it in a queue, with higher-priority Pods sequenced ahead of those with lower priority. This may* result in the higher-priority Pod being scheduled sooner than Pods with lower priority. If such a high-priority Pod cannot be scheduled, kube-scheduler will continue and try to schedule other, lower-priority Pods.

💡
*may: Remember that Pod priority is only one factor taken into consideration by kube-scheduler. Other declared needs, such as asking to be placed on a specific Node or requesting a certain amount of CPU, memory, or storage, are also taken into account before a Pod can be instantiated.
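
To make that concrete, here is a hedged sketch of a Pod that combines a PriorityClass with other placement constraints. The disktype: ssd Node label, the request sizes, and all names other than ultra-high-priority are assumptions made up for the example.

```yaml
# Priority is only one input: this Pod also needs a matching Node
# and enough free CPU/memory there.
# The Node label, request sizes, and names are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: critical-app-constrained
spec:
  priorityClassName: ultra-high-priority
  nodeSelector:
    disktype: ssd           # only Nodes carrying this label are candidates
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"       # scheduler looks for a Node with this much unreserved CPU
          memory: "256Mi"
```

Even with a very high priority, this Pod stays pending if no Node carries the label.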

Read these articles for more information on techniques used by kube-scheduler for Pod placement.

What happens if there are no Nodes with enough capacity to place a Pod with the ultra-high-priority PriorityClass?

In such a case, the kube-scheduler can preempt (evict) lower-priority Pods from a Node to free up resources and make room for the ultra-high-priority Pod. By using this tactic, cluster administrators are able to "skip the line" and "show bias" in favor of Pods that are considered more critical workloads.

But I don't want ultra-high-priority Pods to remove lower-priority Pods. Is that possible?

Yes.

With a minor addition to the PriorityClass manifest created earlier, we can ensure that a higher priority Pod is NEVER allowed to displace a lower priority Pod.
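
A sketch of that addition, reusing the ultra-high-priority name from this article (the value and description are again illustrative assumptions):

```yaml
# PriorityClass whose Pods queue ahead of others but never preempt running Pods.
# The value and description are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ultra-high-priority
value: 1000000
preemptionPolicy: Never   # the minor addition
globalDefault: false
description: "High priority, but does not evict lower-priority Pods."
```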

A preemptionPolicy of Never tells kube-scheduler to let a Node 'naturally' achieve the state that is required by a high priority Pod as opposed to bumping off lower priority Pods.

Kubernetes online documentation provides a good example to clarify the above:

An example use case is for data science workloads. A user may submit a job that they want to be prioritized above other workloads, but do not wish to discard existing work by preempting running pods. The high priority job with preemptionPolicy: Never will be scheduled ahead of other queued pods, as soon as sufficient cluster resources "naturally" become free.

I write to remember, and if in the process I can help someone learn about Containers, Orchestration (Docker Compose, Kubernetes), GitOps, DevSecOps, VR/AR, Architecture, and Data Management, that is just icing on the cake.