
Kubernetes in production — Pod Disruption Budget

July 23, 2018

How do you manage disruptions in Kubernetes? Setting a proper RollingUpdate strategy spec covers only one type of disruption.
What about other disruptions?

Voluntary and Involuntary Disruptions

Pods do not disappear until someone (a person or a controller) destroys them, or there is an unavoidable hardware or system software error.

We call these unavoidable cases involuntary disruptions to an application. Examples are:

a hardware failure of the physical machine backing the node
cluster administrator deletes VM (instance) by mistake
cloud provider or hypervisor failure makes VM disappear
a kernel panic
the node disappears from the cluster due to cluster network partition
eviction of a pod due to the node being out-of-resources.

Except for the out-of-resources condition, all these conditions should be familiar to most users; they are not specific to Kubernetes.

We call other cases voluntary disruptions. These include both actions initiated by the application owner and those initiated by a Cluster Administrator. Typical application owner actions include:

deleting the deployment or other controller that manages the pod
updating a deployment’s pod template causing a restart
directly deleting a pod (e.g. by accident)

Cluster Administrator actions include:

Draining a node for repair or upgrade.
Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling).
Removing a pod from a node to permit something else to fit on that node.

Dealing with Disruptions

Here are some ways to mitigate involuntary disruptions:

Ensure your pod requests the resources it needs.
Replicate your application if you need higher availability. (Learn about running replicated stateless and stateful applications.)
For even higher availability when running replicated applications, spread applications across racks (using anti-affinity) or across zones (if using a multi-zone cluster).

PodDisruptionBudget

An Application Owner can create a PodDisruptionBudget object (PDB) for each application. A PDB limits the number of pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
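As a rough illustration of the quorum case, the sketch below computes the majority quorum for an ensemble, which is the value a quorum-based application would likely want as minAvailable. The helper name quorum_size and the ensemble sizes are assumptions made for this example and are not part of any Kubernetes API.

```python
def quorum_size(members: int) -> int:
    """Smallest strict majority of `members` (hypothetical helper)."""
    if members < 1:
        raise ValueError("an ensemble needs at least one member")
    return members // 2 + 1


# For a majority-quorum system, minAvailable would be set to the quorum size.
for members in (3, 5, 7):
    print(f"{members} members -> minAvailable: {quorum_size(members)}")
```

For a three-member ensemble this gives 2, which would be consistent with the minAvailable: 2 in the zk-pdb example below if that ZooKeeper ensemble has three members (the article does not state the ensemble size).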

Cluster managers and hosting providers should use tools which respect Pod Disruption Budgets by calling the Eviction API instead of directly deleting pods. Examples are the kubectl drain command and the Kubernetes-on-GCE cluster upgrade script (cluster/gce/upgrade.sh).

When a cluster administrator wants to drain a node, they use the kubectl drain command. That tool tries to evict all the pods on the machine. An eviction request may be temporarily rejected, and the tool periodically retries all failed requests until all pods are terminated or a configurable timeout is reached.
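The retry behaviour can be pictured with a small simulation. This is a toy model, not the kubectl implementation: ToyBudget, drain, and the rule that one replacement pod becomes ready between retries are all assumptions made for illustration. The only detail taken from the real API is that a rejected eviction is answered with HTTP 429 (Too Many Requests).

```python
from dataclasses import dataclass


@dataclass
class ToyBudget:
    """Toy stand-in for a PDB with minAvailable; not a Kubernetes client object."""
    min_available: int
    healthy: int

    def try_evict(self) -> bool:
        # Allow the eviction only if the budget still holds afterwards.
        if self.healthy - 1 >= self.min_available:
            self.healthy -= 1
            return True
        # The real Eviction API would answer 429 Too Many Requests here.
        return False


def drain(budget: ToyBudget, pods_on_node: int, max_attempts: int) -> bool:
    """Evict pods one by one, retrying rejected evictions up to max_attempts."""
    evicted = 0
    replacements_pending = 0
    for _ in range(max_attempts):
        if evicted == pods_on_node:
            break
        if budget.try_evict():
            evicted += 1
            replacements_pending += 1  # the controller schedules a new pod
        elif replacements_pending > 0:
            # Assumption: one replacement becomes ready elsewhere before the retry.
            replacements_pending -= 1
            budget.healthy += 1
        # Otherwise nothing changes and the next retry is rejected again.
    return evicted == pods_on_node


print(drain(ToyBudget(min_available=1, healthy=2), pods_on_node=2, max_attempts=5))
print(drain(ToyBudget(min_available=1, healthy=1), pods_on_node=1, max_attempts=5))
```

In this model the first drain succeeds after one rejected attempt, while the second never makes progress and ends when the attempts run out, which corresponds to the timeout case described above.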

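Example PDB using minAvailable (the policy/v1beta1 API shown in these examples was removed in Kubernetes 1.25; on 1.21 or later use apiVersion: policy/v1):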

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

Example PDB Using maxUnavailable (Kubernetes 1.7 or higher):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: zookeeper
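Both fields also accept a percentage string such as "50%". The sketch below shows how either form translates into the number of pods that must stay available. The function name desired_healthy is hypothetical, and the rounding rule (percentages are rounded up) is stated here as my understanding of the Kubernetes documentation rather than something taken from this article.

```python
def desired_healthy(total, min_available=None, max_unavailable=None):
    """Pods that must stay available, given an int or 'NN%' budget value."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")

    def resolve(value):
        if isinstance(value, str) and value.endswith("%"):
            percent = int(value[:-1])
            return (total * percent + 99) // 100  # integer ceiling
        return int(value)

    if min_available is not None:
        return resolve(min_available)
    return max(0, total - resolve(max_unavailable))


print(desired_healthy(3, min_available=2))
print(desired_healthy(3, max_unavailable=1))
print(desired_healthy(7, min_available="50%"))
```

With three replicas, minAvailable: 2 and maxUnavailable: 1 describe the same budget. The difference shows up when the replica count changes: maxUnavailable scales with it, while an absolute minAvailable does not.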

Helm

Use this in your chart, as templates/pdb.yaml:


apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: 
  namespace: 
  labels:
    app: 
    chart: -
    release: 
    heritage: 
spec:
  selector:
    matchLabels:
      app: 
      env: 
  minAvailable:

Imagine you have a service with 2 replicas and you need at least 1 to be available even during node upgrades and other ops tasks.

Install or upgrade your release:

helm upgrade --install --debug "$RELEASE_NAME" -f helm/values.yaml \
  --set replicas=2,budget.minAvailable=1 myrepo/mychart

kubectl describe pdb "$RELEASE_NAME"

Name: mysvc-prod
Namespace: prod
Min available: 1
Selector: app=myservice,env=prod
Status:
 Allowed disruptions: 1
 Current: 2
 Desired: 1
 Total: 2
Events: <none>

Drain a node with one of your pods running:

kubectl drain --delete-local-data --force --ignore-daemonsets gke-mycluster-prod-pool-2fca4c85-k6g5
node "gke-mycluster-prod-pool-2fca4c85-k6g5" already cordoned
WARNING: Deleting pods with local storage: sqlproxy-67f695889d-t778w; 
Ignoring DaemonSet-managed pods: fluentd-gcp-v3.0.0-llp5s; 
Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet:
 kube-proxy-gke-testing-dev-pool-2fca4c85-k6g5
pod "tiller-deploy-7b7b795779-rcvkd" evicted
pod "mysvc-prod-6856d59f9b-lzrtf" evicted
node "gke-mycluster-prod-pool-2fca4c85-k6g5" drained

Check the PDB again:

kubectl describe pdb "$RELEASE_NAME"

Name: mysvc-prod
Namespace: prod
Min available: 1
Selector: app=myservice,env=prod
Status:
 Allowed disruptions: 0
 Current: 1
 Desired: 1
 Total: 2
Events: <none>

Tadaaa! We drained a node without any disruption to our service.
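The Allowed disruptions value in both describe outputs follows from the Current and Desired fields. A minimal sketch of that arithmetic, assuming the value is the number of healthy pods minus the number that must stay healthy, floored at zero (the function name is made up for this example):

```python
def allowed_disruptions(current_healthy: int, desired_healthy: int) -> int:
    """Evictions the budget still permits right now (never negative)."""
    return max(0, current_healthy - desired_healthy)


# Values taken from the two `kubectl describe pdb` outputs above.
before_drain = allowed_disruptions(current_healthy=2, desired_healthy=1)
after_drain = allowed_disruptions(current_healthy=1, desired_healthy=1)
print(before_drain, after_drain)
```

Before the drain one pod may be evicted. Right after it, with only one pod healthy, the budget is used up until the replacement pod becomes ready.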

PDB with 1 replica only?

If we had only 1 replica (with minAvailable set to 1), kubectl drain would always get stuck, and node drains and upgrades would have to be resolved manually.

You might expect the Eviction API to surge an extra replica to satisfy the minAvailable condition. Instead, the drain gets stuck and it is your responsibility to resolve the situation yourself. Is it a bug or a feature? The Kubernetes community says you shouldn’t use 1 replica in production at all if you want HA, which is fair :)

It does what is expected of it, though.

If you don’t want your kubectl drains to get stuck, create a PDB only for deployments with more than 1 replica.
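The rule of thumb can be written as a simple guard. This is only a sketch of the decision, not the chart logic from this article: the name should_create_pdb is hypothetical, and it assumes an absolute minAvailable value.

```python
def should_create_pdb(replicas: int, min_available: int) -> bool:
    """True if a fully healthy deployment would still allow one eviction."""
    return replicas - min_available >= 1


for replicas in (1, 2, 3):
    verdict = "create PDB" if should_create_pdb(replicas, 1) else "skip PDB"
    print(f"replicas={replicas}, minAvailable=1 -> {verdict}")
```

With a single replica and minAvailable of 1 the guard says to skip the PDB, since such a budget could never permit an eviction.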

Edit your templates/pdb.yaml:



apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
...

How to perform Disruptive Actions on your Cluster

If you are a Cluster Administrator, and you need to perform a disruptive action on all the nodes in your cluster, such as a node or system software upgrade, here are some options:

Accept downtime during the upgrade.

Fail over to another complete replica cluster.
No downtime, but may be costly both for the duplicated nodes, and for human labour to orchestrate the switchover.

Write disruption-tolerant applications and use PDBs.
No downtime.
Minimal resource duplication.
Allows more automation of cluster administration.
Writing disruption-tolerant applications is tricky, but the work to tolerate voluntary disruptions largely overlaps with work to support autoscaling and tolerating involuntary disruptions.

Sources:
https://kubernetes.io/docs/concepts/workloads/pods/disruptions

FAQs

Q1: What are the two main types of disruptions for applications in Kubernetes?

The two types are involuntary disruptions, which are unavoidable cases like hardware or system software errors, and voluntary disruptions, which are actions initiated by an application owner or a Cluster Administrator.

Q2: Can you provide examples of involuntary disruptions?

Yes, examples include a hardware failure on a physical machine, a VM disappearing due to cloud provider failure, a kernel panic, a node disappearing from the cluster due to a network partition, or a pod being evicted because a node is out of resources.

Q3: What are some examples of voluntary disruptions?

Application owner actions include deleting a deployment, updating a deployment’s pod template (which causes a restart), or accidentally deleting a pod. Cluster Administrator actions include draining a node for repair or upgrade, or draining a node to scale the cluster down.

Q4: What is a PodDisruptionBudget (PDB) and what is its purpose?

A PodDisruptionBudget (PDB) is a Kubernetes object an application owner can create for each application. Its purpose is to limit the number of pods in a replicated application that can be simultaneously down as a result of voluntary disruptions.

Q5: How does a PDB protect an application during a planned maintenance event like a node drain?

When a Cluster Administrator uses a tool like kubectl drain, it tries to evict pods from the node. If evicting a pod would violate the rules set in its PDB (for example, by bringing the available replica count below the minAvailable threshold), the eviction request is temporarily rejected. The tool will periodically retry until the pod can be safely removed without violating the budget.

Q6: What are the two main ways to configure a PDB’s availability requirements?

You can configure a PDB with either minAvailable, which specifies the minimum number of pods that must remain available, or maxUnavailable, which specifies the maximum number of pods that can be down at one time.

Q7: What happens if you apply a PDB to a deployment that has only one replica?

The kubectl drain command will get stuck. The eviction API will not automatically create a new replica to satisfy the PDB’s minAvailable condition, and the situation must be resolved manually. It is recommended to only use PDBs for deployments with more than one replica.

Q8: For a cluster-wide upgrade, what are the different strategies for handling disruptions?

There are three main options:

Accept downtime during the upgrade.
Fail over to a complete replica cluster, which involves no downtime but can be costly.
Write disruption-tolerant applications and use PDBs, which allows for no downtime, minimal resource duplication, and more automation.

Marek Bartík

Marek is a NoOps/NoCode enthusiast. He started as a C++ programmer while doing his master's in Computer Systems and Networks and, growing up in the SysAdmin era, quickly realized that communication and collaboration are key. Nowadays he focuses on Cloud Architecting, microservices and Continuous Everything to solve business problems, not technical ones. Marek is passionate about DevOps and Cloud Native.