Scaling in Google Kubernetes Engine

One of the main advantages of hosting an application in the cloud is the ability to scale resources in reaction to growing traffic or to predictable traffic peaks specific to your business. With on-premises solutions, you have no choice but to run as many resources as your application can possibly demand at its highest peak, usually 24/7. On the other hand, when you leverage cloud capabilities just right, your resource usage can mimic your actual workload with its peak and idle phases, and you pay only for what you actually use.

We will look closely at how to use autoscaling on the Google Cloud Platform, more specifically on Google Kubernetes Engine (GKE), Google's managed Kubernetes environment, with examples and technical details.

Scaling in GKE can be divided into two parts. On the underlying host side, in GKE itself, we create a node pool (a group of instances used in Google Kubernetes Engine), enable pool autoscaling in its configuration, and set the minimum and maximum number of instances running at the same time. This can be done in the Google Cloud console, with the gcloud CLI, or with third-party provisioning tools such as Terraform.
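For illustration, a node pool with autoscaling enabled could be created with the gcloud CLI roughly like this; the cluster name, pool name, zone, and node counts below are placeholder values, not part of the example cluster described in this article:

# create an autoscaled node pool in an existing cluster (placeholder names and limits)
gcloud container node-pools create scaling-pool \
  --cluster=my-cluster \
  --zone=europe-west1-b \
  --num-nodes=2 \
  --enable-autoscaling \
  --min-nodes=2 \
  --max-nodes=5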

On the Kubernetes side we create a Horizontal Pod Autoscaler (HPA), which continuously computes and sets the desired number of replicas of a Kubernetes deployment. In our example, we have a deployment called nginx, based on the nginx image. We create the Horizontal Pod Autoscaler with this command:

kubectl autoscale deployment nginx --min=2 --max=10 --cpu-percent=60

An HPA can also be created via a YAML configuration. With these two easy steps our application is ready to scale based on CPU utilization. Average CPU utilization is calculated across all running replicas of a deployment on a one-minute basis and is compared to the target utilization, in this case 60%.
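As a sketch, an equivalent YAML definition using the autoscaling/v1 API (which supports only the CPU metric) could look like this and be applied with kubectl apply -f:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx            # name of the HPA object
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx          # the deployment to scale
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60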

We will simulate some basic traffic load with the ab (Apache Bench) load testing tool. The service load balancer will distribute traffic across all available replicas.
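For example, a load could be generated roughly like this, where the request counts are arbitrary and EXTERNAL_IP stands for the service's external address:

# 100,000 requests with a concurrency of 100 against the service's external IP
ab -n 100000 -c 100 http://EXTERNAL_IP/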

Now the Horizontal Pod Autoscaler status shows that the average load across all pods (two at this point) is above the target, and it sets the desired replica count to 4. And here comes the issue: one of the new pods doesn't start up properly and hangs in a Pending state.
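The current state can be inspected with standard kubectl commands, for example:

kubectl get hpa nginx   # current vs. target utilization and replica counts
kubectl get pods        # one of the new pods shows a Pending status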

If we investigate that pod with “kubectl describe pod POD_NAME”, in the pod's events we see that the pod failed scheduling due to insufficient CPU resources, and besides that we see that it triggered the “scale-up” signal, which is a signal for the underlying infrastructure (in this case GKE) to add more nodes. We even get the information that the cluster will scale up from 2 to 3 nodes.

After a minute or so we see that an additional node was created to serve our needs.
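The new node should appear in the node list:

kubectl get nodes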

With the new resources in place, all four pods are up and running.

After another minute we see that the average load across all pods has stabilized roughly around the specified target. In this case we have a very stable load, which is why our calculated number of replicas results in almost spot-on target utilization. In the real world of much more variable load, the average utilization as well as the number of replicas can fluctuate quite often.

We saw that within a few minutes the system was able to react to the increased load and deploy as many replicas as were needed to meet the specified target utilization.

Although this solution is very powerful, it doesn't provide the answer to every scaling question. As fast as the reaction to increased load was, in some cases even this becomes too slow and potentially fatal to the entire system. One such example is a short batch workload with basically no leading edge. In this case the system won't react quickly enough and applications can start failing under heavy utilization. In an ideal world, this type of batch work would be anticipated every time and dealt with by predictive resource planning and scaling. In the real world, we sometimes have to settle for raising the lower bound on the number of replicas and a more aggressive autoscaling configuration, which results in a higher bill at the end of the day.
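As an illustration, the HPA could instead be created with a higher minimum replica count and a lower CPU target; the values below are arbitrary examples, not recommendations:

kubectl autoscale deployment nginx --min=5 --max=20 --cpu-percent=40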


FAQs

Q1: What are the two main parts of autoscaling in Google Kubernetes Engine (GKE)?

Scaling in GKE is divided into two parts: node-pool autoscaling on the GKE infrastructure side, which adjusts the number of server instances, and the Horizontal Pod Autoscaler (HPA) on the Kubernetes side, which adjusts the number of application replicas.

Q2: How do you configure GKE node-pool autoscaling?

You create a node-pool, which is a group of instances, and in its configuration, you enable pool autoscaling. You must also set the minimum and maximum number of instances that can be running at the same time.

Q3: What is a Horizontal Pod Autoscaler (HPA) and how does it function?

An HPA is a Kubernetes feature that continuously calculates and sets the desired number of replicas for a deployment. It functions by calculating the average CPU utilization across all running replicas every minute and comparing it to a target utilization percentage that you define.

Q4: How can an HPA be created for a Kubernetes deployment?

An HPA can be created via the command line with a command like kubectl autoscale deployment [deployment-name] --min=2 --max=10 --cpu-percent=60, or it can be defined in a YAML configuration file.

Q5: What happens if the HPA tries to add new pods but there aren't enough CPU resources on the existing nodes?

The new pods will get stuck in a "pending" state. This scheduling failure due to insufficient resources automatically triggers a "scale-up" signal to the underlying GKE infrastructure, which then begins the process of adding a new node to the cluster.

Q6: Once a scale-up is triggered, how long does it take for a new node to be added?

After the scale-up signal is triggered by a pending pod, it takes about a minute for a new node to be created and become available.

Q7: Is the standard GKE autoscaling solution suitable for all types of workloads?

No. While it is powerful, it may be too slow for certain scenarios. For example, a short but intense batch workload could cause applications to fail under the heavy load before the system has enough time to react and scale up.

Q8: How can you configure scaling for short, intense workloads that are hard to predict?

A common real-world solution is to configure a higher minimum number of replicas and use a more aggressive autoscaling configuration. This ensures more resources are available by default but also results in a higher bill.