Scaling in Google Kubernetes Engine
One of the main advantages of hosting an application in the cloud is the ability to scale resources in response to growing traffic or to predictable traffic peaks specific to your business. With on-premises solutions, you have no choice but to run as many resources as your application can possibly demand at its highest peak, usually 24/7. When you use the cloud well, on the other hand, your resource usage can follow your actual workload through its peak and idle phases, and you pay only for what you actually use.
We will look closely at how we use autoscaling on Google Cloud Platform, specifically on Google Kubernetes Engine (GKE), Google's managed Kubernetes environment, with examples and technical details.
Scaling in GKE can be divided into two parts. On the underlying host side, in GKE itself, we create a node pool (a group of instances used in Google Kubernetes Engine), enable autoscaling in its configuration, and set the minimum and maximum number of instances that can run at the same time. This can be done in the Google Cloud console, with the gcloud CLI, or with third-party provisioning tools such as Terraform.
On the Kubernetes side, we create a Horizontal Pod Autoscaler (HPA), which continuously computes and sets the desired number of replicas of a Kubernetes deployment. In our example, we have a deployment called nginx, built on the nginx base image. We create the Horizontal Pod Autoscaler with this command:
kubectl autoscale deployment nginx --min=2 --max=10 --cpu-percent=60
An HPA can also be created via YAML configuration. With these two easy steps, our application is ready to scale based on CPU utilization. Average CPU utilization is calculated across all running replicas of a deployment on a one-minute basis and compared to the target utilization, in this case 60%.
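To make the comparison against the target more concrete, here is a minimal sketch of a single autoscaler evaluation, based on the proportional rule described in the Kubernetes documentation (desired = ceil(current replicas × observed / target)). The function name and the sample utilization values are assumptions for illustration; only the bounds (2 and 10) and the 60% target come from the command above.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas):
    """One HPA evaluation step: scale the replica count in proportion to the
    ratio of observed to target utilization, then clamp it to the bounds."""
    ratio = current_utilization / target_utilization
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Bounds and target come from the kubectl autoscale command;
# the 90% and 30% observations are made-up inputs.
print(desired_replicas(2, 90, 60, 2, 10))  # 2 * 90/60 = 3.0, so 3 replicas
print(desired_replicas(4, 30, 60, 2, 10))  # 4 * 30/60 = 2.0, so 2 replicas
```

The real controller applies further rules (for example a tolerance band and handling of pods that are not yet ready), so this should be read as a simplified model rather than the full algorithm.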
We will simulate some basic traffic load with the ab (ApacheBench) load-testing tool. The Service load balancer will distribute traffic across all available replicas.
The Horizontal Pod Autoscaler info now shows that the average load across all pods (two at this point) is above the target, and the HPA sets the replica count to 4. And here comes the issue: one of the new pods didn’t start up properly and hangs in the Pending state.
If we investigate that pod with “kubectl describe pod POD_NAME”, the pod’s events show that scheduling failed due to insufficient CPU resources. We also see that the pod triggered a “scale-up” signal, which tells the underlying infrastructure (in this case GKE) to add more nodes. We even get the information that the node pool will scale from 2 to 3 nodes.
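The reason a pod stays Pending can be sketched with a simplified model of the scheduling check: a pod fits on a node only if the node's allocatable CPU minus the CPU already requested by other pods covers the new pod's request. The node names, the 940m allocatable figure, and the 300m pod request below are hypothetical values, and the real scheduler considers many more criteria than CPU.

```python
def find_node(pod_cpu_request_m, nodes):
    """Return the name of the first node whose free CPU (allocatable minus
    already requested, in millicores) covers the pod's request, else None."""
    for name, (allocatable_m, requested_m) in nodes.items():
        if allocatable_m - requested_m >= pod_cpu_request_m:
            return name
    return None

# Hypothetical cluster: two nodes, each with 940m allocatable CPU.
nodes = {"node-a": (940, 800), "node-b": (940, 700)}
pod_request_m = 300  # hypothetical CPU request of one nginx pod

# Free CPU is 140m and 240m, so no node fits and the pod stays Pending.
print(find_node(pod_request_m, nodes))  # None

# A third, empty node (as added by node-pool autoscaling) makes the pod schedulable.
nodes["node-c"] = (940, 0)
print(find_node(pod_request_m, nodes))  # node-c
```

Note that the check is based on CPU requests, not on measured usage, which is why a node can refuse a pod even when its CPUs are not busy.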
After a minute or so, we see that an additional node was created to serve our needs.
With the new resources, all four pods are now up and running.
After another minute, we see that the average load across all pods has stabilized roughly around the specified target. In this case we have a very stable load, which is why the calculated number of replicas results in almost spot-on target utilization. In the real world, with much more variable load, the average utilization and the number of replicas can fluctuate very often.
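Why a stable load settles this way can be illustrated with a small simulation that repeats the proportional scaling rule under a constant total load. This is a sketch under stated assumptions: the total load of 230 (the sum of per-pod utilization percentages) is a hypothetical figure chosen to reproduce the jump from 2 to 4 replicas, the load is assumed to spread evenly across pods, the min/max bounds are ignored, and 0.1 is the default tolerance documented for the Kubernetes autoscaler.

```python
import math

TARGET = 60       # target CPU utilization in percent, from the example
TOLERANCE = 0.1   # documented Kubernetes default tolerance around the target ratio
TOTAL_LOAD = 230  # hypothetical constant load: sum of per-pod utilization percentages

def simulate(replicas, steps=5):
    """Repeat the scaling step under constant total load and return a list of
    (replicas, average utilization) pairs, one per evaluation."""
    history = []
    for _ in range(steps):
        utilization = TOTAL_LOAD / replicas
        history.append((replicas, utilization))
        ratio = utilization / TARGET
        # Only rescale when the ratio leaves the tolerance band around 1.0.
        if abs(ratio - 1.0) > TOLERANCE:
            replicas = math.ceil(replicas * ratio)
    return history

for replicas, utilization in simulate(2):
    print(f"{replicas} replicas -> {utilization:.1f}% average CPU")
```

With these inputs the simulation goes from 2 replicas at 115% to 4 replicas at 57.5% and then stays there, because 57.5% is within the tolerance band around the 60% target. A fluctuating `TOTAL_LOAD` would make the replica count move much more often.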
We saw that within a few minutes the system was able to react to increased load and deploy as many replicas as were needed to meet the specified target utilization.
Although this solution is very powerful, it doesn't answer every scaling question. Fast as the reaction to increased load was, in some cases even this is too slow and potentially fatal to the entire system. One such example is a short batch workload with basically no leading edge, where load jumps to its peak almost instantly. In this case the system won’t react quickly enough, and applications can start failing under heavy utilization. In an ideal world, this type of batch work would always be anticipated and handled with predictive resource planning and scaling. In the real world, we sometimes have to settle for raising the lower bound on the number of replicas and a more aggressive autoscaling configuration, which results in a higher bill at the end of the day.
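The trade-off behind raising the lower bound can be shown with simple arithmetic: when a burst arrives faster than the autoscaler can react, the load is shared only by the replicas that are already running. The burst size of 480 (the sum of per-pod utilization percentages) and the candidate minimums below are hypothetical, and treating 100% as saturation assumes each pod cannot use more CPU than it requests.

```python
def burst_utilization(burst_load, replicas_at_start):
    """Average per-pod CPU utilization (percent) at the moment a burst arrives,
    before the autoscaler has had time to add replicas."""
    return burst_load / replicas_at_start

BURST_LOAD = 480  # hypothetical burst: sum of per-pod utilization percentages

for min_replicas in (2, 4, 8):
    util = burst_utilization(BURST_LOAD, min_replicas)
    status = "saturated" if util >= 100 else "has headroom"
    print(f"min={min_replicas}: {util:.0f}% per pod, {status}")
```

In this sketch only the highest minimum absorbs the burst without saturating, and those extra replicas run (and are billed) during idle periods too.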