Kubernetes Autoscaling

Copyright © 2025 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Early Access Publication: Kubernetes Autoscaling

Early Access Production Reference: B31728

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK

ISBN: 978-1-83664-383-8

www.packt.com

Table of Contents

  1. Kubernetes Autoscaling: Building Efficient and Cost-Optimized Clusters with KEDA and Karpenter
  2. 1 Introduction to Kubernetes Autoscaling
    1. Book conventions
    2. Technical Requirements
    3. Scalability Foundations
      1. A Bit of History
      2. Horizontal and Vertical Scaling
    4. Kubernetes Architecture
    5. Efficient Kubernetes Data Planes
      1. What do I mean by efficiency?
      2. Challenges and considerations
    6. Kubernetes Autoscaling Categories
      1. Application Workloads
      2. Data Plane Nodes
    7. Hands-On: Creating a Kubernetes Cluster
      1. Local Kubernetes cluster with Kind
      2. Cloud Kubernetes cluster in AWS
    8. Summary
  3. 2 Workload Autoscaling Overview
    1. Technical Requirements
    2. Challenges of autoscaling workloads
    3. How does the Kubernetes scheduler work?
      1. Configuring requests
      2. Configuring limits
      3. What if you don't specify resource requests or limits?
      4. Pod configuration example
      5. What if the pod exceeds the resource limits?
      6. Recommendations for configuring resources and limits
    4. Workload Rightsizing
      1. Monitoring
      2. Establishing defaults
    5. Workload Autoscalers
      1. Horizontal Pod Autoscaler (HPA)
      2. Vertical Pod Autoscaler (VPA)
      3. Kubernetes Event-Driven Autoscaling (KEDA)
    6. Summary
  4. 3 Workload Autoscaling with HPA and VPA
    1. Technical Requirements
    2. The Kubernetes Metrics Server
      1. Metrics Server: The What and the Why
      2. Hands-On: Setting up Metrics Server
      3. Hands-On: Using Metrics Server
    3. Horizontal Pod Autoscaler: Basics
      1. How does HPA scale resources?
      2. Defining HPA Scaling Policies
      3. Hands-On: Scaling using basic metrics with HPA
    4. HPA and Custom Metrics
      1. How does HPA work with custom metrics?
      2. Hands-On: Scaling using custom metrics with HPA
    5. Vertical Pod Autoscaler: Basics
      1. How does VPA scale resources?
      2. Defining VPA Scaling Policies
      3. Hands-On: Automatic Vertical Scaling with VPA
    6. How to work with HPA and VPA together?
    7. Summary
  5. 4 Kubernetes Event-Driven Autoscaling (KEDA) – Part 1
    1. Technical requirements
    2. KEDA: What it is and why you need it
    3. KEDA's architecture
      1. KEDA Operator
      2. KEDA Metrics Server
      3. Admission Webhooks
      4. What Kubernetes objects can KEDA scale?
    4. KEDA Scalers
    5. KEDA CRDs
      1. ScaledObjects
      2. ScaledJobs
      3. TriggerAuthentication and ClusterTriggerAuthentication
      4. Hands-On: Installing KEDA
    6. Scaling Deployments
      1. Hands-On: Scaling using Latency
      2. Controlling autoscaling speed
      3. Scaling to Zero
      4. Hands-On: Scaling from/to Zero
    7. Scaling Jobs
      1. Hands-On: Scaling Jobs
    8. Summary
  6. 5 Kubernetes Event-Driven Autoscaling (KEDA) – Part 2
    1. Technical requirements
    2. Autoscaling in KEDA continued
    3. Scaling based on schedule
      1. Hands-on lab: Scaling to zero during non-working hours
    4. KEDA's HTTP add-on
    5. What if a KEDA scaler is not available?
      1. Caching metrics
      2. Pausing autoscaling
      3. Fallback scaling actions
    6. Advanced autoscaling features
      1. Complex triggers with scaling modifiers
      2. Hands-on lab: Pausing autoscaling when resources are constrained
      3. Extending KEDA with external scalers
    7. KEDA with Cloud Providers
      1. KEDA on Amazon EKS
      2. KEDA on Azure Kubernetes Service
      3. KEDA on Google Kubernetes Engine
    8. Summary
  7. 6 Workload Autoscaling Operations
    1. Technical requirements
    2. Workload Autoscaling Operations
    3. Troubleshooting Workload Autoscaling
      1. Troubleshooting HPA
      2. Troubleshooting VPA
      3. Troubleshooting KEDA
    4. Monitoring KEDA
      1. Hands-on lab: Deploying KEDA's Grafana dashboard
    5. Upgrading KEDA
    6. Best practices for workload efficiency
    7. Summary
  8. 7 Data Plane Autoscaling Overview
    1. Technical requirements
    2. What is data plane autoscaling?
    3. Data Plane Autoscalers
      1. Cluster Autoscaler
      2. Karpenter
    4. Cluster Autoscaler on AWS
      1. Hands-on lab: Cluster Autoscaler on AWS
      2. Cluster Autoscaler Best Practices
    5. Relevant Autoscalers
      1. Descheduler
      2. Cluster Proportional Autoscaler
      3. Cluster Proportional Vertical Autoscaler
    6. Summary

Kubernetes Autoscaling: Building Efficient and Cost-Optimized Clusters with KEDA and Karpenter

Welcome to Packt Early Access. We’re giving you an exclusive preview of this book before it goes on sale. It can take many months to write a book, but our authors have cutting-edge information to share with you today. Early Access gives you an insight into the latest developments by making chapter drafts available. The chapters may be a little rough around the edges right now, but our authors will update them over time. You can dip in and out of this book or follow along from start to finish; Early Access is designed to be flexible. We hope you enjoy getting to know more about the process of writing a Packt book.

  1. Chapter 1: Introduction to Kubernetes Autoscaling
  2. Chapter 2: Workload Autoscaling Overview
  3. Chapter 3: Workload Autoscaling with HPA and VPA
  4. Chapter 4: Kubernetes Event-Driven Autoscaling (KEDA) – Part 1
  5. Chapter 5: Kubernetes Event-Driven Autoscaling (KEDA) – Part 2
  6. Chapter 6: Workload Autoscaling Operations
  7. Chapter 7: Data Plane Autoscaling Overview

1 Introduction to Kubernetes Autoscaling

Autoscaling is at the heart of an efficient and cost-effective Kubernetes cluster. Yet, many organizations struggle with resource management, often leading to significant financial waste or system failures. Understanding and implementing proper scaling can be the difference between a thriving, responsive system and one that's constantly on the brink of collapse. In this chapter, I'll introduce you to the concepts, strategies, and technologies surrounding autoscaling in Kubernetes, exploring its mechanisms, benefits, and challenges. You'll understand why scaling needs to evolve from a manual to an automatic process and how it has become an integral part of modern cloud-native architectures. I've worked with organizations that have achieved between 50% and 70% cost reduction in their compute expenses by making their Kubernetes clusters more efficient. This book aims to guide you on how you can achieve similar optimizations. In this chapter, I'm going to cover the following main topics:

  • Kubernetes autoscaling
  • The need for autoscaling in Kubernetes
  • How autoscaling works in Kubernetes
  • The different types of autoscaling in Kubernetes
  • The challenges that autoscaling solves, and the new challenges it introduces

By the end of this chapter, you'll understand how autoscaling works in Kubernetes and what strategies you need to implement to have efficient and cost-optimized clusters. You'll finish by setting up a Kubernetes cluster, which you'll use to practice what you'll learn in this book. In the following chapters, I'll go into much more depth about how to implement all of the concepts covered here.

Book conventions

The upcoming sections will introduce various new concepts, accompanied by hands-on examples that need active engagement with a Unix shell environment. To ensure clarity in the practical demonstrations, the book will adhere to the following conventions:

  • Every practical section will start with the "Hands-On" prefix.
  • Commands preceded by a $ symbol indicate actions you need to perform.
  • For shell commands or outputs that exceed the width of a single page, we'll use the \ character to break the line. The continuation of the command or output will appear on the subsequent line.

These guidelines will help you to interpret long commands or outputs more easily as we progress through the book.

Technical Requirements

To practice what you'll be learning throughout this book, you need to have access to a Kubernetes cluster. In this chapter, I'm offering two options for cluster creation: a local setup and a cloud-based solution in AWS. If you're just starting out or you want to focus solely on workload scaling in the upcoming chapters, the local option using Kind (Kubernetes in Docker) is a good choice. Alternatively, you could use Docker Desktop and turn on Kubernetes or use a Kubernetes playground like Killercoda or Play with Kubernetes. However, if you're eager to dive into the full cloud experience from the start, I'll also provide instructions for setting up an Amazon EKS cluster using Terraform. This option will give you a more realistic environment, closely mimicking production scenarios. Just remember, if you choose this route, to delete all resources after your study sessions to avoid unnecessary charges. To make the process as smooth as possible, all the files I'll use to provision the cluster are available in this GitHub repository: https://github.com/PacktPublishing/Kubernetes-Autoscaling. This repository will be regularly updated, ensuring you always have access to the most current configurations. You need to meet the following prerequisites to complete the hands-on labs:

  • Git
  • Helm 3.16.1+
  • kubectl (latest version)
  • Go 1.16+ or Docker, podman, or nerdctl
  • A command line interface (CLI) to run all the commands
  • Terraform 1.9.5+ (AWS only)
  • AWS CLI 2.17.43+ (AWS only)
  • Access to an AWS account with permissions to create an Amazon EKS cluster (AWS only)

Scalability Foundations

Before I dive into how autoscaling works in Kubernetes, let me first talk about what scaling means in the context of computing systems. Scaling is the ability of a system to handle increasing amounts of work by adding resources to manage the load efficiently. It's a fundamental concept in computing that has become increasingly crucial in the era of cloud-native applications and microservices architectures. The true power of scaling lies in its automation - the ability to adjust resources dynamically without human intervention, which is essential for managing rapid and unpredictable demand. The following diagram illustrates the pitfalls of static resource allocation:

Figure 1.1 – Challenges with static and manual infrastructure

As you can see in Figure 1.1, overprovisioning wastes resources and increases costs, while underprovisioning risks poor performance and potential system failures. Autoscaling offers a solution to both of these problems. Consider an online retailer during a flash sale: with autoscaling, their system automatically detects increased load and rapidly provisions additional resources to handle the traffic spike. As the sale ends and traffic subsides, it scales back down, releasing unnecessary resources. This dynamic approach ensures optimal performance during peak times, while also minimizing costs during periods of lower demand. By automatically adjusting resources in real-time, autoscaling can help you maintain a balance between performance and efficiency, adapting to unpredictable demand patterns without manual intervention.

A Bit of History

The concept of autoscaling took a significant leap forward in 2009 when Amazon Web Services (AWS) introduced Auto Scaling (now known as Amazon EC2 Auto Scaling). This feature allows you to automatically adjust compute capacity to meet your application demands at the lowest possible cost. This shift towards automation not only improved efficiency but also allowed organizations to be more responsive to fluctuating demands, rather than getting bogged down in the system administration and operational burden of doing it manually. After 2009, numerous organizations began implementing and refining autoscaling strategies. Netflix, for instance, documented their approach of aggressive scaling up and down to meet variable consumer demands. Other companies like Dropbox, Spotify, and Airbnb also shared their experiences and best practices for autoscaling in cloud environments. The way I see it, this collective knowledge and experience from various industry leaders contributed to the evolution of autoscaling technologies and practices. As cloud-native architectures became more prevalent, the need for more sophisticated, application-aware autoscaling mechanisms grew. Kubernetes, with its origins in Google's experience running large-scale production workloads, embodies many of these autoscaling best practices. From its beginnings, Kubernetes was designed with the principles of efficient resource allocation and automatic scaling in mind. By building autoscaling into its core functionality, Kubernetes took the lessons learned from years of cloud scaling experience and made them accessible to a wider range of applications and organizations. This native support for autoscaling has been a key factor in Kubernetes' widespread adoption, as it allows teams to focus on building and improving their applications rather than managing the underlying infrastructure scaling.

Horizontal and Vertical Scaling

Now that you understand the importance of scaling and its automated nature, let's explore the two primary approaches to scaling: vertical and horizontal. These methods offer different ways to increase a system's capacity to handle load, each with its own advantages and challenges.

Vertical Scaling

Vertical scaling, often referred to as "scaling up," involves boosting the capabilities of an existing machine or instance. This method typically focuses on enhancing a single node's processing power, memory capacity, or storage. Picture a scenario where your application server is struggling with its current workload. A vertical scaling solution might involve upgrading a node from 2 vCPUs with 4 GiB of RAM to a node with 8 vCPUs and 16 GiB of RAM, as shown in Figure 1.2:

Figure 1.2 – Vertical Scaling example

As Figure 1.2 shows, this approach is simple, particularly for applications not designed with a distributed architecture in mind. It's often the go-to solution for improving the performance of monolithic applications or certain types of databases that prefer running on a single, powerful machine. However, vertical scaling isn't without its drawbacks:

  • Costs: The cost of high-end hardware can escalate rapidly.
  • Physical limitations: There is an upper bound to how much you can enhance a node.
  • Potential interruptions: Hardware upgrades cause system downtime.
  • Reliability: Having one (or few) node(s) creates a potential single point of failure.

While vertical scaling can offer quick performance boosts, it may not provide the adaptability and robustness required for your workload, especially if it's very dynamic.

Horizontal Scaling

Horizontal scaling, often called "scaling out," takes a different approach to handling increased load. Instead of boosting a single node, horizontal scaling involves adding more nodes to your system. The following diagram illustrates an example of this:

Figure 1.3 – Horizontal Scaling example

Imagine that Figure 1.3 shows what happens in a restaurant: vertical scaling would be like hiring a super-chef, while horizontal scaling would be like adding more cooks to the kitchen. In practice, horizontal scaling might look like this: if your web application is struggling with high traffic, instead of upgrading a single server, you'd add more similar servers to distribute the load. This approach aligns well with modern, distributed architectures and cloud-native applications. Horizontal scaling offers several advantages that make it particularly useful for dynamic and distributed systems. For starters, there's the theoretically unlimited scalability it provides; as demand grows, you can continually add more machines to your system to meet these increasing needs. This approach also enhances fault tolerance significantly; if one node experiences problems or fails entirely, the other nodes in the system can compensate, ensuring continued operation and minimizing downtime. Furthermore, with horizontal scaling you can closely match your system's capacity to current demands, effectively avoiding the costly pitfall of over-provisioning resources. Horizontal scaling is particularly well-suited for stateless applications, microservices architectures, and distributed systems. However, horizontal scaling can also pose some challenges:

  • Complexity: Applications need to be designed to work across multiple nodes.
  • Consistency: Ensuring data remains consistent across all nodes can be tricky.
  • Network overhead: Communication between nodes can introduce latency.
  • License costs: Some software licenses may charge per node.

In the context of Kubernetes, which we'll explore in depth later, horizontal scaling is a fundamental concept. Kubernetes' ability to automatically scale the number of pods running an application is a prime example of horizontal scaling in action. However, you'll also see when vertical scaling might make more sense or can even help you right-size your workloads.

Kubernetes Architecture

Before I dive into the specifics of autoscaling in Kubernetes, let's have a quick overview of the Kubernetes architecture and its components, especially those involved in the autoscaling aspect of Kubernetes. Assuming you're already familiar with Kubernetes, you know that each cluster is composed of several nodes where one or more components run (e.g., kube-apiserver, kube-scheduler, or kubelet). A node is essentially a server, machine, or instance. Depending on where you're hosting your Kubernetes cluster, you'll encounter different terminology. In this book, I'll stick with node. As shown in Figure 1.4, a very simplified view of the Kubernetes architecture, cluster nodes are grouped into two: 1) the Control Plane, and 2) the Data Plane.

Figure 1.4 – A very simple Kubernetes architecture overview

In Figure 1.4, you can see the control plane nodes on the left side. These Control Plane nodes contain all the components that manage the overall state of the cluster, as per the Kubernetes documentation. If you set up your cluster on your own using tools like kubeadm, kops, or kubespray, you need to manage the scaling of the nodes yourself. On the other hand, if you set up your cluster using managed services like Amazon Elastic Kubernetes Service (Amazon EKS), Google Kubernetes Engine (GKE), or Azure Kubernetes Service (AKS), the companies offering these services handle the scaling of the nodes for you. Whichever setup you have, this book won't cover the autoscaling aspect for the control plane, focusing instead on the data plane. In Figure 1.4, you can see the data plane nodes on the right side. These nodes are where your workloads run. The more workloads you deploy to the cluster, the more nodes you'll end up having. This is because each node should host a limited set of pods, mainly for two reasons: 1) you don't want to have all your pods on a single large node for high availability reasons, and 2) by default, a node can't host more than 110 pods (as per today's official Kubernetes documentation). The big question now is: how many nodes does your workload need? How can you ensure you're not wasting resources while also not constraining the cluster? In other words, how do you build an efficient Kubernetes cluster? It's tricky, so let's talk about what makes a data plane efficient, at least from the compute perspective.

Note

Even though there are other aspects besides compute, like storage or networking, that have an impact on how you define the number of nodes, this book will only focus on the compute aspect. However, as we'll see in the upcoming chapters, aspects like networking have an impact on how you scale your nodes. We'll dive deeper into that later.

Efficient Kubernetes Data Planes

Understanding how to optimize your Kubernetes data plane is crucial for building cost-effective, high-performing clusters that can adapt to changing workloads. To do so, we'll explore what efficiency means in the context of Kubernetes, why it's important, and the key factors that contribute to an efficient data plane. Then, I'll cover the challenges of achieving and maintaining efficiency, as well as strategies for optimizing resource utilization.

What do I mean by efficiency?

To answer this question, let's first talk about what efficiency means in a broader sense. Efficiency, in general terms, is the ability to accomplish a task with minimal waste of time and effort. It's about achieving the desired outcome while using the fewest resources possible. In other words, no wasted resources. Expressed as a formula, as shown in Figure 1.5, it looks like this:

Figure 1.5 – Efficiency formula for compute resources
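Figure 1.5 expresses this as a ratio. While the exact formulation in the figure may differ slightly, the idea boils down to something like:

Efficiency (%) = (compute resources actually used / compute resources provisioned and paid for) × 100

The closer this number gets to 100%, the less you're paying for capacity that sits idle.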

Now, let's apply this concept to Kubernetes, specifically to the data plane nodes. In this context, efficiency means optimizing the utilization of compute resources to achieve the best performance at the lowest cost. It's not just about using all available resources; it's about using them wisely and effectively. As the focus of this book is on compute, this means that efficiency is about maximizing the use of available CPU and memory while minimizing waste. In traditional on-premises environments, efficiency might have been less of a concern, as hardware costs were often treated as a sunk cost. However, in the cloud, where you typically pay for what you use (also known as pay-as-you-go), efficiency has a direct impact on your monthly bill. This makes resource optimization not just a best practice, but a financial imperative. So, as this book focuses on the cloud, an efficient Kubernetes data plane in the cloud context involves several key aspects:

  • Right-sizing your resources: Ensuring your pods have enough resources to run effectively, but not so much that they're wasting capacity.
  • Maximizing node utilization: Filling your nodes with an optimal number of pods to avoid running unnecessary nodes.
  • Scaling appropriately: Increasing or decreasing resources in response to actual demand, rather than over-provisioning for peak loads.
  • Avoiding idle resources: Identifying and eliminating unused or underutilized resources that you're still paying for.
  • Choosing appropriate instance types: Selecting the most cost-effective node types for your workloads.

By focusing on these aspects of efficiency, you can significantly reduce your cloud costs. For instance, an inefficient cluster might have nodes running at 30% capacity, meaning you're essentially paying for 70% of unused resources. An efficient cluster, on the other hand, might consistently run at 70-80% capacity, dramatically reducing waste and, consequently, costs. Of course, these numbers can vary depending on the type of workloads you have and the technical debt they might carry. Achieving this level of efficiency isn't straightforward. In the following chapters, I'll dive into the details of how you can address the key aspects I mentioned before. The goal is to help you build and maintain an efficient data plane, striking the right balance between performance, reliability, and cost-effectiveness. However, before doing that, it's important to understand the challenges and considerations involved in achieving efficiency in Kubernetes (actually, for compute in general).

Challenges and considerations

Several factors across the entire application stack influence performance and efficiency. The design of your application components, including layers, dependencies, and how they interact, has an impact on efficiency. The underlying infrastructure, from the physical hardware to the operating system, plays an important role in overall performance. How Kubernetes schedules pods and scales resources directly affects resource utilization. Moreover, network performance factors like latency, throughput, and the configuration of load balancers can impact the efficiency of your cluster. Achieving efficiency is challenging due to several factors. Workloads often have diverse resource requirements; some may be CPU-bound, others memory-bound, leading to uneven resource utilization across nodes. Kubernetes typically scales at the pod or node level, which can be too coarse-grained for optimal efficiency. And let's not forget that there's a delay between when additional resources are needed and when they become available. Workloads with dependencies often require resource buffers while waiting on these dependencies, leading to periods of underutilization; these may be small, but they're still something you need to account for. Non-technical factors can also lead to inefficiency. Overprovisioning out of an abundance of caution is common, especially for critical workloads. This is a typical scenario during peak seasons, when maintaining business continuity matters more than an ideal efficiency number. While striving for an efficient Kubernetes data plane is important for cost optimization, it's a complex challenge that requires a holistic approach. It involves not just infrastructure management, but also application design, careful capacity planning, and sometimes, a willingness to accept calculated risks. As you proceed through this book, you'll explore strategies and tools that can help you navigate these challenges and achieve a more efficient Kubernetes cluster. Now that you understand the complexities and challenges of achieving efficiency in Kubernetes environments, let's explore the autoscaling mechanisms that Kubernetes provides to address these challenges.

Kubernetes Autoscaling Categories

When working with Kubernetes, there are two primary categories of scaling that require attention, as alluded to earlier. The first category pertains to the application workloads, represented by pods, while the second category involves the underlying infrastructure of the data plane, specifically the nodes. This distinction is crucial because each category requires different approaches and tools for effective autoscaling.

Application Workloads

The first step in your Kubernetes efficiency journey focuses on application workloads, specifically the pods that run your applications. This is a critical starting point because it directly impacts how efficiently your cluster utilizes resources and responds to changing demands. Central to this optimization is setting appropriate resource requests for your pods. This is crucial because the kube-scheduler uses these requests to make decisions about pod placement. If requests are set too high, you may waste resources and limit the number of pods that can be scheduled. Conversely, if set too low, your applications may underperform or face resource contention. The recommended approach for scaling application workloads is horizontal scaling - adding or removing pod replicas as demand fluctuates. This is typically achieved using the Horizontal Pod Autoscaler (HPA). While I'll dive deeper into HPA in the next chapter, it's worth noting that this tool automatically adjusts the number of pod replicas based on observed metrics like CPU or memory utilization, or custom metrics that you define. There are other tools in the same vein, like Kubernetes Event-driven Autoscaling (KEDA), that in essence do a similar job to HPA but with broader scaling options. I'll go into much more detail about this in the next chapters. Let's consider an example: Imagine you have a web application that experiences varying levels of traffic throughout the day. During peak hours, the CPU utilization of your pods might spike to 80% or higher, similar to what you see in Figure 1.6:

Figure 1.6 – Pods scaling out due to resource utilization

The preceding figure illustrates that when pods from a deployment are using 80% of their CPU capacity, it's time to add more replicas to prevent the workload from underperforming. In this scenario, the HPA could automatically increase the number of pod replicas to distribute the load. As you can see in Figure 1.7, that's exactly what happens: the HPA adds replicas while traffic is high, and as traffic subsides and CPU utilization drops, it reduces the number of replicas again, conserving resources.

Figure 1.7 – Web application scaling out using an HPA rule
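To give you an early taste of what such a scaling policy looks like (the details are covered in Chapter 3), here's a minimal sketch of an HPA manifest that targets 80% average CPU utilization. It assumes the Metrics Server is installed and uses web-app as a placeholder Deployment name:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app              # placeholder: the Deployment you want to scale
  minReplicas: 2               # never go below two replicas
  maxReplicas: 10              # upper bound for scale-out
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # keep average CPU utilization around 80%

With this in place, Kubernetes periodically compares the observed average CPU utilization of the pods against the 80% target and adjusts the replica count between 2 and 10 accordingly.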
While horizontal scaling is the recommended best practice, there are situations where adjusting the resources of individual pods might be necessary. The Vertical Pod Autoscaler (VPA) is a tool that can help with this, automatically adjusting CPU and memory requests based on historical resource usage. However, it's important to note that vertical scaling is generally less flexible and can lead to pod restarts, which may cause brief service interruptions. We'll dive deeper into this in Chapter 3 as well.
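For comparison, a VPA object looks similar in shape. The following sketch assumes the VPA components are installed in the cluster (something covered in Chapter 3) and, again, uses web-app as a placeholder Deployment name:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app          # placeholder: the Deployment whose requests VPA should manage
  updatePolicy:
    updateMode: "Auto"     # VPA may evict pods to apply the recommended requests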

In general, I advocate for horizontal autoscaling over vertical autoscaling because of the following:

  • Better fault tolerance: If one pod fails, others can continue serving requests.
  • Improved resource utilization: It's often easier to fit many small pods across your nodes than a few large ones.
  • Easier rolling updates: With multiple replicas, you can update your application without downtime.

Data Plane Nodes

Scaling the data plane nodes becomes necessary when the cluster's capacity to run pods reaches its limit, especially when nodes are added only as needed. What I didn't say before is that when you scale out your workloads by adding more pods (following the recommendation for horizontal scaling), you may eventually reach a point where the cluster lacks the capacity to run these additional pods. This results in unscheduled pods, which serve as a trigger for the second category of Kubernetes autoscaling: scaling the data plane nodes. Tools like Cluster Autoscaler (CAS) and Karpenter are designed to address this challenge. These projects react to the presence of unscheduled pods in the cluster by adding more capacity. Both have built-in logic to interact with the underlying infrastructure and provision additional nodes as needed. I'll dive deep into these tools in the following chapters. Let's continue with our web application example. As the load increases and more pods are added to handle the traffic, you might reach a point where all existing nodes are at capacity. Let's say you have only 1 node capable of running 4 pods, and your HPA or KEDA scaling policy has determined that you now need 10 pods to handle the current load, as shown in the following image:

Figure 1.8 – Unscheduled pods as there are not enough nodes to support the scaling event

As shown in Figure 1.8, only 2 additional pods could fit onto the existing node, leaving 6 pods unscheduled. At this point, Karpenter or CAS would spring into action. They would detect the unscheduled pods and initiate the process of adding a new node to the cluster. In a cloud environment, these tools already know which APIs to call to provision a new node and automatically register it to the cluster, eliminating the need for manual intervention. See the following image for an example of this process:

Figure 1.9 – Karpenter or CAS adding more nodes when it's needed

In Figure 1.9, you can see that once the new node is ready, the kube-scheduler will place the unscheduled pods on this newly available capacity. This process happens automatically, ensuring that your application can scale to meet demand without manual intervention or overprovisioning. Conversely, when demand decreases and fewer pods are needed (thanks to HPA or KEDA), these same tools ensure that any extra nodes added during the peak are removed, maintaining the cluster's efficiency, as shown below:

Figure 1.10 – Karpenter or CAS removing nodes when they're not needed anymore

For instance, as you can see in Figure 1.10, if traffic drops and fewer pods are needed, CAS or Karpenter might remove the extra nodes from the cluster, returning to the original configuration. It's important to note that we're advocating for horizontal scaling at this level as well - adding or removing entire nodes rather than trying to resize existing ones. Unlike with pods, there are no established projects for vertical autoscaling of nodes, as this approach introduces unnecessary complexity and potential instability to the cluster. By leveraging these node autoscaling capabilities in conjunction with pod-level autoscaling, you can create a highly responsive and efficient Kubernetes cluster. This setup allows your cluster to dynamically adapt to changing workloads, ensuring optimal resource utilization and cost-effectiveness. Hold tight, I'll tell you how to implement all of this in the following chapters. Now that I've covered the theoretical foundations of Kubernetes autoscaling, it's time to set up your practice environment for this book.

Hands-On: Creating a Kubernetes Cluster

Enough theory - it's time to get our hands dirty! Grab your computer, open your terminal, and prepare to dive into the practical side of Kubernetes autoscaling. In this section, I'll guide you through creating your very own Kubernetes cluster, providing you with a sandbox environment to experiment with the concepts we've discussed. If you already have a Kubernetes cluster that you could use to practice the workload autoscaling chapters, skip this section and go to the summary of the chapter.

Local Kubernetes cluster with Kind

Kind, short for Kubernetes in Docker, is a tool that enables you to create and manage local Kubernetes clusters by leveraging Docker containers as simulated nodes. It was primarily created for testing Kubernetes itself, but has become popular among developers for local development and continuous integration (CI) pipelines. Kind allows you to quickly spin up a multi-node Kubernetes cluster on your local machine, making it ideal for testing, learning, and developing Kubernetes-native applications without the need for a cloud provider or physical servers.

Installing Kind

Please visit the official site for the most recent installation methods: https://kind.sigs.k8s.io/docs/user/quick-start/. As of today, you can install Kind with go install, from source, using release binaries, or with community-managed packages like brew (Mac) or choco (Windows). If you're using Docker Desktop, you need to turn off Kubernetes. For instance, if you have a Mac, simply run this command to install Kind:

$ brew install kind

To confirm it's working, you can see which version you installed with this command:

$ kind version

Creating a Kind Cluster

To create a Kubernetes cluster, simply run this command:

$ kind create cluster --name kubernetes-autoscaling

Next, run this command to configure access to the cluster:

$ kubectl cluster-info --context kind-kubernetes-autoscaling

Confirm that you can use the cluster by getting the list of Kubernetes nodes:

$ kubectl get nodes

You should see an output similar to this that shows the Kubernetes nodes:

NAME                                   STATUS   ROLES           AGE   VERSION
kubernetes-autoscaling-control-plane   Ready    control-plane   15m   v1.31.0
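
By default, Kind creates a single-node cluster like the one above, which is enough for the workload autoscaling chapters. If you'd like to experiment with additional worker nodes, Kind also accepts a cluster configuration file. Here's a minimal sketch (the file name kind-config.yaml is just an example):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker           # each node runs as a separate Docker container
  - role: worker

You'd then pass it at creation time with kind create cluster --name kubernetes-autoscaling --config kind-config.yaml.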

You're now ready to use a local Kubernetes cluster without incurring any cloud costs. However, if you're able and willing to practice with an environment similar to what you'll interact with in Chapter 7 and beyond, proceed to the following section.

Cloud Kubernetes cluster in AWS

Let's create an Amazon EKS cluster using Terraform, an infrastructure as code tool that allows you to define your desired cluster state as code. You'll be using a custom Terraform template that leverages the Amazon EKS Blueprints for Terraform project. This template will set up a complete Amazon EKS cluster, including a VPC, a Kubernetes control plane, and the necessary IAM roles and service accounts. It will also configure a managed node group for essential system components. If you're not familiar with Terraform or HCL syntax, reviewing Terraform's documentation can be helpful (https://developer.hashicorp.com/terraform/docs).

Note

Terraform uses AWS credentials that must be configured in your environment. The recommended and more secure approach is to authenticate using: 1) named AWS CLI profiles (via aws configure --profile your-profile), 2) IAM roles via instance metadata (for EC2 or CloudShell environments), or 3) AWS SSO or IAM Identity Center. Avoid hardcoding AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY unless absolutely necessary, as this approach is less secure and harder to manage.

Creating an Amazon EKS Cluster

Start by cloning this repository using git to your computer:

$ git clone https://github.com/PacktPublishing/Kubernetes-Autoscaling.git

Change directory to the chapter one folder using this command:

$ cd Kubernetes-Autoscaling/chapter01/terraform

Before you start, make sure your AWS credentials are properly configured in the terminal you'll be using. Then, you need to create an environment variable for the AWS region you'll be using. For example, if you're using the Ireland region (eu-west-1), you need to run a command like this:

$ export AWS_REGION=eu-west-1

To begin the process of creating the cluster, run this command:

$ sh bootstrap.sh

This script includes several Terraform commands that must run in a specific order and will take around 20 minutes to complete. Once finished, configure access to the new cluster:

$ aws eks --region $AWS_REGION update-kubeconfig \
--name kubernetes-autoscaling

This command updates your ~/.kube/config file, allowing kubectl to connect to your new cluster. Confirm that you can use the cluster by getting the list of Kubernetes nodes:

$ kubectl get nodes

You should see an output similar to this that shows the Kubernetes nodes:

NAME                STATUS   ROLES    AGE     VERSION
ip-10-0-107-82...   Ready    <none>   4m18s   v1.30.2
ip-10-0-57-88...    Ready    <none>   4m18s   v1.30.2

The output has been shortened, but essentially what you should see is the list of nodes registered to the EKS cluster. These nodes will serve as a base for you to practice what you'll learn in the upcoming chapters.

Remove the Amazon EKS Cluster

To keep cloud costs at a minimum while you study, make sure you tear down the resources created by Terraform so you don't pay while not using the cluster. The next time you come back to study, you'll need to create the cluster again by following the steps from the previous section. To remove all resources created by Terraform, run the following command:

$ sh cleanup.sh

This script includes several other commands that need to run in a specific order.…

Summary

In this chapter, you've explored the fundamental concepts of Kubernetes autoscaling, understanding its critical role in achieving efficient and cost-effective Kubernetes clusters. You've learned about the complexities of scaling both application workloads and infrastructure nodes, examining the challenges and considerations that come with each. You've learned about the importance of proper resource allocation, the benefits of horizontal scaling, and the tools Kubernetes provides to automate these processes. By implementing the autoscaling mechanisms discussed here, you're not just optimizing for performance and cost – you're also contributing to a more sustainable approach to cloud computing. Efficient resource utilization means less wasted energy and a reduced carbon footprint. As we move forward, remember that every optimization you implement, every unnecessary pod you eliminate, and every node you rightsize is a step towards more sustainable practices. In the world of Kubernetes, we'll truly be saving the planet, one pod at a time.

2 Workload Autoscaling Overview

In the previous chapter, I covered the importance of autoscaling in general, the concept of efficiency in compute, and how to implement autoscaling in Kubernetes. I also introduced the two main categories of autoscaling in Kubernetes: infrastructure and workload. Building on that foundation, this chapter focuses on the latter - workload autoscaling. Efficiently scaling workloads in a Kubernetes cluster is crucial for maintaining optimal performance and resource utilization. Building efficient workloads will have an impact on how many nodes you'll end up needing in a Kubernetes data plane. This chapter goes deeper into workload autoscaling, covering the challenges, strategies, and tools available to help you get started with building efficient workloads. Additionally, I'll explore the key components that contribute to effective workload autoscaling, along with its challenges. Most importantly, this chapter will focus on understanding how the Kubernetes scheduler works, the importance of observability, and why workload rightsizing is so crucial. With this groundwork laid, I'll then cover the different workload autoscaling mechanisms with hands-on sections, such as HPA, VPA, and KEDA. I briefly introduced them in the previous chapter, and now we'll learn their basics. After this, you'll gain hands-on experience by diving deeper into these concepts in the next chapters. By the end of this chapter, you'll have a solid knowledge base of how to implement workload autoscaling in Kubernetes, so you can then go deeper into each of the workload autoscaling mechanisms I mentioned before. In this chapter, we'll cover the following topics:

  • Challenges of workload autoscaling
  • How does the Kubernetes scheduler work
  • Workload rightsizing
  • Workload autoscalers

Technical Requirements

For this chapter, you'll continue using the Kubernetes cluster you created in Chapter 1. You can continue working with the local Kubernetes cluster; there's no need yet to work with a cluster in AWS. If you turned down the cluster, make sure you bring it up for every hands-on lab in this chapter. You don't need to install any additional tools, as most of the commands are going to be run using kubectl. You can find the YAML manifests for all resources you're going to create in this chapter in the chapter02 folder of the book's GitHub repository (https://github.com/PacktPublishing/Kubernetes-Autoscaling). During this chapter, you're going to use a sample application that I built and pushed to the Docker registry. If you want to explore the source or build and push the image to a different registry, you can find all the source code in the chapter02/src/ folder.

Challenges of autoscaling workloads

Before I dive into any solutions or recommendations on Kubernetes, it's important to recognize that while the concept of workload autoscaling may seem straightforward, its implementation can be quite complex. Let me explain why by describing some of the key challenges that arise when attempting to scale workloads efficiently. For starters, it's important to understand the interconnected nature of workload and infrastructure scaling. While workloads are typically scaled based on resource utilization, nodes should be scaled according to resource reservation. These two aspects are intrinsically linked, and their interaction can lead to unexpected outcomes. For instance, if workload replicas are increased without proper rightsizing, you may find yourself needing more nodes to accommodate all the necessary workload replicas, potentially leading to inefficient resource usage. The diversity of application behaviors presents another significant challenge. Consider a simple application, such as a backend API, a shopping portal, or a Pi calculation using the Monte Carlo simulation, where CPU utilization directly correlates with the number of requests per second. This is illustrated in the following figure:

Figure 2.1 – CPU utilization might correlate to the number of requests per second

In Figure 2.1, you can see that as the number of requests grows, the CPU utilization grows as well. In such cases, setting scaling policies based on CPU utilization thresholds might be very effective. However, not all applications behave in this predictable manner, nor do they have the same requirements. This variability in application behavior is a major challenge, and it's important to note that this issue isn't unique to containers or Kubernetes but is relevant across various computing environments. Some applications may be memory-intensive or may even reserve a specific amount of memory at startup, employing their own mechanisms for memory management such as garbage collection or defragmentation. In these cases, memory utilization alone may not be an adequate metric for scaling decisions. Moreover, there are scenarios where traditional resource metrics like CPU and memory utilization may appear normal, but the application still experiences issues. For example, as shown in Figure 2.2, some requests might fail due to increased latency or intermittent network problems when interacting with external dependencies, especially if proper retry mechanisms are not in place.

Figure 2.2 – Applications might still fail even if CPU and memory utilization looks good

The key takeaway is that while CPU and memory metrics might be suitable scaling indicators for some applications, more complex applications may require consideration of other factors such as latency, number of requests, queued messages, or even business key performance indicators (KPIs). Therefore, it's crucial to invest time in understanding how your application behaves under various load conditions to determine the most appropriate metrics for scaling. This knowledge will inform your autoscaling strategy and help you avoid potential pitfalls. As I move forward, you'll explore how to make use of these different metrics I mentioned to configure your scaling policies, what factors to consider, and how Kubernetes schedules pods. The following sections will focus on these aspects, primarily from a Kubernetes perspective, with the aim of highlighting why it's vital to understand what impacts your application's performance and scaling needs. Let's begin by examining how the kube-scheduler works, as this forms the foundation for efficient workload autoscaling in Kubernetes.

How does the Kubernetes scheduler work?

This important job of scheduling pods is done by the kube-scheduler, Kubernetes' default scheduler, which is responsible for assigning pods to nodes based on various factors, including but not limited to nodeSelectors, affinities, tolerations, and most importantly, resource requests. For now, we'll focus primarily on resource constraints.Essentially, the kube-scheduler operates in three main steps:

  1. Filtering: It identifies all nodes that can accommodate the pod based on available resources, mainly memory and CPU. If there are none, the pod is deemed unschedulable.
  2. Scoring: It ranks the filtered nodes to determine the most suitable one.
  3. Binding: It assigns the pod to the node with the highest score, choosing randomly if multiple nodes are equally suitable.

While the operation of the kube-scheduler involves many intricate details and variables, our focus will be on one crucial aspect: the pod's resource requirements. Understanding these requirements is essential because they directly influence the scheduler's decision-making process.In Kubernetes, you can specify these resource needs through the pod configuration. Specifically, you can define both the resources a container requires to run (requests) and the maximum resources it's allowed to use (limits). These specifications, typically expressed in terms of CPU and memory, play a vital role in how the kube-scheduler filters and scores nodes for pod placement.

Configuring requests

When you define resource requests for containers within a pod, you're basically telling the kube-scheduler how many resources the node needs to guarantee for the pod. This information helps the scheduler make informed decisions about where to place your pod in the cluster, ensuring it lands on a node with sufficient resources to meet its needs. The following image illustrates how the kube-scheduler places a pod on a node where it fits based on its resource requests:

Figure 2.3 – kube-scheduler scheduling a pod to the node with the best ranking score

In Figure 2.3, you can see that the kube-scheduler decided to schedule the pod on the node that has the best ranking score. The second node is the one that has the resources available for the new pod coming to the cluster, while the other two were already at their capacity. While this scheduling mechanism ensures that pods have access to their requested resources, it doesn't always lead to optimal resource utilization. In practice, pods may not use all the resources they've requested. This can result in low node efficiency, as there's often a discrepancy between requested and actually utilized resources. Consequently, this mismatch can lead to underutilization of cluster capacity, highlighting the importance of rightsizing pod resource requests. For an effective autoscaling configuration, you should always configure requests carefully. Moreover, this practice ensures predictable and deterministic results when the kube-scheduler assigns pods to nodes and is crucial for setting up effective scaling policies. Without proper resource specifications, you may encounter unexpected behavior, inefficient resource utilization, and potential performance issues as your application scales.

Configuring limits

Resource management in Kubernetes extends beyond scheduling. While resource requests are used for pod placement, you can also configure resource limits. These limits are enforced during runtime by the kubelet on each node. The kubelet ensures that containers don't consume resources beyond their specified limits, even if the node has additional capacity available. If a container attempts to exceed its limit, the kubelet will restrict its resource usage or terminate the process, depending on the resource type (compressible or non-compressible) and the severity of the violation. Look at the following image:

Figure 2.4 – A pod requesting 1 GiB, but limiting to 1.5 GiB in a node with 8 GiB of capacity

In Figure 2.4, you can see that the pod is requesting 1 GiB of memory as the minimum guaranteed allocation. However, it's setting a limit of 1.5 GiB to cap its maximum resource consumption. The bar at the side of the pod represents how much of the allocated resources the pod is currently using.

What if you don't specify resource requests or limits?

Pods without resource requests are scheduled on a best-effort basis, but they will be the first ones to be evicted or throttled when the node experiences memory or CPU pressure. If no limits are configured, the pod can consume as many resources as it needs. As a safety mechanism, you could configure a LimitRange for a given namespace. A LimitRange is a Kubernetes policy object that defines resource constraints within a namespace. It will apply default values if you don't set resources in your pod, ensuring efficient resource allocation. Alternatively, you can configure a webhook to set default values before pods are scheduled; tools like Kyverno can simplify this task.
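
As an illustration, here's a minimal LimitRange sketch that applies default requests and limits to any container created without them; the namespace and values are examples you'd adjust for your own workloads:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: team-a            # example namespace
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container specifies no requests
        cpu: 250m
        memory: 256Mi
      default:                 # applied when a container specifies no limits
        cpu: 500m
        memory: 512Mi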

Pod configuration example

In Kubernetes, you configure the resource units like this:

  • CPU: It is measured in units, where 1 CPU unit is 1 physical or virtual core. These units can be fractional, like 0.5 (half of a CPU) or 0.2 (200 millicpu, or 200m). You'd typically see the "m" format.
  • Memory: It is measured in bytes. It can be configured as a plain number, or with a suffix like M (megabytes) or Mi (mebibytes). There are other suffixes, but let's keep it short and simple for now. You'd typically see the "Mi" suffix.

Here's an example configuration of a pod with resource requests and limits configured:

apiVersion: v1
kind: Pod
metadata:
  name: montecarlo-pi
spec:
  containers:
  - name: montecarlo-pi
    image: christianhxc/montecarlo-pi
    resources:
      requests:
        memory: "512Mi"
        cpu: "900m"
      limits:
        memory: "512Mi"
        cpu: "1200m"

Notice how the CPU limit is higher than the CPU request, while the memory limit equals the request. The kube-scheduler will guarantee that the selected node has at least 900m of CPU and 512Mi of memory available, because scheduling decisions are based on requests, not limits.

What if the pod exceeds the resource limits?

CPU is a compressible resource; when a container hits its CPU limit, it is throttled (it gets no more CPU time) but can continue running. Memory is different. If a container reaches its memory limit, the process trying to use more memory is killed in an out-of-memory (OOM) event. Typically, this causes Kubernetes to restart the container.

Recommendations for configuring resources and limits

Use your judgment after testing how your application behaves. You could run a set of tests, or monitor live traffic. Generally speaking, you could consider the following recommendations:

  • For CPU, consider setting limits slightly higher than requests: This approach allows for more flexible CPU utilization. When you specify CPU requests with slightly higher limits, you let your workloads utilize available CPU capacity above what has been requested, adapting to varying demand (especially if it's spiky) without being constrained too much. It's particularly useful for applications with fluctuating CPU needs, as it allows them to use more CPU when necessary with flexible restrictions.
  • For memory, consider setting requests equal to limits: This gives your containers predictable behavior, and the kube-scheduler provides the highest priority and stability for your workloads. If you still want to use different values, be cautious about setting limits significantly higher than requests. This can lead to node overcommitment, as the kube-scheduler may place more pods on a node than it can actually handle under peak conditions, potentially causing workload interruptions or out-of-memory events.

By now, you have a better understanding of how the kube-scheduler schedules pods, and why it's very important to configure resource requests for each of the containers in a pod. Next, let's talk about ways of configuring these request values properly, as this is going to become the most important process for having an efficient Kubernetes data plane.

Workload Rightsizing

What does rightsizing mean in the context of Kubernetes? Rightsizing refers to the process of accurately configuring resource requests and limits so that pods only ask for the resources they actually need. This practice aligns with the efficiency and cost optimization goals you're aiming to achieve within your clusters. This section introduces a simple approach to configuring proper resource requests for your pods. Unfortunately, rightsizing is often left until the end, when organizations notice that their cluster costs are escalating due to overprovisioning of resources. Our aim is to guide you on the right path from the very beginning. After all, if you need to configure resource requests, why not do it correctly from day one? To start this process effectively, it's important to have visibility into the resource utilization of workloads running in your cluster. Let's begin by guiding you through the steps to configure your cluster for monitoring resource usage. This visibility will form the foundation for making informed decisions about rightsizing your pods, ensuring that they request only the resources they truly need for optimal performance.

Monitoring

First, you need to set up tooling to monitor how many resources your pods are actually using. This information is crucial for making decisions based on data rather than speculation, or on configurations carried over from virtual machines or a previous environment. It's also important to have monitoring in place because rightsizing is a continuous process, not something you do just once. Applications are constantly changing, adding or removing functionality, which affects their resource needs.

Note that the ecosystem of tools and platforms to help you monitor and rightsize your applications is growing. Some of these options are paid, but as your Kubernetes cluster keeps expanding, paid options might buy you time to focus on solving customer problems rather than reinventing the wheel. In this book, we'll rely on open-source tools like Prometheus and Grafana to help you gain hands-on experience without having to analyze which tool is best for your specific use case. However, keep in mind that in the long run, you might derive more value from a third-party tool tailored to your needs.

Prometheus and Grafana

Prometheus is an open-source solution that digs into your cluster, pulling metrics from all over: the API server, your nodes, and even individual pods. It doesn't just gather data; it turns your cluster into a data-rich environment, enabling you to query that data and alert based on rules you specify. Grafana, complementing Prometheus, helps you visualize this data through interactive dashboards. Combined, these tools provide an excellent starting point for observability. Prometheus collects and stores metrics about your cluster's performance, while Grafana transforms this data into visual dashboards, allowing you to easily identify trends, anomalies, and potential issues. This combination not only helps you observe but also understand and optimize the resource request configurations of your pods, which is something you'll want to do regularly.
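
For example, once the stack from the next hands-on section is running, you could paste a query along the following lines into the Prometheus UI or Grafana's Explore view to see per-pod CPU usage in the default namespace. The metric name comes from cAdvisor, which the kube-prometheus-stack scrapes by default; treat the exact label filters as a starting point rather than a definitive recipe:

sum(rate(container_cpu_usage_seconds_total{namespace="default", container!=""}[5m])) by (pod)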

Hands-On: Setting up Prometheus and Grafana

Head over to your terminal and make sure your cluster is up and running. If you need to bring it up again, go back to Chapter 1, then come back here.

To install Prometheus and Grafana using Helm, you first need to add the Prometheus community Helm chart repository, update the Helm repos, and install the Prometheus stack. To do so, run these commands:

$ helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
$ helm repo update
$ helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace

When the installation is done, you should be able to open Grafana. To do so, you first need to get the administrator user and password to log in.

To get the Grafana admin user, run this command:

$ kubectl get secret prometheus-grafana -n monitoring \
-o jsonpath="{.data.admin-user}" | base64 --decode ; echo

To get the Grafana admin password, run this command:

$ kubectl get secret prometheus-grafana -n monitoring \
-o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Finally, to access Grafana, run this command:

$ kubectl port-forward service/prometheus-grafana 3000:80 -n monitoring

Open a web browser and enter http://localhost:3000/. You should see the Grafana login page shown below. Enter the user and password, and click Log In:

Figure 2.5 – Grafana UI login page

For now, that's it; you'll come back to Grafana in a moment.

Hands-On: Determining the right size of an application

Now that you have access to Grafana and Prometheus, let's deploy an application to the cluster, make some calls, and analyze the resource utilization. To get started, run the following command to deploy an application that runs Pi simulations using the Monte Carlo method (it involves running multiple simulations with random sampling to obtain numerical results):

$ kubectl create deployment montecarlo-pi \
--image=christianhxc/montecarlo-pi:latest

This command creates a Kubernetes deployment using default values. You can explore the pod definition, but you won't see any resources section. Nonetheless, notice that the pod has been scheduled despite not having resource requests configured.

Now, expose the deployment through a service by running this command:

$ kubectl expose deployment montecarlo-pi --port=80 --target-port=8080

Similar to before, this command creates a Kubernetes service using default values.

Let's generate some traffic to the application by making a call every 10 milliseconds:

$ kubectl run -i --tty load-generator --rm --image=busybox:1.28 \
--restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q \
-O- http://montecarlo-pi/monte-carlo-pi?iterations=10000000; done"

This command creates a pod that will terminate once you stop the command. The pod uses a BusyBox image to call the service endpoint with the wget command. As a result, you'll see the output of every call to the application. As long as you don't see any errors, leave it running for about three to five minutes. Then, go back to Grafana and open the Dashboards page located at the following URL: http://localhost:3000/dashboards. The Grafana stack you've deployed comes with a default set of dashboards. Let's open the one named Kubernetes | Compute Resources | Workload and change the time range to show the data from the last five minutes.

Note

If the Kubernetes | Compute Resources | Workload dashboard is not available in Grafana by default, you can find the JSON model in the GitHub repository under /chapter02/grafana/kubernetes_resource_workload.json. To import it, create a new dashboard and import it using the JSON model.

You should see something similar to the following image:

Figure 2.6 – Dashboard "Kubernetes / Compute Resources / Workload" showing the CPU usage

For now, let's focus only on CPU usage. In Figure 2.6, you can see that the application is using 976m of CPU on average. Stop the command (Ctrl + C) you ran to generate traffic to the application; you'll see the CPU usage start to go down. Now that you have data showing how much CPU your application is using, let's configure a proper CPU request by adding roughly 20% of headroom, requesting 1200m of CPU. Since the observed value is close to the maximum usage the pod is likely to have, this headroom gives you a buffer in case of a traffic spike. Run the following command to adjust the resource requests:

$ kubectl patch deployment montecarlo-pi --patch \
'{"spec": {"template": {"spec": {"containers": [{"name": "montecarlo-pi", "resources": {"requests": {"cpu": "1200m"}}}]}}}}'

Let's send some traffic again for another three to five minutes:

$ kubectl run -i --tty load-generator --rm --image=busybox:1.28 \
--restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q \
-O- http://montecarlo-pi/monte-carlo-pi?iterations=10000000; done"

After five minutes, stop the command (Ctrl + C) and go back to Grafana. You should now see a CPU Requests % column that tells you how much of the requested resources the pod is using, as in the following image:

Figure 2.7 – Dashboard showing the CPU used and requested.

In this case, we could say that the pod is running at 80.8% efficiency, which is not a bad number. You could tune the requests to achieve a higher or lower percentage, but you need to make sure that changing these values doesn't affect the application's performance. We'll use this data later to configure effective autoscaling policies.

Establishing defaults

As you've seen by now, proper monitoring is crucial for configuring appropriate resource requests and limits for your pods. However, for some organizations, implementing rightsizing properly might take some time. Now that you understand its importance, you may want to at least enforce defaults when pods are deployed to Kubernetes without specified request values.

In Kubernetes, you can establish default configurations for memory and CPU at the namespace level without any third-party tools. Alternatively, if you need more granular control or different mechanisms, you can configure mutating admission webhooks that modify a pod spec before it is persisted by the Kubernetes API server, enforcing default request values. Tools like Kyverno, Open Policy Agent (OPA), or Gatekeeper are candidates to consider for this purpose.

Establish default requests and limits

Kubernetes provides the LimitRange object to set default values for CPU and memory requests and/or limits when a container doesn't specify them. You can also configure minimum and/or maximum values that a container or a pod must meet; otherwise, the pod won't be created in that namespace. LimitRange rules apply only to new pods; they don't modify pods that are already running in the cluster, and the same is true if you modify an existing LimitRange.

Note that even though you configure these values at the namespace level, the constraints apply to individual pods. If you want to limit the total resources all pods (aggregated) within a namespace can use, you should use the ResourceQuota object instead. A ResourceQuota sets aggregate resource consumption limits for a namespace, controlling the total amount of CPU, memory, and other resources that can be used by all pods within that namespace.
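
For illustration, here's a minimal sketch of a ResourceQuota. The name, namespace, and values are arbitrary placeholders; adjust them to your own environment:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi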

Note

LimitRange can be configured to enforce storage requests and/or limits values for PersistentVolumeClaim as well, but we won't cover that in this book.

Here's an example of a LimitRange policy to set default requests and limits values:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-defaults
spec:
  limits:
  - default:
      cpu: 1200m
    defaultRequest:
      cpu: 900m
    type: Container

With this rule, you're telling Kubernetes that every container in a new pod within this namespace gets default requests and limits values if they're not explicitly specified. If you want to apply the configuration to the pod as a whole rather than to individual containers, set the type to Pod. Following what you did in the monitoring section, if you determine that a container uses a maximum of 900m of CPU, you might configure a defaultRequest (requests) of 900m and a default (limits) of 1200m.

While LimitRange is a good protection mechanism when teams are not accustomed to setting request values for the containers in a pod, I strongly recommend aiming to configure request values for each container individually. These values should reflect the actual resource usage of the containers. Applying default values indiscriminately might lead to overprovisioning of nodes. Additionally, if you set a rule of type Pod, you might lose control over the number of containers a pod could end up having.

Now, let's get hands-on again and explore how to set up default requests for memory and CPU at the namespace level in Kubernetes.

Hands-On: Setting up default requests for CPU and Memory

Get back to the command line, and let's remove the previous deployment:

$ kubectl delete deployment montecarlo-pi

Let's create a dedicated namespace to configure default requests for future pods:

$ kubectl create namespace montecarlo-pi

Next, create a LimitRange rule to configure default requests for CPU and memory:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-memory-requests
  namespace: montecarlo-pi
spec:
  limits:
    - defaultRequest:
        cpu: 900m
        memory: 512Mi
      type: Container
EOF

Deploy the same application as in the previous section, the Monte Carlo simulation:

$ kubectl create deployment montecarlo-pi \
--image=christianhxc/montecarlo-pi:latest -n montecarlo-pi

You didn't configure any resource requests, but the LimitRange rule should have modified the pods. Inspect the pod created using this command:

$ kubectl get pod -n montecarlo-pi -o yaml

You should see the pod specification in YAML format, but pay close attention to the resources section from the container:

  ...
  spec:
    containers:
    - image: christianhxc/montecarlo-pi:latest
      imagePullPolicy: Always
      name: montecarlo-pi
      resources:
        requests:
          cpu: 900m
          memory: 512Mi
  ...
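
If you'd rather pull out just the resources section instead of scrolling through the full YAML, a jsonpath query like the following also works (kubectl get pod returns a list here, hence .items[0]):

$ kubectl get pod -n montecarlo-pi \
-o jsonpath='{.items[0].spec.containers[0].resources}'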

As you can see, the safety mechanism for setting default configuration values has worked effectively. With this in place, you're now ready to configure autoscaling policies for your workloads in a consistent and deterministic manner.

Let's explore a few options you have for scaling your applications.

Workload Autoscalers

The reason I've been discussing how the kube-scheduler works and emphasizing the importance of configuring proper request values is to lay the groundwork for setting up scaling policies based on accurate data. This approach will help your workloads become more efficient, reducing waste without compromising application performance.

As mentioned earlier, this book will focus on three key workload autoscalers: HPA, VPA, and KEDA. I'll introduce each by explaining their purpose, the problems they solve, and how they might be used in combination. In the following chapters, we'll dive deeper into the details of how to implement and optimize these autoscalers, ensuring you can leverage their full potential.

Horizontal Pod Autoscaler (HPA)

HPA is often the starting point, and sometimes the only option people use for autoscaling their workloads. It's simple and a native Kubernetes resource. HPA has been an integral part of the ecosystem since its introduction in Kubernetes 1.1. Developed by the Kubernetes community, HPA addresses the challenge of managing fluctuating workloads running in Kubernetes. It automatically adjusts the number of pod replicas (horizontal scaling) for compatible Kubernetes objects such as Deployments, ReplicaSets, and StatefulSets.

HPA's primary purpose is to solve the problem of efficiently scaling applications in response to varying demand. It does this by monitoring specified metrics and adjusting the number of replicas accordingly. While CPU utilization is the default metric, HPA can also work with memory usage, custom metrics, and even external metrics. These metrics are sourced from various components within the Kubernetes ecosystem. The Metrics Server provides CPU and memory data, while systems like Prometheus can supply custom metrics. For more complex scenarios, external monitoring systems can feed metrics into HPA.

To illustrate HPA's functionality, consider a web application like the Monte Carlo simulation we've been using, experiencing varying traffic throughout the day. HPA can be configured to maintain an average CPU utilization of 70% across all pods. As shown in Figure 2.8, as traffic increases and CPU usage rises above this threshold, HPA will automatically increase the number of pod replicas. Conversely, during periods of low traffic, it will scale down the number of replicas, optimizing resource usage and potentially reducing costs.

Figure 2.8 – HPA scaling out based on CPU utilization

HPA efficiently manages the number of pod replicas, but it doesn't adjust the resources of individual pods or containers. When it comes to adjusting the resources allocated to individual pods, HPA isn't the solution. For that type of scaling, we have VPA.

Vertical Pod Autoscaler (VPA)

VPA is designed to automatically adjust the CPU and memory resources allocated to containers in pods. Unlike its horizontal counterpart HPA, VPA focuses on optimizing the resource requests and limits of individual containers, rather than scaling the number of pod replicas. VPA was introduced by Google in 2018 as part of their efforts to enhance Kubernetes' autoscaling capabilities. While it's not a native Kubernetes resource like HPA, VPA has become part of many Kubernetes deployments due to its ability to fine-tune resource allocation.

VPA can be applied to various Kubernetes objects, including Deployments, ReplicaSets, StatefulSets, DaemonSets, and even individual pods. It addresses the common problem of over- or under-provisioning resources, which can lead to either wasted cluster capacity or performance issues. To make its scaling decisions, VPA relies on historical and current resource usage metrics. It primarily focuses on CPU and memory utilization, gathering this data from the Metrics Server.

Consider a scenario with a large, legacy monolithic application or a complex database system that's challenging to scale horizontally due to intricate internal dependencies or stateful components. In Figure 2.9, as the data grows, the system's resource needs increase. VPA would continuously monitor this application's resource usage to recommend or adjust its requests accordingly. Conversely, if it determines that the system is consistently using fewer resources, it can recommend or automatically apply lower resource requests.

Figure 2.9 – VPA scaling up based on CPU utilization

When VPA needs to adjust the resources for a pod, it typically does so by deleting the existing pod and creating a new one with the updated resource specifications. This process can lead to a brief period of downtime for that specific pod. Additionally, even though VPA can work alongside HPA, they should be used carefully when combined, as their actions can potentially conflict. We'll address these concerns in the next chapter.

Kubernetes Event-Driven Autoscaling (KEDA)

HPA provides a solid foundation for autoscaling in Kubernetes, but relying solely on CPU and memory metrics may not capture the full picture of how an application performs under varying loads. Certainly, HPA can use custom metrics, but this can add complexity when configuring scaling policies. This is where KEDA comes into play, offering a flexible yet simple approach to scaling.

KEDA is an open-source project initially developed by Microsoft and Red Hat in 2019. Although it's not a native Kubernetes resource, KEDA has gained significant traction in the community due to its versatility and power. It extends Kubernetes' autoscaling capabilities beyond traditional resource metrics, allowing for scaling based on event sources and custom metrics.

Similar to the other autoscalers, KEDA can be applied to various Kubernetes objects, including Deployments, StatefulSets, and custom resources. Its primary purpose is to scale applications based on events and application-specific metrics, rather than just system-level resource utilization. This makes KEDA particularly useful for event-driven architectures, microservices, and applications with unpredictable or bursty workloads.

One of KEDA's key differentiators is its ability to use a wide range of metrics for scaling decisions. Unlike HPA, which primarily focuses on CPU and memory, KEDA can scale based on metrics from various sources such as message queues, databases, streaming platforms, and other external systems. This allows for more nuanced and application-aware scaling decisions. For example, as per Figure 2.10, consider a microservice that processes orders from a message queue:

Figure 2.10 – KEDA scaling out based on Latency

As evident in the figure, with KEDA, you could scale this service based on the number of messages in the queue. As the queue length grows, KEDA automatically scales up the number of pods to process orders more quickly. When the queue empties, it scales back down, optimizing resource usage.

KEDA is not without its limitations. You might find that a specific connector you need isn't available, or conversely, you might feel overwhelmed by the sheer number of connectors to choose from. It's worth noting, though, that the KEDA maintainers are addressing quality concerns by requiring any new connector to implement automated test suites, ensuring reliability and consistency across the growing ecosystem of scalers.

That's all for the basics of Kubernetes workload autoscalers. Each has its strengths, and knowing when to use HPA, VPA, or KEDA can make a big difference in how your workloads perform. In the following chapters, we'll dive deeper into these tools and see how you can use them to scale your workloads automatically as you work toward an efficient Kubernetes cluster.

Summary

In this chapter, we've explored the fundamental concepts of workload autoscaling in Kubernetes, emphasizing the importance of efficient resource management and the challenges that come with it. We've seen how proper monitoring and rightsizing are crucial for optimizing your cluster's performance and cost-efficiency.

We introduced three key autoscaling mechanisms: HPA, VPA, and KEDA. Each of these tools addresses different aspects of the autoscaling challenge. HPA focuses on adjusting the number of pod replicas based primarily on CPU and memory metrics. VPA complements this by fine-tuning the resource allocation within individual pods, which is particularly useful for applications that are difficult to scale horizontally. KEDA extends these capabilities further by enabling scaling based on events and application-specific metrics, offering more granular control over your scaling policies.

By leveraging the right combination of these tools, you can ensure that your applications respond dynamically to changing workloads while optimizing resource utilization. In the next chapter, we will delve deeper into implementing HPA and VPA.

3 Workload Autoscaling with HPA and VPA

After exploring the workload autoscaling fundamentals in Kubernetes, it's time to turn our attention to two of the primary autoscaling mechanisms: the Horizontal Pod Autoscaler (HPA) and the Vertical Pod Autoscaler (VPA). This chapter will provide an in-depth look at HPA and VPA, expanding on the foundational concepts we've already discussed. We'll start by diving deep into HPA, examining its basic functionality and its integration with the Kubernetes Metrics Server.

Through hands-on exercises, you'll learn how to implement scaling using both basic and custom metrics with HPA. We'll cover best practices for configuring autoscaling policies with HPA. Next, we'll shift the focus to the VPA. We'll break down how it works and the different ways you can use it. You'll gain practical experience in implementing automatic vertical scaling and leveraging VPA's recommender mode for rightsizing your applications.

As we progress, we'll see how HPA and VPA can work together to keep your workloads running smoothly and efficiently, demonstrating how these tools can be combined to achieve continuous workload efficiency. By the end of this chapter, you'll have a thorough understanding of HPA and VPA, their configuration options, and best practices for implementation. This deep dive into these two critical autoscaling tools will set the stage for KEDA in the subsequent chapter, further expanding your workload autoscaling capabilities in Kubernetes.

In this chapter, we'll cover the following topics:

  • The Kubernetes metrics server
  • Horizontal Pod Autoscaler: Basics
  • HPA custom metrics
  • Vertical Pod Autoscaler: Basics
  • How to work with HPA and VPA together

Technical Requirements

For this chapter, you'll continue using the Kubernetes cluster you created in Chapter 1. You can keep working with the local Kubernetes cluster. If you tore the cluster down, make sure you bring it back up before each hands-on lab in this chapter. You don't need to install any additional tools, as most of the commands are run using kubectl and helm. You can find the YAML manifests for all the resources you're going to create in this chapter in the chapter03 folder of the book's GitHub repository you already cloned in Chapter 1.

The Kubernetes Metrics Server

Before diving into the details of how to use HPA, let me cover a crucial component: the Kubernetes Metrics Server. In the previous chapter, I explored the use of Prometheus and Grafana for rightsizing pods based on historical usage patterns. However, HPA, by default, utilizes a different source of metrics to determine when to scale pod replicas.

The Metrics Server plays a crucial role in this process, providing real-time resource utilization data that HPA relies on for making scaling decisions. It serves as the primary source of container resource metrics for Kubernetes' autoscaling mechanisms. In the following sections, I'll cover the what and why of the Metrics Server in detail, discussing its purpose, importance, and implementation within a Kubernetes cluster.

Metric Server: The What and the Why

The Kubernetes Metrics Server is a cluster-wide aggregator of resource usage data. While not a built-in component of Kubernetes, it's a crucial add-on that collects and aggregates essential metrics about the resource consumption of nodes and pods in a cluster. The Metrics Server is designed to support Kubernetes' built-in autoscaling mechanisms. It focuses solely on gathering CPU and memory usage for nodes and pods, making it a lean and purpose-built tool.

As shown in Figure 3.1, the Metrics Server operates by collecting resource metrics from the kubelet on each node. It aggregates this data and makes it available through the Kubernetes Metrics API. The process begins with the kubelet collecting resource usage statistics via its cAdvisor integration. The Metrics Server then queries each node's kubelet for this data every 15 seconds by default. After aggregating the data, it exposes it through the Metrics API, allowing Kubernetes components like HPA to query for up-to-date metrics.

Figure 3.1 – How HPA collects metrics for CPU and memory from the applications

As you've seen, the Metrics Server only provides the current state of resource usage, not historical data. Its focus is solely on CPU and memory metrics, and the data is stored in memory rather than persisted to disk. The Metrics Server is meant only for autoscaling your workloads; for monitoring needs, solutions like Prometheus, or any paid alternative, will serve you better. You can even use Prometheus as a source for scaling on custom metrics, which I'll cover later in this chapter.

Now that you have a better understanding of the Metrics Server and why it's important, let's set it up in a Kubernetes cluster.

Hands-On: Setting up Metrics Server

Note

If you're using the Kubernetes cluster in AWS created using the Terraform templates in Chapter 1, you can skip this section as the metrics server comes pre-installed in the Terraform blueprint I've provided for the book.

To install the Metrics Server using Helm, follow these steps.

Add the official Metrics Server Helm repository:

$ helm repo add metrics-server \
https://kubernetes-sigs.github.io/metrics-server/

Update your Helm repository cache:

$ helm repo update

Install the Metrics Server:

$ helm install metrics-server metrics-server/metrics-server

Optionally, if you're using a local or self-hosted Kubernetes cluster, you might need to disable TLS certificate verification. In this case, use the following command instead:

$ helm install metrics-server metrics-server/metrics-server \
--set args="{--kubelet-insecure-tls}"

Verify the installation by checking if the Metrics Server pod is running:

$ kubectl get pods -n kube-system | grep metrics-server

Once these steps are completed, your Metrics Server should be up and running. You can verify that this pod is properly configured with CPU (100m) and memory (200Mi) requests. While you can adjust this configuration if needed, these default settings should provide good performance for most clusters with up to 100 nodes.

Hands-On: Using Metrics Server

Once you have this add-on installed, HPA and VPA will use it behind the scenes. It also exposes some of the data it collects through the kubectl top command, which lets you see how many resources nodes or pods are consuming.

For instance, let's view the resource usage of all the nodes in your cluster with:

$ kubectl top nodes

You should see an output similar to this (depending on how many nodes):

NAME         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-...  1676m        42%    1691Mi          11%
ip-10-0-...  2402m        61%    3209Mi          21%

As you can see, it's a much simpler version of the Linux top command, which is typically used for monitoring purposes. It shows the CPU millicores and memory bytes each node is using, and what percentage of the node's capacity that represents at the moment you ran the command.

You can also see the resource usage for all the pods in a given namespace:

$ kubectl top pods -n default

You should see an output similar to this (depending on which pods you have running):

NAME                             CPU(cores)   MEMORY(bytes)
montecarlo-pi-5d9d967b77-hh87b   14m          36Mi

Notice that for pods, it only shows the usage numbers, not the percentages, but that's sufficient for HPA and VPA. Remember, this command and the Metrics Server shouldn't be used as a monitoring tool. However, for a quick status overview, it's adequate. If you want to learn more about other options, like sorting or filtering resources, add the --help flag to each command.

Now that you have the Metrics Server working, let's get into the details of how to configure autoscaling policies with HPA and VPA.

Horizontal Pod Autoscaler: Basics

I introduced you to HPA in the previous chapter. The main job of HPA is to handle dynamic workloads by automatically adjusting the number of pod replicas of a Deployment, a StatefulSet, or any resource that implements the scale subresource, based on demand. HPA typically looks at CPU or memory usage to make its decisions, but it's not limited to those metrics; it can also work with custom metrics or external data, which I'll cover later in this chapter. So, how does HPA decide to add pods, remove pods, or simply not act? Let's see.

How HPA scale resources?

HPA operates on a feedback loop, using a simple calculation to determine the number of replicas needed. Let's say you have a Deployment targeting 70% CPU utilization across all its pods. The process begins with HPA querying the Metrics Server (typically every 15 seconds) to get the current CPU utilization for all pods in the target Deployment. It then calculates the average CPU utilization across these pods.

HPA uses the following formula to determine the desired number of replicas:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

Following with the 70% CPU target example, if the current average CPU utilization (currentMetricValue) is 80% and we have 5 replicas, HPA would scale out to 6 replicas:

desiredReplicas = ceil[5 * (80 / 70)] = 6

If the current utilization then drops to 60%, HPA would go back to 5 replicas:

desiredReplicas = ceil[5 * (60 / 70)] = 5

When you're using multiple metrics, the desiredReplicas calculation is done for each metric, and the largest result is chosen. Keep this in mind, as it will become an important aspect in Chapter 5. It's also important to note that HPA has a tolerance (0.1, or 10%, by default) to prevent unnecessary scaling for minor fluctuations: if the ratio of currentMetricValue to desiredMetricValue is between 0.9 and 1.1, HPA won't initiate any scaling action. For example, with a 70% target and a current average of 73%, the ratio is 73/70 ≈ 1.04, which falls inside that band, so the replica count stays unchanged.

After determining the desiredReplicas value, HPA updates the replica count on the Deployment object, and Kubernetes creates or removes pods to match this new desired state. This feedback loop continues indefinitely, allowing your application to automatically adjust to changing load conditions. However, to prevent rapid fluctuations, HPA observes a stabilization window when scaling down (5 minutes by default) before removing replicas; you'll see how to tune this behavior later in this chapter.

Defining HPA Scaling Policies

To define an autoscaling policy with HPA, you need to create a Kubernetes object of kind HorizontalPodAutoscaler. The current API version for this resource is autoscaling/v2. Below is a simplified YAML manifest for an HPA object, which we'll break down and explain in detail:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cpu-usage-70
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: montecarlo-pi
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      ...
    scaleUp:
      ...

Let's break down the key fields:

  • scaleTargetRef: Specifies the target resource to scale, like a Deployment.
  • minReplicas: Defines the lower bound for the number of replicas.
  • maxReplicas: Defines the upper bound for the number of replicas.
  • metrics: Specifies the metric(s) to use for scaling, like CPU utilization.
  • behavior: Defines policies for scaling up and down to prevent rapid fluctuations.

While using YAML manifests is recommended for Infrastructure as Code (IaC) practices, you can also create an HPA rule using kubectl. Here's an example command:

$ kubectl autoscale deployment montecarlo-pi \
--cpu-percent=70 --min=1 --max=10

This command creates an HPA rule for the montecarlo-pi deployment, targeting 70% CPU utilization, with a minimum of 1 replica and a maximum of 10 replicas. It's essentially equivalent to the YAML manifest you reviewed before, minus the behavior section.

Hands-On: Scaling using basic metrics with HPA

Let's put into practice what we've learned so far. Grab your computer and open the CLI.

We're going to use ApacheBench (ab), a benchmarking tool from Apache, to send load to the application. You don't need to install this tool locally, as we're going to run it within the Kubernetes cluster. To have almost real-time visibility of what's happening while you see HPA in action, we're going to use the watch command. It comes natively on Linux, but if you're using macOS or Windows, you need to install it. On macOS, simply run brew install watch. On Windows, use the Windows Subsystem for Linux (WSL) or install it with choco install watch.

Go to the book's GitHub repository folder that you've already cloned, and change to the chapter03 directory. In case you haven't cloned it yet, run these commands:

$ git clone https://github.com/PacktPublishing/Kubernetes-Autoscaling.git
$ cd Kubernetes-Autoscaling/chapter03

Deploy the sample application

Run the following command to deploy the sample application we're going to use:

$ kubectl apply -f hpa-basic/montecarlopi.yaml

It's the same application as before, but this time it comes with CPU and memory requests configured, which, as you've already learned, is important. Make sure the application is working by running the following command:

$ kubectl port-forward svc/montecarlo-pi 8080:80

Open the following URL in a browser:

http://localhost:8080/monte-carlo-pi?iterations=100000

You should see an estimated value of Pi, meaning that the application is working. You can stop the port-forward command, as it's no longer needed; we're going to use ab to send load to the application.

Create the HPA autoscaling policy

Now that the application is running, before you send some load to see autoscaling in action, let's create the autoscaling policy. Run this command:

$ kubectl apply -f hpa-basic/hpa.yaml

I'm not including the YAML manifest definition here, as it's the same one we explored before, but please open it and confirm that it's an HPA rule for the montecarlo-pi deployment, targeting 70% CPU utilization, with a range of 1 to 10 replicas. It doesn't have a scaling behavior configured yet, so it relies on the defaults for that section.

Confirm that the HPA rule is there with this command:

$ kubectl get hpa

You should see an output similar to this:

NAME    REFERENCE  TARGETS      MINPODS  MAXPODS  REPLICAS  AGE
mont... Deploy...  cpu: 0%/70%  1        10       1         41s

Look at the TARGETS column: it's using 0% of CPU initially. It might take 15 seconds until you see a number, so you might need to run the command again.
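
If you want more detail than this table view, kubectl describe shows the HPA's conditions and recent scaling events, which is handy for troubleshooting:

$ kubectl describe hpa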

Run load tests

By now, you have everything you need to scale your application automatically. So, let's send some load to see HPA in action. You have two options: expose the application (or use port-forward) and run ab locally, or, more simply, run ab within Kubernetes. Let's go with the second option.

In the chapter03 folder, there's already a YAML manifest that creates a Kubernetes Job using a container image with ab pre-installed, configured to send load to the montecarlo-pi service. Explore the YAML file and you'll see that it runs this command:

ab -n 10000 -c 10 -t 150 \
http://montecarlo-pi/monte-carlo-pi?iterations=100000

This command means that ab is going to send 10,000 requests with 10 concurrent calls, and it will run for a maximum of 150 seconds. When ab finishes, you'll get a report of how it went; we'll explore that later. So, run the following command to deploy the ab load test:

$ kubectl apply -f ab-k8s/loadtest.yaml
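
In case you're curious what such a Job looks like, here's a minimal sketch. The manifest in the repository is the source of truth; in particular, the container image shown here (httpd, which ships the ab binary) is only an assumption:

apiVersion: batch/v1
kind: Job
metadata:
  name: montecarlo-pi-load-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: ab
        image: httpd:2.4   # assumed image; it includes the ab binary
        command: ["ab", "-n", "10000", "-c", "10", "-t", "150",
                  "http://montecarlo-pi/monte-carlo-pi?iterations=100000"]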

Watch autoscaling working

Open four additional terminals with the following commands. In the first one, you're going to watch the CPU usage of all the pods by running this command:

$ watch kubectl top pods

In the second one, you're going to watch all the pods running in the cluster:

$ watch kubectl get pods

In the third one, you're going to watch the ab logs:

$ kubectl logs job/montecarlo-pi-load-test -f

In the fourth one, you're going to watch the HPA scaling rule:

$ kubectl get hpa

Pay close attention to what's happening in these new terminals. Little by little you'll see the CPU usage of the application going up (to around 3000m), and HPA starting to change the Deployment replicas to maintain a target CPU usage of 70%.

Notice what HPA reports initially in the TARGETS column. As ab sends the load all at once, the single replica's CPU usage jumps to at least 3000m, and HPA reports it at 324% usage, way more than the 70% target. However, you might notice that HPA takes a while to scale up. Why? Remember that HPA relies on the Metrics Server, which collects data every 15-30 seconds, and HPA also has its own check interval of 15 seconds by default. HPA's scaling algorithm is designed to prevent rapid fluctuations, potentially waiting to confirm persistent load increases before initiating scaling actions. If you want to speed up the scaling up process, you can change its behavior to this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: montecarlo-pi-hpa
spec:
  # ... other fields ...
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

This configuration removes the stabilization window for scaling up and allows a 100% increase in replicas every 15 seconds.

To scale down the replicas, HPA takes much more time. Why? HPA uses a 5-minute stabilization window (by default) to calculate the desired number of replicas when scaling down; scaling down too quickly could lead to service disruptions if traffic suddenly increases again. If you want to speed up the scaling down process, you can change its behavior to this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: montecarlo-pi-hpa
spec:
  # ... other fields ...
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60  # 1m instead of 5m
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15

Similar to the scale-up configuration, this one allows removing up to 100% of the current replicas every 15 seconds, but the stabilization window for scaling down is 60 seconds instead of 5 minutes.

Let's repeat the test with the above HPA configuration changes to see the application scale up much faster and scale down moderately faster. Remove the ab job first:

$ kubectl delete -f ab-k8s

Then, deploy the new HPA version. You'll find the YAML manifest in the hpa.fast.yaml file, which is essentially the same as you had before but with the behavior section. Please explore it, and then run the following command to update it:

$ kubectl apply -f hpa-basic/hpa.fast.yaml

Run the ab job to start over with the load tests:

$ kubectl apply -f ab-k8s/loadtest.yaml

Now, watch the four terminals you opened earlier to monitor the pods and HPA. You should see pods scaling up and down much faster than before, and the CPU usage reaching its target much quicker too. It's important to run these types of tests to adjust the scaling speed to your specific needs.

To clean up the environment after this hands-on lab, run these commands:

$ kubectl delete -f ab-k8s
$ kubectl delete -f hpa-basic

Now that you've seen how HPA works with CPU-based scaling, let's go a step further. CPU and memory are useful, but they don't always tell the whole story. What if your app needs to scale based on how many requests it's receiving, how long a queue is, or even business-specific metrics like active sessions?

In the next section, we'll explore how to extend HPA to use custom metrics for a more tailored scaling decision.

HPA and Custom Metrics

You now know the basics of using HPA with CPU and memory metrics. But these two metrics may not always capture the full picture of your application's scaling needs. For some applications, scaling behavior might be better determined by other metrics such as latency, number of requests, or queue length.

But how can HPA scale using other metrics? By using custom or external metrics. Kubernetes exposes different types of metrics to HPA through the following APIs:

  • Resource Metrics API: It's the core API that provides CPU and memory usage for pods and nodes, and it's primarily used by HPA, as you've seen before in Figure 3.1.
  • Custom Metrics API: It allows for the exposure of application-specific metrics. It's extensible, enabling you to define and use metrics that are tailored to your specific applications. We're going to see this in action later in this chapter.
  • External Metrics API: It provides access to metrics from sources outside the cluster, such as cloud provider metrics. We won't use this API directly in this book, but KEDA, covered starting in Chapter 4, builds on it.

The Metrics API acts as an abstraction layer, providing a standardized way for Kubernetes components to request and receive metric data, regardless of the underlying metrics collection system, as you've already seen with HPA consuming the metrics provided by the Metrics Server. This abstraction is what allows Kubernetes to work with various monitoring solutions, including Prometheus, without needing to understand the specifics of each system.

We've already explored how to use the Resource Metrics API, fed by the Metrics Server component. Now, we'll explore how to make HPA use the Custom Metrics API. As for the External Metrics API, we'll cover how HPA can use it through the Kubernetes Event-Driven Autoscaling (KEDA) project in Chapter 4.
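
As a quick sanity check of what these APIs serve, you can query them directly from the command line; for example, the Resource Metrics API backed by the Metrics Server you installed earlier:

$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"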

How does HPA work with custom metrics?

For HPA to use custom metrics, you need two key components:

  • A way to collect and store these metrics
  • An adapter that can expose these metrics through the Custom Metrics API

You might be able to achieve this with your existing monitoring system. As shown in Figure 3.2, a common approach is to integrate Prometheus with the Prometheus Adapter. In this setup, Prometheus serves as the metrics collector and storage solution. The Prometheus Adapter then acts as a bridge, implementing the Custom Metrics API and exposing the data collected by Prometheus through this standardized interface. This combination allows HPA to make scaling decisions based on custom metrics other than CPU and memory, such as latency or number of requests.

Figure 3.2 – How HPA collects custom metrics from Prometheus

Let's get into the details of the flow you saw in Figure 3.2. The Prometheus Adapter registers itself with the Kubernetes API server as an API Service. This registration tells Kubernetes that the adapter can handle requests for custom metrics. The adapter is then configured with rules that define how to translate Prometheus metrics into Kubernetes custom metrics, specifying which Prometheus metrics to expose and how to name them in the Kubernetes API.

When HPA requests a metric, the API server forwards the request to the Prometheus Adapter. The adapter translates it into a Prometheus query, executes it against the Prometheus server, and then translates the results back into the format expected by Kubernetes. The adapter exposes these metrics at an API endpoint that follows the Kubernetes Custom Metrics API format. For example, a metric might be accessible at a path like /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/monte_carlo_latency_seconds. HPA can then be configured to use these custom metrics in its scaling decisions. Let's see that in action.
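
Once the adapter is installed (you'll do that in the next hands-on section), you can confirm that this API Service registration exists with a command like this:

$ kubectl get apiservice v1beta1.custom.metrics.k8s.io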

Hands-On: Scaling using custom metrics with HPA

Before you get started, make sure to have the Prometheus stack running. You already did this in the previous chapter, but just in case you need to install the stack again, here's the command to do it:

$ helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

Confirm that all the pods are running using this command:

$ kubectl get pods -n monitoring

Deploy the Prometheus Adapter

As you can see, this stack doesn't come with the Prometheus Adapter; you need to install it separately. Go to the chapter03 folder of the book's GitHub repository. There you'll find the prometheus-adapter/values.yaml file with the configuration needed to use the Prometheus service to collect metrics, plus the query to expose the custom metric from the application. Now, simply run the following command:

$ helm install prometheus-adapter \
prometheus-community/prometheus-adapter \
  --namespace monitoring \
  -f prometheus-adapter/values.yaml

Explore the configuration in the values.yaml file. It instructs the Prometheus Adapter to look for histogram metrics named monte_carlo_latency_seconds_bucket with non-empty namespace and pod labels. It then transforms these metrics into a custom Kubernetes metric named monte_carlo_latency_seconds, calculating the 95th percentile of the latency over a 2-minute window using the histogram_quantile function. This allows HPA to use this derived latency metric for scaling decisions, based on the recent performance of your Monte Carlo simulation pods.

Also explore the application source code at chapter02/src/main.go, which uses the Prometheus client library to expose the monte_carlo_latency_seconds metric as a Histogram, tracking the latency with monteCarloLatency.Observe(duration.Seconds()). A histogram is used because Prometheus scrapes these metrics every 15 seconds, and you don't want to lose data points recorded since the last scrape.

Note

We're adding this additional code so that you can see how to instrument an application to expose metrics. However, if you look closely at the /metrics endpoint, you'll find the http_request_duration_seconds_bucket metric that serves a similar purpose to the one we added. This is because the promhttp library already exposes a common set of metrics.
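
To give you an idea of what the adapter configuration described above looks like, here's a sketch of a rules section for the Prometheus Adapter. The repository's values.yaml is the source of truth; the exact queries and label handling there may differ:

rules:
  custom:
  - seriesQuery: 'monte_carlo_latency_seconds_bucket{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^monte_carlo_latency_seconds_bucket$"
      as: "monte_carlo_latency_seconds"
    metricsQuery: 'histogram_quantile(0.95, sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (le, <<.GroupBy>>))'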

Deploy the sample application

Go back to the chapter03 folder, and re-deploy the Monte Carlo application:

$ kubectl apply -f hpa-custom/montecarlopi.yaml

Make sure the application is working by running the following command:

$ kubectl port-forward svc/montecarlo-pi 8080:80

Open the following URL in a browser:

http://localhost:8080/monte-carlo-pi?iterations=100000

Then, try the /metrics endpoint:

http://localhost:8080/metrics

You'll see an output similar to this one:

# TYPE monte_carlo_latency_seconds histogram
monte_carlo_latency_seconds_bucket{le="0.001"} 0
monte_carlo_latency_seconds_bucket{le="0.005"} 1
monte_carlo_latency_seconds_bucket{le="0.01"} 1
...
monte_carlo_latency_seconds_count 11

Deploy the Service Monitor

You now need a way to tell Prometheus to scrape the metrics from the application. To do so, you create a ServiceMonitor, a custom resource where you configure which service endpoint to scrape. The manifest definition looks like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: montecarlo-pi
  labels:
    app: montecarlo-pi
    release: prometheus
spec:
  selector:
    matchLabels:
      app: montecarlo-pi
  endpoints:
  - port: http

Deploy the service monitor using the following command:

$ kubectl apply -f hpa-custom/monitor.yaml

Wait around 30 seconds to confirm you can see the new custom metric:

$ kubectl get --raw \
"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/"\
"monte_carlo_latency_seconds"

You should see an output similar to this one:

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "montecarlo-pi-574cc7ff9f-h77wp",
        "apiVersion": "/v1"
      },
      "metricName": "monte_carlo_latency_seconds",
      "timestamp": "2024-10-08T20:57:08Z",
      "value": "0",
      "selector": null
    }
  ]
}

You might not see any value yet, but that's fine; you'll generate some load later.

Create the HPA autoscaling policy

You're going to deploy the following HPA rule that uses the application's custom metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: montecarlo-pi-latency-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: montecarlo-pi
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: monte_carlo_latency_seconds
      target:
        type: AverageValue
        averageValue: 500m

This HPA rule is configured to automatically scale the application based on the custom metric monte_carlo_latency_seconds. It aims to maintain an average latency of 500 milliseconds (500m) across all pods, scaling between 1 and 10 replicas as needed. The rule uses the Pods metric type and the AverageValue target type, indicating that it works with a per-pod metric and tries to keep the average value across all pods at or below the specified threshold.
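
Apply the rule using the corresponding manifest in the hpa-custom folder. The exact file name below is an assumption based on the naming used in the hpa-basic folder; check the folder contents if it differs:

$ kubectl apply -f hpa-custom/hpa.yaml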

Run load tests and see HPA in action

Run the following command to send some load to the application for a few minutes:

$ kubectl apply -f ab-k8s/loadtest.yaml

Open three additional terminals with the following commands. In the first one, you're going to watch the CPU usage of all the pods by running this command:

$ watch kubectl top pods

In the second one, you're going to watch all the pods running in the cluster:

$ watch kubectl get pods

In the third one, you're going to watch the HPA scaling rule:

$ kubectl get hpa

Little by little you'll see that the CPU usage of the application goes up, but not beyond 1200m, contrary to what you saw in the previous hands-on lab. This is because the new version of the Deployment has limits configured, like this:

resources:
  requests:
    cpu: 900m
    memory: 512Mi
  limits:
    cpu: 1200m
    memory: 512Mi

This means that Kubernetes won't let the pods consume more than 1200m; instead, it will start throttling the container. As a result, the latency goes up, and that's why you see HPA changing the Deployment replicas to maintain a target latency of 500m (0.5 seconds). As before, when the latency goes back toward 0, HPA starts removing replicas. You can speed up this process in the same way you did in the previous lab.

To clean up the environment after this hands-on lab, run these commands:

$ kubectl delete -f ab-k8s
$ kubectl delete -f hpa-custom

So far, you've explored how HPA can scale your application horizontally by adding more replicas based on resource usage or even custom metrics like latency. But what if scaling out isn't the best option? Some workloads might benefit more from having bigger pods rather than more pods, even if only temporarily while you work on making the workload scale horizontally.

In the next section, we'll dive into the Vertical Pod Autoscaler (VPA), a tool that helps you rightsize your pods by adjusting CPU and memory requests and limits. Unlike HPA, it focuses on optimizing the size of each pod instead of increasing the number of replicas. Let's take a closer look at how it works and when you should consider using it.

Vertical Pod Autoscaler: Basics

I introduced you to VPA in the previous chapter. Its primary function is to analyze and adjust—or provide recommendations for—the CPU and memory requirements of containers running in a pod. VPA doesn't add more replicas; it simply adjusts (or recommends) the size of a pod. VPA can be applied to vertically scale various Kubernetes objects including Deployments, ReplicaSets, StatefulSets, DaemonSets, and even individual pods. So, how does VPA make the decision to adjust the size of containers in a pod? Let's see.

How VPA scale resources?

VPA works with three different controllers you deploy in the cluster:

  • Recommender: this controller continuously monitors the resource usage of pods and containers, analyzing historical data to generate optimal resource recommendations. Then, it calculates the ideal CPU and memory requests for each container, taking into account factors like usage patterns, application behavior, and defined constraints.
  • Updater: this controller is responsible for applying the recommendations generated by the Recommender. It decides when and how to update the resource requests of running pods. Depending on the VPA mode (Auto, Recreate, or Initial), the Updater may evict pods to apply new resource settings or wait for pods to be naturally recreated during deployments or restarts.
  • Admission Controller: this controller intercepts pod creation requests. When a new pod is about to be created, it checks if there's a VPA policy applicable to that pod. If so, it modifies the pod's resource requests according to the latest recommendations before the pod is actually created. This ensures that even newly created pods start with optimized resource settings.

As shown in Figure 3.3, VPA starts by gathering CPU and memory usage data for the pods in the cluster via the Metrics Server. This is an ongoing process, which allows VPA to understand an application's resource needs over time. VPA then generates recommendations for optimal CPU and memory allocations for each container within a pod. These recommendations aim to strike a balance between ensuring the application has enough resources to perform well and avoiding over-provisioning that could lead to wasted resources.

Figure 3.3 – How VPA scales workloads

VPA can then automatically apply these recommendations: it terminates the pod, and through its Admission Controller it modifies the new pod's spec before it is persisted. Alternatively, it can simply provide the recommendations without taking action. VPA respects configured resource policies, such as minimum and maximum limits, and it also considers factors like Out of Memory (OOM) events and CPU throttling when making recommendations.

Lastly, keep in mind that using VPA alongside HPA for CPU and memory scaling can lead to conflicts, as both tools attempt to solve the same problem in different ways. However, VPA can work well with HPA when the latter scales based on custom or external metrics.
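
Once you have a VerticalPodAutoscaler object in the cluster (you'll define one in the next section), you can inspect the Recommender's output at any time; the recommendations appear under the object's Status:

$ kubectl describe vpa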

Defining VPA Scaling Policies

Now let's look at how to define a VPA scaling policy. As VPA is not a native Kubernetes feature, you define autoscaling policies through a custom resource definition (CRD) called VerticalPodAutoscaler. Here's a simple VPA rule:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: montecarlo-pi-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: montecarlo-pi
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 512Mi
      maxAllowed:
        cpu: 1200m
        memory: 512Mi
      controlledResources: ["cpu", "memory"]

In this example, VPA will automatically adjust CPU and memory for all containers in the montecarlo-pi Deployment. Let's break down each section:

  • targetRef indicates which Kubernetes object this VPA should manage. In this case, it's a Deployment named montecarlo-pi.
  • updatePolicy section defines how VPA should apply changes, and there are four modes you can configure:
    • Auto: assigns resource requests when pods are created, or evicts pods to update the resources of existing pods. When the in-place update feature (AEP-4016) becomes generally available (GA) in Kubernetes (it's currently in alpha), VPA will apply recommendations without disruption, and Auto will become the preferred mode.
    • Recreate: applies recommendations at creation time, or recreates existing pods. For now, it behaves the same as Auto because in-place updates are not GA yet. Going forward, it will be recommended only if you genuinely need pods to be recreated when recommendations are applied.
    • Initial: applies recommendations only when pods are created.
    • Off: recommendations can only be seen by inspecting the VPA object.
  • resourcePolicy section defines the rules, such as boundaries, that VPA needs to respect:
    • containerPolicies allows you to set policies for specific containers or for all of them (see the sketch after this list for excluding a container).
    • minAllowed and maxAllowed set the resource boundaries to respect.
    • controlledResources specifies which resources VPA should manage.
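
For instance, if one of the containers in the pod, say a sidecar, shouldn't be resized, you can opt it out of VPA's management. Here's a minimal sketch, assuming a hypothetical sidecar container named log-shipper:

  resourcePolicy:
    containerPolicies:
    - containerName: 'log-shipper'     # hypothetical sidecar; VPA leaves it untouched
      mode: "Off"
    - containerName: '*'               # every other container is still managed
      controlledResources: ["cpu", "memory"]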

This basic policy provides a good starting point, but remember that you can refine these settings based on your application's specific needs and your cluster's resources. You might set different limits for different containers, or create Pod Disruption Budgets (PDBs) so that VPA is less aggressive when you're dealing with a critical application.

If you don't want to apply recommendations automatically, you can use the Initial mode to still get the benefits of VPA, but apply them the next time a deployment or upgrade happens, as this will create new pods. Alternatively, you can simply use Off mode to get recommendations from VPA, but you'll own the decision of whether to apply them. You can also exclude containers that don't need to scale by using Off under the containerPolicies section, as sketched above.

It's worth mentioning that VPA will update resources only if there is more than one replica. You can change this behavior either for all VPA rules within the VPA updater component using the min-replicas parameter, or by adding a minReplicas value to the corresponding VPA rule, like this:

...
updatePolicy:
    updateMode: 'Auto'
    minReplicas: 1
...

The recommendation is to configure this value at the VPA rule level, as there might be workloads for which you'd prefer VPA not to perform any action, to avoid causing downtime. Now that you understand how to configure VPA policies and what each field does, it's time to see it in action. In the next section, you'll use VPA in Auto mode to dynamically adjust resource allocations based on real usage.

Hands-On: Automatic Vertical Scaling with VPA

To watch VPA in action, we're going to start using VPA in Auto mode to adjust the resources of pods (vertical scaling) without any human intervention.

Deploy VPA components

You first need to deploy the VPA components. To do so, run the following commands in a new terminal outside of this book's repository:

$ git clone https://github.com/kubernetes/autoscaler.git
$ cd autoscaler/vertical-pod-autoscaler
$ ./pkg/admission-controller/gencerts.sh
$ ./hack/vpa-up.sh

Confirm that the VPA pods are up and running:

$ kubectl get pods -n kube-system | grep vpa

Wait until you see the recommender, the updater, and the admission pods running. Then, close this terminal, and go back to the one you're using for the book's repository.

Deploy the sample application

Make sure you're in the chapter03 folder from the book's GitHub repository, and run the following command to create the VPA rule:

$ kubectl apply -f vpa/vpa.auto.yaml

Explore the VPA rule and you'll see that it targets the Monte Carlo deployment you're about to deploy, and that it defines minimum and maximum values for CPU and memory. This is recommended so that you set boundaries within which VPA can act and you don't let the application scale indefinitely. Don't worry about getting these numbers right at the beginning, especially if you have no idea how many resources the application needs; you can tune them later once you know more about your application. Let's continue using the Monte Carlo application; deploy it using this command:

$ kubectl apply -f vpa/montecarlopi.yaml

Notice that the montecarlo-pi deployment doesn't have any resource requests or limits configured. This is so you can see how VPA does that job for you, based on the CPU and memory data it collects from the Metrics Server or, initially, on VPA's own default configuration. Open a new terminal to monitor how the VPA rule acts:

$ watch kubectl get vpa

Wait around one minute, and you should see something like this:

NAME            MODE   CPU    MEM       PROVIDED   AGE
montecarlo...   Auto   100m   262144k   True       60s

Describe the VPA rule to learn what recommendations it's giving you by running this command:

$ kubectl describe vpa montecarlo-pi-vpa-auto

You'll see something similar to this output:

...
  Recommendation:
    Container Recommendations:
      Container Name:  montecarlo-pi
      Lower Bound:
        Cpu:     100m
        Memory:  262144k
      Target:
        Cpu:     100m
        Memory:  262144k
      Uncapped Target:
        Cpu:     25m
        Memory:  262144k
      Upper Bound:
        Cpu:     100m
        Memory:  262144k
...

Notice that it has four groups of recommendations:

  • Lower Bound is the minimum resource values to apply.
  • Target is the values VPA will assign to the container when the pod is created.
  • Uncapped Target is the values VPA would use if there were no minAllowed or maxAllowed constraints in the VPA rule.
  • Upper Bound is the maximum resource values to apply. (A command for querying these values directly follows this list.)
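
If you prefer to pull these values programmatically, for example for scripting or dashboards, a jsonpath query over the VPA's status should do it. A minimal sketch, assuming the VPA rule name used in this lab:

$ kubectl get vpa montecarlo-pi-vpa-auto \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'

This prints the Target CPU and memory values for the first container tracked by the rule.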

Open a new terminal to watch the pods and see the changes VPA will make:

$ watch kubectl get pods

After two minutes or so, you'll see the pods start being re-created. If you describe a new pod, you should see that it now has requests assigned:

...
    resources:
      requests:
        cpu: 100m
        memory: 262144k
...

The pod now has 100m for CPU, as that's the minimum allowed configured by the VPA rule. For the memory request, however, the pod has 262144k (256Mi). But the VPA rule has a minimum of 50Mi, right? Well, VPA assigned the default minimum from the VPA recommender pod, which is 256Mi. The recommender's default minimum takes precedence when it's higher than what the VPA rule has configured. You can change this with the pod-recommendation-min-memory-mb parameter on the VPA recommender deployment.
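
As a hedged sketch, lowering that floor means adding the flag to the recommender's container arguments; the deployment and container names below assume the default vpa-up.sh installation, so verify them in your cluster first:

# kubectl -n kube-system edit deployment vpa-recommender
spec:
  template:
    spec:
      containers:
      - name: recommender
        args:
        - --pod-recommendation-min-memory-mb=64   # lower the recommender's memory floor (in MB)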

Run load tests and see VPA in action

Now, run the following command to send load to the application for a few minutes:

$ kubectl apply -f ab-k8s/loadtest.yaml

You should see that the VPA shows a different CPU value now, something like this:

NAME            MODE   CPU     MEM       PROVIDED   AGE
montecarlo...   Auto   1554m   262144k   True       6m

However, this time you won't see pods being re-created automatically, and if you check the VPA recommender logs, you'll see something like not updating a short-lived pod. This is because, by default, VPA waits at least 12 hours before evicting pods. You can configure this with the in-recommendation-bounds-eviction-lifetime-threshold parameter on the VPA updater deployment. You probably don't want to wait that long to see it in action, so let's force a rollout:

$ kubectl rollout restart deployment montecarlo-pi

Describe one of the new pods. You should see that it has requests assigned, like this:

...
    resources:
      requests:
        cpu: 1554m
        memory: 262144k
...

You might see that one pod replica is still Pending. Check the pod events by describing the pod and you'll see Kubernetes reporting that it couldn't find a node available to run the pod. This can be addressed by scaling the data plane with projects like Cluster Autoscaler and Karpenter, but we'll cover that in Chapter 7. To clean up the environment after this hands-on lab, run these commands:

$ kubectl delete -f ab-k8s
$ kubectl delete -f vpa

You've now seen VPA in action, making recommendations and adjusting resource requests based on actual usage. A common question I hear is: can you use both HPA and VPA together? The answer is yes, with some important caveats. In the next section, we'll explore how HPA and VPA can complement each other, what limitations you need to be aware of, and how to configure them to avoid conflicts.

How to work with HPA and VPA together?

In the previous hands-on lab, you might have noticed that when you forced a new rollout to re-create the pods, VPA assigned CPU and memory resources based on what it knew at that moment. However, if you wait until the load test finishes and the application pods are no longer receiving traffic, the CPU usage drops to just a few millicores. You can confirm this with the kubectl top command.

VPA takes a more conservative approach compared to HPA. This is because applications that need to scale vertically usually don't have very spiky behavior, and their load tends to be more predictable. Moreover, when VPA re-creates a pod, it could potentially cause downtime for your application, especially if you're only working with one replica.

Another important aspect is that VPA and HPA perform different actions on your application, and if you use the same metric (e.g., CPU) to define autoscaling rules, you might end up with a race condition. In this case, if CPU utilization goes beyond 70%, HPA will add more replicas, but by the time new pods are created, VPA might have decided to recommend higher CPU requests. Consequently, your level of efficiency could be very low, and you might end up with constant fluctuations within your application.

Before you conclude that these two can't work together, let me share a few setups I've observed from different organizations that are using HPA and VPA in conjunction:

  • Use HPA to scale workloads horizontally, and only use VPA rules in Off mode to get resource request recommendations for CPU and memory. Then, use these recommendations to manually adjust the workloads and let HPA scale automatically (a sketch of this setup follows this list).
  • Use only HPA to scale workloads horizontally. To rightsize the applications, use a monitoring stack like Prometheus + Grafana, Kubecost, or any other third-party tool that companies are already using. There are also some paid third-party tools that can perform rightsizing either automatically or with a one-click operation.
  • Use only VPA for applications that can't scale horizontally, either because they're legacy applications or adding a new replica is complex, costly, and time-consuming (e.g., due to data replication). In pre-production environments, VPA is usually configured in Auto mode, and for production environments in Off mode (to plan updates accordingly).
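
As a reference for the first setup, here's a minimal sketch of a recommendation-only VPA rule; the names reuse this chapter's Monte Carlo example, and the HPA rule scaling the same Deployment on a custom or external metric would live alongside it:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: montecarlo-pi-vpa-recommend-only
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: montecarlo-pi
  updatePolicy:
    updateMode: "Off"   # only populate recommendations; never evict or resize pods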

I've seen these options I just described being used for different types of workloads within the same company or project. There's no right or wrong answer; it's a matter of which setup works best for your application. As long as you make informed decisions based on data and try to automate as much as possible, you'll be taking advantage of HPA and VPA features. In the next two chapters, you'll learn about a project that takes a different approach to application workload autoscaling: KEDA.

Summary

In this chapter, we've explored two Kubernetes autoscaling tools: Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA). We've seen how HPA dynamically adjusts the number of pod replicas based on observed metrics, primarily CPU and memory usage, but also custom and external metrics when properly configured. We've also delved into VPA's capabilities for automatically adjusting resource requests and limits for containers, optimizing resource allocation within pods.

We've walked through practical examples of implementing both HPA and VPA, discussing their configuration options, best practices, and potential pitfalls. We've seen how these tools can significantly improve resource utilization and application performance when properly implemented.

Throughout our exploration, we've emphasized the importance of making data-driven decisions when configuring autoscaling policies. We've discussed various approaches to combining HPA and VPA, or using them separately, depending on the specific needs and constraints of different applications and environments. The key takeaway is that there's no one-size-fits-all solution; the best autoscaling strategy depends on your specific use case and workload characteristics.

As we conclude this chapter, it's clear that while HPA and VPA are powerful tools, they may not cover all autoscaling scenarios in complex, modern applications. This brings us to our next chapter, where we'll explore KEDA (Kubernetes Event-Driven Autoscaling). In the upcoming chapters, we'll dive deep into KEDA's concepts, its integration with Kubernetes, and how it can complement or even replace traditional autoscaling methods in certain scenarios.

4 Kubernetes Event-Driven Autoscaling (KEDA) – Part 1

In the previous chapter, we dived into autoscaling workloads using HPA and VPA. These solutions provide a very good starting point if your scaling needs are simply about CPU and memory, but things get complicated when you need other types of metrics, especially for event-driven architectures. This is where Kubernetes Event-Driven Autoscaling (KEDA) comes into play.

This chapter starts our deep dive into KEDA. We'll begin by understanding the need for KEDA and how it simplifies working with HPA. Then, we'll delve into KEDA's architecture and components, giving you a good understanding of how it works. After that, we'll cover KEDA's Custom Resource Definitions (CRDs) and practice using them.

Moreover, we'll explore the wide array of event sources and scalers supported by KEDA, demonstrating its versatility in handling various workloads. You'll learn how to scale different Kubernetes resources, including Deployments, Jobs, and StatefulSets, with hands-on examples using metrics like latency and message queue length.

We'll also introduce you to KEDA's HTTP Add-on, a feature for scaling HTTP-based applications. By the end of this chapter, you'll have a good understanding of KEDA, how it works, its capabilities, its limitations, and how to leverage it to build more responsive and efficient Kubernetes applications. Let's dive in and explore KEDA!

In this chapter, we'll cover the following topics:

  • KEDA: What It Is and Why You Need It
  • KEDA's architecture
  • KEDA scalers
  • KEDA CRDs
  • Scaling deployments
  • Scaling jobs

Technical requirements

For this chapter, you'll continue using the Kubernetes cluster you created in Chapter 1. You can keep working with the local Kubernetes cluster; there's no need yet to work with a cluster in AWS. If you tore down the cluster, make sure you bring it up for every hands-on lab in this chapter. You don't need to install any additional tools, as most of the commands are run using kubectl and helm. You can find the YAML manifests for all the resources you're going to create in this chapter in the chapter04 folder of the book's GitHub repository: https://github.com/PacktPublishing/Kubernetes-Autoscaling.

KEDA: What it is and why you need it

Modern, complex, and distributed applications are not affected solely by CPU and memory utilization. These two metrics might only tell part of the story of what's impacting your application's performance, which ultimately determines why you might need more (or fewer) replicas for your applications to work as expected. But didn't we learn about using metrics other than CPU and memory in the previous chapter? It's possible to use custom or external metrics with HPA, and you even practiced it, right? So, what exactly is KEDA, and why do we need it?

KEDA is an event-driven autoscaler for Kubernetes workloads. It started as a joint collaboration between Microsoft and Red Hat in 2019. In 2020, KEDA was donated to the Cloud Native Computing Foundation (CNCF), making it vendor-neutral. More recently, in 2023, KEDA graduated as a CNCF project, which tells you something about its importance in the Kubernetes ecosystem.

The primary purpose of KEDA is to simplify the process of working with custom and external metrics for scaling your applications. As you saw in the previous chapter, for HPA to work with custom metrics, you need an adapter component to translate those metrics so that Kubernetes can understand them. While you might not have issues working with this added complexity, it's important to remember that each component or layer between your metrics source and HPA rules can introduce latency, potentially causing your autoscaling rules to react more slowly than desired.

KEDA's advantage lies in its ability to interact directly with the metrics source and expose those metrics to the External Metrics API. It works in tandem with HPA, acting as an abstraction layer that simplifies the setup for using external metrics in HPA, without even requiring you to rewrite your application. This direct interaction can lead to faster and more efficient scaling decisions. However, it's recommended not to combine KEDA and HPA to scale the same workload, to avoid race conditions.

Moreover, KEDA is significantly more extensive and flexible than HPA alone. It supports a wide range of metric sources (or scalers), allowing you to scale your applications based on various event sources such as message queues, databases, and monitoring systems like Prometheus. More importantly, KEDA has the ability to scale from/to zero for most of its scalers, helping you achieve cost-effective resource utilization for applications with intermittent workloads.

Note

It's worth mentioning that since Kubernetes 1.16, you can enable an alpha feature gate called HPAScaleToZero, which allows HPA to scale to zero. However, most cloud providers don't enable it by default. It's also important to note that this feature has remained in alpha for several versions, indicating that it may not be stable or widely adopted yet.
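
For completeness, on a self-managed control plane the gate is enabled through the API server's feature-gates flag. This is only a hedged sketch: the manifest path below assumes a kubeadm-style static pod layout, and managed Kubernetes offerings generally don't expose this flag at all:

# /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm-style layout; your path may differ)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --feature-gates=HPAScaleToZero=true   # plus the rest of your existing flags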

KEDA's architecture

Before you learn how to use KEDA, it's important to understand KEDA's architecture and components. At its core, KEDA consists of three main components: the KEDA Operator, the KEDA Metrics Server, and an Admission Webhooks controller. The purpose of these three components is to provide you with event-driven autoscaling for your workloads. The following figure illustrates the high-level architecture of KEDA and its interaction with your workloads:

Figure 4.1 – KEDA's architecture components

Let me explain further what you see in Figure 4.1. When you need to configure an autoscaling rule using KEDA for a workload, you first create a rule using either a ScaledObject or a ScaledJob CRD, which I'll cover later in this chapter. Based on this, KEDA creates an HPA rule, and it uses KEDA's metrics server to trigger scaling events. HPA continuously monitors the defined event sources to calculate how many replicas a workload needs. Based on this, HPA will scale down to 1 replica (or the minimum number configured), and if zero replicas are needed (and the rule allows it), KEDA will update the target resource directly to have zero replicas. Conversely, if KEDA determines that replicas are needed while the workload is at zero, it scales the target to 1 first to activate the trigger, and then delegates to HPA to scale to the required number of replicas. In other words, KEDA is responsible for scaling from 0 to 1 and from 1 to 0, while HPA is responsible for scaling from 1 to N and from N to 1 replica. Throughout this autoscaling process, the Admission Webhooks ensure that all KEDA-related operations adhere to best practices and prevent potential conflicts, such as having two ScaledObjects managing the same workload. Let's explore each component and its role in the KEDA ecosystem.

KEDA Operator

The KEDA Operator is responsible for monitoring your cluster for KEDA-specific CRDs such as ScaledObjects and ScaledJobs. When it detects these custom resources, it starts to work by creating and managing the necessary HPA rules based on the scaling rules defined in these objects. Moreover, the KEDA Operator interacts directly with external event sources. Whether it's a message queue, a database, or a custom metric endpoint, the Operator can query these sources directly to determine the current scaling needs. As I said before, this direct interaction eliminates the need for intermediate adapters, reducing latency and improving the responsiveness of your autoscaling setup.

KEDA Metrics Server

The KEDA Metrics Server works alongside the Operator. This component acts as a bridge between KEDA and Kubernetes' autoscaling mechanisms. It exposes the metrics collected by the KEDA Operator to the Kubernetes Metrics API as external metrics, making them available for consumption by HPAs.

The Metrics Server is what allows KEDA to act as an abstraction layer for HPA. It translates the rich, event-driven metrics that KEDA understands into a format that Kubernetes' native autoscaling can work with. In other words, it's the adapter. This integration means you can leverage KEDA's scaling capabilities without having to modify your existing Kubernetes setup or your applications.

Admission Webhooks

These webhooks serve as a crucial safeguard in the KEDA ecosystem, automatically validating resource changes to prevent misconfiguration and enforce best practices. They function as an admission controller, intercepting requests to the Kubernetes API server before the object is persisted.

One of the primary functions of KEDA's Admission Webhooks is to prevent multiple ScaledObjects from targeting the same scale target. This is crucial for maintaining the integrity and predictability of your scaling setup. The Admission Webhooks are enabled by default when you install KEDA, providing an additional layer of safety and consistency in your Kubernetes cluster(s).

What Kubernetes objects can KEDA scale?

KEDA can scale the following resource types:

  • Deployments: The most common use case, perfect for stateless applications.
  • StatefulSets: Ideal for applications that require stable, unique network identifiers or persistent storage like databases.
  • Custom Resources: KEDA can scale custom resources defined by Operators, opening up possibilities for scaling complex, application-specific workloads, like ArgoCD objects.

Let's dive deeper into the specific CRDs that KEDA introduces and how you can leverage them to define sophisticated scaling behaviors for your applications.

KEDA Scalers

One of the most important features that makes KEDA shine as a Kubernetes workload autoscaler is its event sources and scalers. These components are essentially the reason why KEDA is able to respond to a wide variety of external events, including basic ones like CPU and memory utilization, meaning that you won't need to use both HPA and KEDA; you can transition to using only KEDA. For instance, a CPU trigger in KEDA looks like this:

triggers:
- type: cpu
  metricType: Utilization # Or 'AverageValue'
  metadata:
    value: "70"

For a memory scaler, simply change the type parameter to memory. Pretty similar to HPA, right? But KEDA goes far beyond these two metrics, allowing your applications to scale dynamically based on a diverse range of events and metrics, often working in conjunction with one another. Remember that KEDA continuously monitors sources: when you create a KEDA scaling rule, it monitors the configured sources for the specific triggers that indicate a need for scaling.
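
As a hedged illustration, here's what the memory variant looks like when it sits next to the CPU trigger in the same rule; KEDA evaluates each trigger and, as we'll see shortly, acts on whichever one demands more replicas:

triggers:
- type: cpu
  metricType: Utilization
  metadata:
    value: "70"
- type: memory
  metricType: Utilization
  metadata:
    value: "80"   # assumption: pick the utilization target that fits your workload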

Note

KEDA supports 70+ scalers, and keeps growing thanks to the community behind the project, covering a vast landscape of technologies, platforms, and cloud providers.

You can configure multiple triggers in the same KEDA rule. With this, KEDA will start scaling as soon as one of the specified triggers meets its criteria. It then calculates the desired number of replicas for each trigger and uses the highest number.

These scalers range from simple CPU and memory metrics to complex, application-specific indicators. For instance, you can scale based on the length of a message queue, the number of unprocessed items in a database, or even custom metrics exposed by your application. Each scaler is responsible for querying its associated event source at regular intervals. When the scaler detects that a predefined condition has been met - such as a queue length exceeding a certain threshold - it communicates this information back to the KEDA Operator. The Operator then uses this data to adjust the scaling of your application, either by creating new pods or removing unnecessary ones.

For scenarios where you need even more flexibility or have stringent security requirements, KEDA offers the concept of external scalers. These are standalone services that implement KEDA's External Scaler API, allowing you to keep sensitive scaling logic or credentials outside of your Kubernetes cluster. External scalers can be particularly useful when dealing with proprietary systems or when you need to implement complex scaling logic that doesn't fit well within the constraints of a built-in scaler. We'll dive deeper into external scalers in Chapter 5.

Let's now explore how to use KEDA with its CRDs.

KEDA CRDs

KEDA CRDs allow you to define and manage scaling behaviors in a Kubernetes-native way, extending the cluster's API to include KEDA-specific resources. Let's explore each of these CRDs and how they contribute to KEDA's scaling capabilities.

ScaledObjects

ScaledObjects define how a Deployment, StatefulSet, or custom resource should scale using triggers. Based on the event type configured for the trigger, KEDA uses that scaler to monitor its events or metrics and exposes them so that HPA can scale out or scale in.

One of the key features of ScaledObjects is the ability to pause autoscaling, which can be particularly useful during maintenance windows or when you need to temporarily override KEDA's scaling decisions. We'll dive deeper into more advanced features in Chapter 5. Here's an example of a ScaledObject that scales a deployment based on the length of a RabbitMQ queue:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-app-scaler
spec:
  scaleTargetRef:
    name: consumer-app
  pollingInterval: 30
  cooldownPeriod:  300
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      queueName: orders
      mode: QueueLength
      value: "20"
    authenticationRef:
      name: consumer-app-trigger

This ScaledObject will scale the consumer-app deployment based on the length of the orders queue in RabbitMQ. KEDA checks the queue every 30 seconds (pollingInterval), the application consumes events from the queue, and KEDA can scale the deployment up to 30 replicas (maxReplicaCount). When there are no messages left, KEDA will scale the deployment down to zero replicas (minReplicaCount). Now, let's break down the parameters from the previous ScaledObject example:

  • scaleTargetRef is the resource KEDA will scale. If the target is a Deployment, you simply specify the name, but if the target is a StatefulSet or a custom resource, you also need to specify its apiVersion and kind (see the sketch after this list).
  • pollingInterval is the interval KEDA uses to check the trigger source. By default, the interval is 30 seconds.
  • cooldownPeriod is the interval KEDA waits after the last trigger was active before scaling to zero. By default, it's 300 seconds.
  • minReplicaCount is the minimum number of replicas KEDA will scale the resource down to. By default, the minimum is zero.
  • maxReplicaCount is the maximum number of replicas KEDA will scale the resource up to. By default, the maximum is 100.
  • triggers is where you define the properties of the scalers to use.
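
To make the scaleTargetRef point concrete, here's a minimal sketch of targeting a StatefulSet instead of a Deployment; the workload name is hypothetical:

spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: orders-worker   # hypothetical StatefulSet; apiVersion and kind are required here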

The ScaledObject CRD has many more parameters and properties that we'll continue exploring throughout this chapter and the next one. For now, you have at least an idea of what a basic configuration looks like. As we progress, you'll learn how to leverage these additional features to create more sophisticated and tailored scaling solutions for different use cases.

ScaledJobs

ScaledJobs are designed for batch jobs or tasks that need to run to completion. They create Kubernetes Jobs based on scaling rules. Instead of processing multiple events within a Deployment, KEDA creates one Job per event and processes it. When the Job finishes processing the event, or fails because there was an error, the Job terminates. It's up to the process (the application, in this case) how many events it will pull down, and you need to configure the scaling rule based on that. Here's an example of a ScaledJob that processes messages from a RabbitMQ queue:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: rabbitmq-consumer
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: rabbitmq-client
          image: rabbitmq-client:v1.0.1
          command: ["receive",  "amqp://user:PASSWORD@rabbitmq.default.svc          .cluster.local:5672"]
          envFrom:
            - secretRef:
                name: rabbitmq-consumer-secrets
        restartPolicy: Never
  pollingInterval: 10
  minReplicaCount: 0
  maxReplicaCount: 30
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 2
  triggers:
  - type: rabbitmq
    metadata:
      queueName: orders
      mode: QueueLength
      value: "20"
    authenticationRef:
      name: consumer-app-trigger

This ScaledJob will create Jobs to process messages from the orders queue when there are at least 20 messages in the queue (value). It will create up to 30 Jobs (maxReplicaCount), and it will keep a history of 5 successful jobs (successfulJobsHistoryLimit) and 2 failed jobs (failedJobsHistoryLimit). Now, let's break down the parameters from the previous ScaledJob example:

  • jobTargetRef is the spec you configure for the Job that KEDA will create, and the template field is required for that same reason.
  • pollingInterval is the interval KEDA uses to check the trigger source. By default, the interval is 30 seconds.
  • minReplicaCount is the minimum number of Jobs KEDA keeps running. By default, the minimum is zero. If it's different from zero, KEDA will keep a minimum number of Jobs running, which is useful if you don't want to wait for new Jobs to become ready.
  • maxReplicaCount is the maximum number of Jobs KEDA will create in every polling interval. If there are Jobs already running, KEDA subtracts them from that number. By default, the maximum is 100.
  • triggers is where you define the properties of the scalers to use.

The ScaledJob CRD has many more parameters and properties that we'll continue exploring throughout this chapter and the next one. For now, you have at least an idea of what a basic configuration looks like.

TriggerAuthentication and ClusterTriggerAuthentication

Some KEDA scalers require authentication. To cover this need, the TriggerAuthentication and ClusterTriggerAuthentication CRDs provide a way to securely manage authentication parameters. The difference between the two is that TriggerAuthentication is namespace-scoped, while ClusterTriggerAuthentication is cluster-wide. Continuing with the RabbitMQ example, here's a TriggerAuthentication that uses a Kubernetes secret to authenticate with the RabbitMQ server:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: consumer-app-trigger
spec:
  secretTargetRef:
    - parameter: host
      name: rabbitmq-consumer-secret
      key: RabbitMqHost

With TriggerAuthentication, you decouple the authentication part from the scaling part, and you can reuse the same authentication across multiple ScaledObject instances. This is a very simple example of the CRD manifest; check the official documentation to see what else you can configure, and in the next chapter we'll explore a few other parameters. For example, you can configure a Pod Identity when using a cloud provider, use HashiCorp Vault secrets, or work with secrets management solutions from cloud providers like AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager.
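
For context, the Secret that this TriggerAuthentication points to could look like the following sketch. The key name must match the parameter mapping above; the connection string itself is an assumption based on this chapter's RabbitMQ setup:

apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-consumer-secret
type: Opaque
stringData:
  RabbitMqHost: amqp://user:PASSWORD@rabbitmq.default.svc.cluster.local:5672   # assumed connection string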

Hands-On: Installing KEDA

Let's start with your first hands-on lab for this chapter. Go to your terminal, spin up your Kubernetes cluster again (go back to Chapter 1 and follow the lab to set up your environment), and run the following commands to install KEDA:

$ helm repo add kedacore https://kedacore.github.io/charts
$ helm repo update
$ helm install keda kedacore/keda --namespace keda --create-namespace

When the installation finishes, you should be able to see the KEDA controllers:

$ kubectl get pods -n keda

You should see an output like this:

NAME                        READY   STATUS    RESTARTS AGE
keda-admission-webhooks-... 1/1     Running   1        3m
keda-operator-...           1/1     Running   1        3m
keda-operator-metrics-...   1/1     Running   1        3m
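
Optionally, you can also confirm that KEDA registered itself as the cluster's external metrics provider; assuming a default installation, the following should show an APIService backed by the KEDA metrics server:

$ kubectl get apiservice v1beta1.external.metrics.k8s.io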

Alright, you're now ready to use KEDA to scale your workloads. Let's do that next.

Scaling Deployments

As you can tell by now, KEDA offers a more flexible approach than traditional HPA usage. For instance, it can scale to zero, which isn't possible with standard HPA, where the minimum number of replicas is always one. I'd say this is one of KEDA's most attractive features. However, scaling to zero might not be suitable for certain applications, such as web services where you don't want to miss any incoming requests. We'll explore this concept in more depth in Chapter 5 when we discuss KEDA's HTTP Add-on.

Similarly to HPA, KEDA also provides fine-grained control over scaling behavior. In addition to setting minimum and maximum replica counts, you can define cooldown periods to prevent rapid scaling oscillations and even implement advanced scaling policies. It's worth noting that while KEDA creates and manages HPAs behind the scenes, it abstracts away much of the complexity involved in setting up custom metrics adapters. This abstraction makes it significantly easier to implement advanced scaling scenarios that would be challenging with standard Kubernetes resources alone. You'll understand better what I mean after a few hands-on labs.

In the following hands-on sections, we'll explore practical examples of using KEDA to scale Deployments. First, we'll translate a custom metrics example from Chapter 3 into a KEDA-based solution, showcasing how KEDA simplifies the process. Then, we'll delve into a common scenario: scaling a queue consumer. Let's get into it.

Hands-On: Scaling using Latency

You've already practiced scaling your workloads using CPU utilization with HPA in the previous chapter. So instead of that, let's see how you can scale your application using a custom metric. You already did this with HPA, but this time I want you to see the main difference with KEDA: you don't need the Prometheus Adapter, because KEDA goes directly to the source (Prometheus) to calculate the replicas the workload needs. This has additional advantages, like being able to use multiple Prometheus instances and not needing to configure an adapter through a ConfigMap (which has a 1 MB size limitation anyway). So, open a terminal and change directory to the chapter04 folder of the book's GitHub repository. Then, run the following command to deploy the Monte Carlo application:

$ kubectl apply -f api/montecarlopi.yaml

Notice that, similarly to before, the application is intentional about how much CPU and memory it requests, and it also sets limits to influence the response time as the load increases, so that we can scale out horizontally:

        resources:
          requests:
            cpu: 900m
            memory: 512Mi
          limits:
            cpu: 1200m
            memory: 512Mi

Before you continue, let's confirm you have all the Prometheus pods running, as well as the Prometheus service by running this command:

$ kubectl get pods -n monitoring

Note

If you don't have the Prometheus stack installed, go back to Chapter 2 to install it.

The application exposes custom latency metrics at the /metrics endpoint, and for Prometheus to scrape those metrics, you need to add a ServiceMonitor. There's a YAML definition for this in the repository as well; if you explore the file api/monitor.yaml, you'll see that it simply targets the application:

spec:
  selector:
    matchLabels:
      app: montecarlo-pi
  endpoints:
  - port: http

Deploy the service monitor by running this command:

$ kubectl apply -f api/monitor.yaml

Wait around five minutes for Prometheus to pick up this new monitor and start scraping metrics from the application's /metrics endpoint. After this time has passed, deploy KEDA's ScaledObject to define an autoscaling rule for the application using the monte_carlo_latency_seconds_bucket metric. You should be using the following trigger:

triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090
      metricName: monte_carlo_latency_seconds
      threshold: "0.5" # You can't specify 500m, it needs to be translated
                       # to seconds
      query: sum(histogram_quantile(0.95,
             rate(monte_carlo_latency_seconds_bucket{namespace="default",
             pod=~"montecarlo-pi-.*"}[2m])))

Notice that it points directly to the Prometheus service in the serverAddress parameter, and it also defines a latency target of 0.5 seconds in the threshold parameter. Additionally, look at the query parameter: it's very similar to the query you used in Chapter 3 with the Prometheus adapter. Deploy the above ScaledObject by running this command:

$ kubectl apply -f api/latency.yaml

Confirm that the HPA rule has been created by running this command:

$ kubectl get hpa keda-hpa-montecarlo-pi-latency

Let's see autoscaling in action by sending some load; run this command:

$ kubectl apply -f ab-k8s/loadtest.yaml

Now monitor what happens by running this command:

$ watch kubectl get scaledobject,hpa,pods

You should see the latency start to increase by looking at the TARGET column of the HPA rule, and the number of replicas increase very rapidly to try to keep the latency at its target of 0.5 seconds (or 500ms). Even if the latency doesn't meet its target with 10 replicas, that's fine for this lab, as we only wanted to see how KEDA's autoscaling rule reacts to a metric exposed by the application. You can play around by adjusting the maxReplicaCount parameter of the ScaledObject to see whether the latency meets its target with a maximum of 15 replicas.

Wait around 10 extra minutes to see the number of replicas go back to 1 after the TARGET column from HPA goes back to 0. To clean up this lab, run the following command:

$ kubectl delete -f api

Controlling autoscaling speed

In the previous chapter, when we dived deep into HPA's speed of scaling, you learned that you can control HPA's speed by configuring advanced settings such as the scaleUp and scaleDown behaviors. For instance, you can make KEDA react faster when scaling up, and more conservatively (or faster than the default behavior) when scaling down. As KEDA uses HPA under the hood, you can customize the scaling speed through the advanced.horizontalPodAutoscalerConfig section of a ScaledObject, like this:

...
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
        scaleDown:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
...

Notice that we continue using the same configuration as in the previous chapter. Feel free to give it a try to confirm that you can replicate the same autoscaling rule you had in HPA. Effectively, the previous configuration won't wait to scale up, as stabilizationWindowSeconds is 0, and it will wait 60 seconds before it starts scaling down. Both behaviors use the Percent type with the same policy configuration, which means the replica count can increase or decrease by 100% of the current pods every 15 seconds (periodSeconds).

Scaling to Zero

One of the most attractive features of KEDA is the ability to scale down to zero replicas. You can reduce waste and costs when your application isn't needed: KEDA can bring your deployment down to zero replicas, and then scale it back up when activity resumes.

Scaling to zero works well in certain scenarios. For example, if you're scaling based on the number of messages in a queue, KEDA can scale to zero when the queue is empty. However, scaling to zero isn't suitable for all types of applications or scalers. Take REST APIs as an example. If you're scaling based on latency or number of requests, scaling to zero can cause problems: when there are no replicas running, the application won't be able to handle any requests at all, leading to errors or timeouts and a poor user experience. Because of this limitation, scaling to zero is best suited for event-driven workloads that can afford periods of inactivity without impacting service availability. That said, it's worth noting that scaling to zero even with HTTP-based applications is doable using the Cron scaler or Knative; we'll dive deeper into scaling HTTP-based applications to zero in the next chapter.

It's important to consider your application's nature and requirements before deciding to implement scaling to zero. For always-on services or APIs, maintaining at least one replica might be a better approach to ensure continuous availability. Let's see scaling from and to zero in action by scaling a queue-based workload.
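
Before diving into the lab, here's a quick hedged preview of the schedule-based option just mentioned: KEDA's Cron scaler keeps a fixed number of replicas inside a time window, which is one way to park an HTTP service at zero outside working hours. The timezone and schedule below are assumptions:

triggers:
- type: cron
  metadata:
    timezone: Europe/Madrid     # assumption: use your own timezone
    start: 0 8 * * 1-5          # scale up at 08:00, Monday to Friday
    end: 0 19 * * 1-5           # scale back down at 19:00
    desiredReplicas: "2"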

Hands-On: Scaling from/to Zero

A queue-based workload is a scenario where scaling to zero makes perfect sense. If there are messages in the queue, KEDA activates the Deployment and adds the necessary replicas to process them. As messages are consumed, fewer replicas are needed. When there are no messages left in the queue, KEDA pauses the Deployment so that zero replicas are running.

A simple way to demonstrate this scenario is to use a RabbitMQ queue. Think of a video or image encoder: you might have a producer that sends messages to the queue, and a consumer that processes messages from the queue. We've provided a simple app that does something similar. I'll get into the details of how it was built, and why we're configuring KEDA the way we are.

Deploy a RabbitMQ queue

Let's start by deploying a RabbitMQ queue to your cluster by running these commands:

$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ helm repo update
$ helm install rabbitmq --set auth.username=user \
--set auth.password=autoscaling bitnami/rabbitmq --wait

Wait around two minutes and the RabbitMQ pod should be running.

NOTE

If you're using Kind, you need to add the --set volumePermissions.enabled=true parameter to the helm install command.

It would be nice to keep an eye on the messages that arrive in the queue. To do so, let's make use of RabbitMQ's UI. Run this command to expose the service locally:

$ kubectl port-forward svc/rabbitmq 15672:15672

Open the following URL in the browser: http://localhost:15672/#/queues. You should see a screen similar to the one below:

Figure 4.2 – RabbitMQ UI showing the "queues" tab.

You'll come back to this screen later.

Deploy the sample application

The application you'll deploy next is a queue consumer that processes one message at a time (though you can configure it to process more). To simulate that it's doing some work, it waits 3 to 5 seconds (randomly) and then acknowledges to the queue that the message has been processed. The app keeps processing messages from the queue, and it stops once there's a "silence" period (30 seconds by default) with no messages. As you can tell, it's a simple application that consumes messages in batches from a queue. Feel free to explore it and learn more by reading the source code at chapter04/src/consumer/main.go in the GitHub repository.

The important aspect I want you to notice is that it processes one message at a time, and it does so continuously as long as there are messages to consume. When you do it this way, it's easier to know how many resources the pod needs to request per unit of processing, which in turn translates into making efficient use of the resources available in the cluster. Moreover, this setup makes it easier to configure KEDA's autoscaling policies, which we'll explore in a moment. So, let's deploy the app together with the KEDA ScaledObject by running this command:

$ kubectl apply -f queue-deployment/consumer.yaml

The important bits to highlight are the deployment's configuration:

...
          env:
          - name: RABBITMQ_URL
            value: "amqp://user:autoscaling@rabbitmq.default.svc.cluster.local:5672"
          - name: QUEUE_NAME
            value: "autoscaling"
          - name: BATCH_SIZE
            value: "1"
          resources:
            requests:
              cpu: 300m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 128Mi
...

Notice that it's configured to process one message at a time via the BATCH_SIZE environment variable, and that it's intentional about how many resources it needs. The rest of the configuration is very standard. Now, look at the ScaledObject configuration:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer
spec:
  scaleTargetRef:
    name: queue-consumer
  pollingInterval: 5
  cooldownPeriod:  90
  maxReplicaCount: 15
  triggers:
  - type: rabbitmq
    metadata:
      queueName: autoscaling
      mode: QueueLength
      value: "5"
    authenticationRef:
      name: rabbitmq-auth

Let's see in detail a few of its parameters:

  • pollingInterval: KEDA checks every 5 seconds whether there are new messages in the queue. Choose this value based on how long it takes to process a message and the maximum number of replicas you can launch.
  • cooldownPeriod: this one is important; it's the time KEDA waits before removing replicas once there are no more messages in the queue. Think about the last message a pod will process and how long, at most, it will take to finish processing it. Otherwise, you'll end up interrupting the process because of the scale-down event. If the value is too big, the pod sits idle for longer; if it's too low, you'll affect the app's performance. You need to test to find a number that works for you.
  • maxReplicaCount: as the sky isn't the limit, you need to specify a cap on how many pods can be created. In this case, at peak, 15 pods will take care of processing all the messages from the queue. If you want messages to be processed more quickly, one option is to increase this value. Just be mindful of the cluster's capacity; at this point, you haven't configured node autoscaling yet. It's also worth mentioning that there's no minReplicaCount configuration, which means the default value of zero is used.
  • value: it's basically the number of messages in the queue (as mode is QueueLength) that each replica should account for. As the app processes one message at a time, you could configure it to be 1, but I used 5 so you can see that you can delay the scaling operation a bit rather than launching one pod per message right away.

Deploy a message producer

We have also prepared a producer application that sends messages, all at once, to the RabbitMQ queue. You can configure the number of messages to generate using the MESSAGE_COUNT environment variable; for this lab, we'll be sending 150 messages. You can run the producer as many times as you'd like to test this lab. So, to start sending messages to the queue, run this command:

$ kubectl apply -f queue-deployment/producer.yaml

You won't see much other than a new Kubernetes job being launched.

Watch KEDA in action

Go back to RabbitMQ's UI to watch the autoscaling queue; you can go directly to this URL: http://localhost:15672/#/queues/%2F/autoscaling. You'll see that the messages have arrived in the queue, and some of them have already been processed. Look at the following screenshot for reference:

Figure 4.3 – RabbitMQ UI showing the messages from the autoscaling queue.

Then, in a new tab, run this command to watch how pods are being created:

$ watch kubectl get scaledobject,hpa,pods

Notice how KEDA started by creating 4 to 5 pods to begin processing messages. It then continues to create new pods while there are still messages in the queue. Finally, when all messages are processed, around three minutes after you launched the producer, you'll see how quickly the pods are removed, bringing the number of replicas back to zero. You might want to increase the cooldownPeriod to see how pods are kept around after the processing has finished. Even though this is great for the cluster's efficiency, it might not be good for your application's performance, especially if you have a long-running process. In the next section we'll explore an alternative for this. Feel free to run this test again by deploying the producer application again.

Cleanup

To remove all resources created for this lab, run the following command:

$ kubectl delete -f queue-deployment

Scaling Jobs

In the previous lab, you saw how KEDA can scale from zero up to a maximum number of replicas as long as there are messages in the queue to process, and how it can go back to zero when there aren't any messages left. However, you might have noticed that KEDA scales down too quickly, and pods might not have finished processing a message.

As suggested before, you can play around with the cooldownPeriod parameter by making it longer. You'll see that pods are kept around, and if you look at the logs, you'll see that they print a message saying the application is exiting because there are no messages to process. Kubernetes then marks the pod as Completed, but the pods will be restarted and start over again. Alternatively, you could handle the SIGTERM signal from within the application to delay pod termination. We're not going to do that now, but you'll see it in action later in this book, in Chapter 12.

So, instead of trying to figure out the proper cooldownPeriod to configure, let's give KEDA's ScaledJob a try: instead of scaling a Kubernetes Deployment, we'll create a Kubernetes Job for every message and let the Job complete when it has finished processing the message (or a batch of messages if you decide to change the BATCH_SIZE environment variable). ScaledJob instances are useful for long-running jobs where terminating a pod simply because there are no more messages in the queue is too costly. Let's see that in action.

Hands-On: Scaling Jobs

During this hands-on lab you're going to deploy the same application as in the previous lab, but as a set of Kubernetes Jobs. Even though the sample application isn't representative of a long-running job, it still serves the purpose of using KEDA to scale a job-based workload, letting each job finish when it needs to, and avoiding having to guess a configuration that prevents disruption when KEDA scales down. This approach will make even more sense when we need to protect jobs while underutilized nodes are being deleted by projects like Karpenter.

Explore the ScaledJob rule

Before you start playing around, take a look at the new ScaledJob rule definition:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: queue-job-consumer
  namespace: default

Notice that the first section is very similar; the only difference is the kind value. The main difference lies in the spec section, where you define the job template:

spec:
  jobTargetRef:
    template:
      spec:
        containers:
        ... # Same spec definition as the Deployment one
  pollingInterval: 5
  maxReplicaCount: 15
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2

I've removed the container's spec as it's the same as the Deployment's, but notice that we have included two new parameters: 1) successfulJobsHistoryLimit, to keep one job that finished successfully so that you can check the logs and confirm there was no error when KEDA scaled down, and 2) failedJobsHistoryLimit, to keep up to two failed jobs so that you can debug why they failed. These two parameters are optional, and I added them simply for troubleshooting purposes. Failed and completed jobs don't reserve resources in the cluster.

Deploy the sample application

After exploring the ScaledJob rule, let's deploy it by running this command:

$ kubectl apply -f queue-job/consumer.yaml

At this point there aren't any messages in the queue, so no jobs should be launched, but everything is now ready to consume messages when they arrive in the queue.

Deploy a message producer

As in the previous lab, I've prepared a producer application that will send messages, all at once, to the RabbitMQ queue. Deploy the job using this command:

$ kubectl apply -f queue-job/producer.yaml

A new pod is created, and it will be removed after sending all the messages.

Watch KEDA in action

Let's launch the RabbitMQ's UI by running this command to expose the service locally:

$ kubectl port-forward svc/rabbitmq 15672:15672

Then, open http://localhost:15672/#/queues/%2F/autoscaling in the browser to monitor the messages in the autoscaling queue. You'll see that the messages have arrived in the queue and have started being processed by the multiple jobs launched by KEDA. In a new terminal tab, run this command to watch how jobs are being created:

$ kubectl get scaledjob,jobs

Notice how KEDA created up to 15 jobs at once to start processing messages. Let's look at the logs of the last completed job to confirm that it finished without any errors. To do so, run the following command (replace 4cgp2 with the corresponding value):

$ kubectl logs job.batch/queue-job-consumer-4cgp2

You should get all the execution logs, but pay close attention to the last line:

... No new messages for 30s. Processed 21 messages. Exiting.

This means that the application finished without any problems, and more importantly, it wasn't interrupted when KEDA started to remove unnecessary jobs as there were no additional messages to process.

Cleanup

To remove all resources created for this lab, run the following command:

$ kubectl delete -f queue-job
$ helm uninstall rabbitmq

So that's it for now. You've now seen KEDA in action, first by scaling the same application you used in the previous chapter when practicing with HPA. KEDA simplified the setup and helped you scale faster, as it went directly to the metrics source instead of relying on the feedback loop through the Prometheus metrics adapter.

But that's not really where KEDA shines. That's why I wanted you to practice with its event-driven features, which are what allow you to scale from/to zero replicas for certain types of workloads, such as processing jobs. We only used the RabbitMQ scaler in this chapter, but the philosophy for the rest of the scalers is pretty similar. Lastly, I showed you a more effective way to scale long-running jobs, where interruption during the scale-down operation is too costly.

Summary

As you can tell by now, KEDA takes Kubernetes scaling to the next level, giving you more options to control how your application scales up and down. Besides the diverse set of scalers KEDA has to offer, one of the coolest things we looked at in this chapter was the ability to scale from/to zero. This feature can really help cut down on resource waste and costs, especially for applications that don't need to run all the time. But we also saw that it's not a one-size-fits-all solution. For some apps, like REST APIs, scaling to zero can cause more problems than it solves.

We explored KEDA's CRDs, and particularly how to use ScaledObjects to manage scaling Deployments by extending HPA's capabilities, while still giving you a lot of control over things like cooldown periods and how many replicas you want running. Moreover, you saw ScaledJobs in action for those batch jobs or tasks that need to run until they're done, without getting cut off mid-process.

Keep in mind that while KEDA gives you a better way of scaling workloads, you still need to think carefully about how to properly configure scaling rules based on what your application needs. It's all about finding the right balance for each specific app and situation. Head over to the next chapter to learn more about some advanced KEDA features.

5 Kubernetes Event-Driven Autoscaling (KEDA) – Part 2

In the previous chapter, I covered the basics of KEDA and why it has become so popular, and you were able to play around with its basic functionality. Now, I'm going a little deeper into some of KEDA's advanced features, accompanied by hands-on labs. To be more precise, in this chapter, I'll continue exploring a few other KEDA scalers, introducing additional patterns and use cases that showcase KEDA's versatility through very useful implementations I've seen among different companies. For instance, we'll explore how to implement time-based scaling to turn down replicas outside of your working hours, reducing the waste of having idle resources when you're not using them. We'll also spend some time exploring KEDA's HTTP add-on. For situations where a specific scaler isn't available, we'll discuss fallback strategies, including pausing autoscaling and caching metrics. As we progress, we'll dive into advanced autoscaling techniques, such as utilizing complex triggers with scaling modifiers and extending KEDA's capabilities through external scalers. We'll then shift our focus to KEDA's implementation on cloud providers like AWS, Azure, and Google Cloud, and you'll learn best practices for configuring KEDA in a secure manner. By the end of this chapter, you'll have a much better understanding of KEDA's advanced features and be prepared to implement more sophisticated autoscaling strategies in your Kubernetes environments. Learning these techniques will help you optimize resource usage, improve application responsiveness, and maintain a more cost-effective, resilient infrastructure. Remember, our ultimate goal is to keep the Kubernetes cluster as efficient as possible. We will be covering the following main topics:

  • Autoscaling in KEDA continued
  • Scaling based on schedule
  • KEDA's HTTP Add-on
  • What if a KEDA scaler is not available?
  • Advanced autoscaling features
  • KEDA with Cloud Providers

Technical requirements

For this chapter, you'll continue using the Kubernetes cluster you created in Chapter 1. You can continue working with the local Kubernetes cluster for the first half of the chapter. If you turned down the cluster, make sure you bring it up for every hands-on lab in this chapter. You don't need to install any additional tools, as most of the commands are going to be run using kubectl and helm. You can find the YAML manifests for all the resources you're going to create in this chapter in the chapter05 folder of the book's GitHub repository you already cloned in Chapter 1. The second part of this chapter is going to be hands-on with EKS, so you'll need to create an EKS cluster. The hands-on labs will tell you when you need it; it's basically for the ones where we use the AWS scalers. Go back to Chapter 1 for the instructions on how to create an EKS cluster using Terraform; you'll simply reuse the template I created for the book and run one single command. So, make sure you have access to an AWS account you can use to practice before you proceed.

Autoscaling in KEDA continued

In the previous chapter, we began to see KEDA in action by using the ScaledObject and ScaledJob CRDs. You also learned that there are multiple scalers available, and we could dedicate considerable time to discussing each of them. However, the purpose of this book is to help you build a solid knowledge base and enable you to explore and adapt KEDA's features to your various application needs. That's why in this chapter, I want to start by exploring a few other common scenarios where KEDA can help you be more efficient with your infrastructure resources, which, in turn, will help you reduce some infrastructure costs. By now, you might have an idea of KEDA's capabilities, but there are many other features we didn't have time to discuss in the previous chapter. Because I believe these features are important and I don't want you to miss them, I've decided to dedicate some pages to more advanced features, patterns, practices, and integrations that you might need in the near future. So, let's dive in and start learning how to save money by deactivating workloads at certain times of the day.

Scaling based on schedule

You have already learned that KEDA can scale your workloads down to zero replicas. A frequent question I get from customers is, "How do I turn down the dev environment while the team is not using it?" This becomes very important when your applications are running in the cloud, where you pay for what you use. KEDA has a cron scaler that you can use to define a start and end time during which a certain number of replicas is set on the target object. It's possible to define a timezone (from the IANA Time Zone Database) and configure the schedule using the Linux cron format. In case you don't know or don't remember this format, it looks like this:

* * * * *
| | | | |
| | | | +-- Day of the Week (0 - 6) (Sunday = 0)
| | | +---- Month (1 - 12)
| | +------ Day of the Month (1 - 31)
| +-------- Hour (0 - 23)
+---------- Minute (0 - 59)

In this bit of code, essentially each field can contain:

  • An asterisk (*) representing "every" unit (e.g., every minute, every hour)
  • A number to specify an exact value
  • A range (two numbers separated by a hyphen (-)) to specify a range of values
  • A list (comma-separated values) to specify multiple values

For instance, you can use 0 8 * * * to say 8 AM or 0 18 * * * to say 6 PM. To use the cron scaler, you'd define the following trigger:

triggers:
- type: cron
  metadata:
    timezone: US/Pacific
    start: 0 8 * * 1-5    # Start at 8:00 AM, Mon to Fri
    end: 0 18 * * 1-5     # End at 6:00 PM, Mon to Fri
    desiredReplicas: "1"

Effectively, this means that KEDA will scale the target to one replica from 8 AM to 6 PM from Monday to Friday using the Pacific time zone of the United States. Outside of this schedule, KEDA will scale the target to the minReplicaCount value (0 by default). However, it's worth mentioning that a limitation of this scaler is that it can't scale the target based on a recurring schedule. You'll learn in the next hands-on lab how to use this cron trigger in combination with another scaler to scale based on demand.

Note

There's a GitHub issue (#3356) discussing the possibilities of adding a new scaler to improve scaling using a time window, as the Cron scaler could cause confusion in some scenarios. At the time of writing, no decision has been made, but if you need to deactivate workloads during non-working hours, using the cron scaler is the only way to go, at least for now.

Hands-on lab: Scaling to zero during non-working hours

A very common scenario to optimize costs is to turn down environments during non-working hours; this, of course, applies to any pre-production environment. So, imagine that your workload is used only from 8:00 AM to 6:00 PM, Monday to Friday. Additionally, within that schedule, you might want to keep using a CPU utilization trigger to spin up replicas whenever needed to stay under a CPU threshold. To do so, you have to use both the cron and the cpu triggers within the ScaledObject rule; it should look like this:

...
spec:
  scaleTargetRef:
    name: montecarlo-pi
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
  - type: cron
    metadata:
      timezone: US/Pacific
      start: 0 8 * * 1-5
      end: 0 18 * * 1-5
      desiredReplicas: "1"
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"

Notice that the previous ScaledObject is defining a minReplicaCount of zero. This means that outside of the schedule configured in the cron trigger, KEDA will set the target to zero replicas, which is what we wanted to achieve in the first place. To see this in action, let's deploy the Monte Carlo application and the ScaledObject rule. Go to the chapter05 folder from the GitHub repository, and run the following commands:

$ kubectl apply -f montecarlopi.yaml
$ kubectl apply -f cron

If you're running this lab within the configured schedule, you should see one pod of the application running.

Note

You might need to adjust the schedule configuration to see it working, depending on when you're running this lab.

To confirm that the other trigger is working, let's send some load to the application for a short period of time to make it use more CPU. Run the following command:

$ kubectl apply -f ab-k8s

Then, to watch how pods are being created and how the triggers are making KEDA scale the workload to use more replicas, run the following command:

$ watch kubectl get scaledobject,hpa,pods

Notice that the cpu trigger started to work together with the cron trigger, and the replica count might grow to around three replicas. When the load test finishes and the stabilization window from HPA has passed, KEDA will bring the replicas back to 1, as per the schedule from the cron trigger. If you'd like to see how KEDA sets the replicas to zero, change the timezone configuration in the ScaledObject to a different one, like Europe/London. The idea is to force KEDA to apply the configuration it would use outside of the configured schedule. The trigger should look like this:

  - type: cron
    metadata:
      timezone: Europe/London

Apply the changes by running this command:

$ kubectl apply -f cron

Wait for the cooldownPeriod to finish, which by default is 300 seconds, and you'll see that KEDA removes the HPA rule. To clean up all the resources for this lab, run this command:

$ kubectl delete -f cron -f montecarlopi.yaml -f ab-k8s

In this lab, you learned how to combine a cron-based trigger with a CPU utilization trigger to create a cost-efficient autoscaling pattern: it optimizes costs by scaling to zero during non-working hours, while still being responsive to real-time CPU load during active periods. Next, let's explore a different approach to autoscaling based on incoming HTTP traffic using KEDA.

KEDA's HTTP add-on

Earlier in the book, I mentioned that KEDA's HTTP add-on lets your HTTP-based workloads scale to/from zero replicas. The HTTP add-on is an extension that provides a way to scale HTTP workloads based on incoming traffic, including the ability to scale to zero when there's no demand. It introduces a custom resource called HTTPScaledObject where you specify the application details, and the add-on takes care of creating the necessary Kubernetes resources to scale the application. With the HTTPScaledObject, you basically tell KEDA how to interact with an HTTP-based application and how much traffic it should be able to handle; for instance, you can configure scaling based on request rate or concurrency. You can also set up hostname-based routing to manage multiple applications within the same cluster, each responding to its own domain name.
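To give you an idea of what this looks like, here's a minimal sketch of an HTTPScaledObject. Keep in mind the add-on is still in beta, so the exact schema may differ depending on the version you install, and the names used here (myapp, myapp.example.com) are purely illustrative:

apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: myapp
spec:
  hosts:
  - myapp.example.com     # hostname-based routing for this application
  scaleTargetRef:
    name: myapp           # the Deployment to scale
    service: myapp        # the Service that receives the traffic
    port: 8080
  replicas:
    min: 0                # allow scaling to zero when there's no traffic
    max: 10
  scalingMetric:
    requestRate:
      targetValue: 100    # scale based on the incoming request rate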

Note

While alternatives like Knative and OpenFaaS exist, KEDA's HTTP add-on focuses solely on HTTP autoscaling within the Kubernetes ecosystem, and it integrates with KEDA core and HPA. However, it's important to note that the HTTP add-on is currently in beta, which means it may still undergo changes and improvements. Therefore, we've decided not to explore this add-on in depth for now, but we think it's an important add-on that might shape KEDA's future.

In this section, you learned how KEDA's HTTP add-on provides native support for autoscaling HTTP workloads based on traffic, and how it can help reduce resource usage by scaling to zero during idle times. This is particularly valuable for workloads that receive sporadic traffic or follow a schedule, like a development environment. Now, let's shift focus to explore what to do when KEDA can't scale your workloads, whether because a scaler is unavailable, misconfigured, or needs to be paused for operational reasons.

What if a KEDA scaler is not available?

There might be times when KEDA is not scaling applications because there's an error in the scaler configuration, the scaler is not available, or you need to momentarily pause any scaling action because you're in a maintenance window (e.g., performing an important deployment or upgrading the Kubernetes cluster). KEDA offers solutions for each of these cases. Let's explore them in more detail.

Caching metrics

The KEDA metrics server queries the scaler for each request coming from HPA, which by default happens every 15 seconds, on top of the polling KEDA already does every pollingInterval, which by default is 30 seconds. When you have multiple ScaledObjects making queries every 15 seconds against the same scaler, the scaler might get throttled or become unresponsive due to the high demand. For this reason, KEDA has a property called useCachedMetrics, within a trigger section, to enable or disable (the default) a cache of metric values from the scaler during the polling interval. This means you could cut the number of calls to the scaler by half (in the default scenario), or even more if you extend the polling interval. When the metrics cache is enabled, every request to the KEDA metrics server reads the value from the cache instead of making a direct query to the external service. Let's continue using one example from previous labs and enable the cache for the RabbitMQ trigger. The configuration will now look like this:

triggers:
  - type: rabbitmq
    useCachedMetrics: true
    metadata:
      queueName: autoscaling
      queueLength: "5"
    authenticationRef:
      name: rabbitmq-auth

Consider enabling the cache for commonly used scalers when applications can tolerate a slightly stale value from the scaler. However, if your application needs to scale as soon as possible, caching the metric value might not be suitable; in that case, it's better not to use caching so that KEDA retrieves the most up-to-date metrics directly from the source.
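If your workloads can tolerate slightly older values, you can also combine useCachedMetrics with a longer pollingInterval at the ScaledObject level. The following is a hedged sketch of that combination, reusing the RabbitMQ trigger from above (the values are illustrative):

spec:
  pollingInterval: 60          # KEDA refreshes the metric every 60s instead of the default 30s
  triggers:
  - type: rabbitmq
    useCachedMetrics: true     # HPA requests read the cached value between refreshes
    metadata:
      queueName: autoscaling
      queueLength: "5"
    authenticationRef:
      name: rabbitmq-auth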

Note

You can't cache metrics for cpu, memory, cron scaler, or ScaledJobs.

Pausing autoscaling

Imagine that a KEDA scaler is not available, is not working properly, or you simply need to run a maintenance operation where any scaling action might actually cause a problem. KEDA has the ability to pause autoscaling, so instead of removing the ScaledObject resource, modifying its scaling behavior, or even making a change within the application, you simply need to add (or change) a special set of annotations on the ScaledObject resource. To pause a ScaledObject, you can add the following annotation(s):

metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "0"
    autoscaling.keda.sh/paused: "true"

You don't need to specify both annotations; either one on its own will pause autoscaling. However, in some scenarios, using both annotations together can be beneficial. This combination is useful during maintenance windows or operational tasks where you need to guarantee a stable replica count and prevent any scaling changes. Let me explain what each of these annotations does so that you can choose properly:

  • autoscaling.keda.sh/paused-replicas will set the desired number of replicas to the configured value in this annotation and will pause autoscaling. Use this annotation if you need to override the existing number of replicas for your workload. The existing number of replicas might be too low or too high, and you'd like to set the number of replicas that can offer service continuity for your application.
  • autoscaling.keda.sh/paused will pause autoscaling, but it will keep the existing number of replicas for your workload. This approach is simpler, and it might be the option that works most of the time, especially if the pausing period is short or scaling up or down is too costly (and not worth it).

To resume autoscaling, you can either remove the above annotations or simply change the autoscaling.keda.sh/paused annotation to false.
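In practice, you can toggle these annotations with kubectl annotate instead of editing the manifest. For example, assuming a ScaledObject named queue-consumer (adjust the name to your own resource):

$ kubectl annotate scaledobject queue-consumer \
autoscaling.keda.sh/paused=true --overwrite

To resume, flip the annotation back:

$ kubectl annotate scaledobject queue-consumer \
autoscaling.keda.sh/paused=false --overwrite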

Fallback scaling actions

Imagine you have no idea that a scaler is not working, or something goes wrong with the scaler during non-working hours. You might have your monitoring systems configured to alert you about it, but as a safety mechanism, you can configure a ScaledObject to fall back to a certain number of replicas if KEDA fails to collect metrics from a target a certain number of times. To configure fallback, you need to use the following section:

  fallback:
    failureThreshold: 3
    replicas: 6

In this code, failureThreshold is the number of consecutive times KEDA can fail to get metrics from the source, and replicas is the desired number of replicas you want to have if the failure threshold is met.
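For context, the fallback section lives at the spec level of the ScaledObject, next to the triggers it protects. Here's a minimal sketch, assuming a workload named queue-consumer and a Prometheus trigger (both illustrative, since fallback isn't supported for every scaler, as noted next):

spec:
  scaleTargetRef:
    name: queue-consumer
  fallback:
    failureThreshold: 3        # after 3 consecutive failed metric reads...
    replicas: 6                # ...hold the workload at 6 replicas
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(rate(http_requests_total[2m]))
      threshold: "100"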

Note

You can't configure fallback for ScaledJobs objects, for cpu and memory scalers, or any other scaler whose metric type is a value, like the rabbitmq or cron scalers.

So far, you've learned how to handle scenarios where KEDA can't scale, using tools like metric caching, pausing, and fallback configuration. These mechanisms ensure that your applications remain stable and available even when they can't scale. Now, let's dive deeper and look at how KEDA can support complex autoscaling logic with advanced features such as scaling modifiers and external scalers, which let you set up autoscaling rules for very specific scaling requirements.

Advanced autoscaling features

After all the KEDA features we've explored so far, along with the vast number of scalers available, you might find yourself in the scenario where something is still missing to meet your scaling configuration needs. To try to cover complex autoscaling rules, KEDA offers advanced features that you might find useful, like scaling modifiers or using external scalers. Let's explore each of these options in more detail.

Note

We tried to cover the most common and stable KEDA features in this book, but KEDA might have added additional features that we didn't cover, or we didn't include them for different reasons. So, before you decide to use any of the upcoming advanced features, make sure you're not adding unnecessary complexity and understand very well what your autoscaling needs are.

Complex triggers with scaling modifiers

There might be times when, even with the ability to configure multiple triggers, you need custom conditions to decide how to scale. Why? Consider that when you have multiple triggers, HPA calculates how many replicas are needed for each metric and uses the maximum. If you have three cron triggers, the maximum desired number of replicas wins. But maybe you'd like to sum all of them, particularly when each trigger represents an independent source of expected load; summing them ensures your application scales to handle the combined demand across all those dimensions (see the summing sketch after the parameter breakdown below). Let me give you another example. Say you have an application that processes messages from a RabbitMQ queue and then persists some aggregated data into a MySQL database. Scaling only based on the number of messages in the queue might, at some point, cause problems for the database. You might have no problem processing tons of messages, but you need to be careful not to bring the database down. Therefore, you need to consider metrics like write latency or replica lag to decide whether to scale based on the RabbitMQ trigger or stop adding replicas until the database is stable. To solve this problem, KEDA offers a feature called scalingModifiers that can help you take autoscaling to the next level. When you opt in to this option, KEDA will use this configuration to decide how to scale. You define a formula where you can reference metrics obtained from scalers and use mathematical and/or conditional statements. This formula returns a value, a composed metric, that joins all the metrics from the scalers that support scalingModifiers (i.e., CPU and memory are ignored) into a single one, and the calculated value is the one used to scale. In other words, the new composed metric takes precedence over what each scaler would have scaled independently. Let's continue with the previous example to understand how it works and how to use it. If the database write latency is below 100 ms, the database is working fine, so you can keep using the number of messages in the queue to decide how to scale. If not, you need to pause autoscaling by matching the formula value to the target value. It's like saying to KEDA, "Hey, we're good, no need to scale up or down." Here's how the configuration looks:

...
advanced:
  scalingModifiers:
    target: "5"
    activationTarget: "1"
    metricType: "AverageValue"
    formula: "db_write_latency < 100 ? queue : 5"
...

Let's break down each of these parameters:

  • target is the new value you define to scale; it's the target the autoscaling rule needs to maintain. It will add or remove replicas to keep the target.
  • formula is where you compose the new metric, which must evaluate to a single value (not a boolean). You can reference metrics from other scalers (in this example, the triggers named db_write_latency and queue), and use mathematical operations and conditional statements. KEDA uses the expr expression language; you can learn more about the expressions you can use at github.com/expr-lang/expr. If fallback is configured, the formula won't modify any metric.
  • activationTarget is optional, and it's the target value you define to activate scaling, in other words, to go from 0 to 1 replica. By default, it is 0.
  • metricType is optional, and it's the metric type to use for the composed metric from formula. By default, it is AverageValue, but it can be Value too.
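For the summing scenario mentioned earlier, the formula can simply add the named triggers together. A hedged sketch, assuming three triggers named queue_a, queue_b, and queue_c (the names and target are illustrative):

advanced:
  scalingModifiers:
    target: "10"
    metricType: "AverageValue"
    formula: "queue_a + queue_b + queue_c"   # scale to handle the combined load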

Note

You can't configure scalingModifiers for ScaledJobs objects or for cpu and memory scalers.

Let's see scaling modifiers in action with the following hands-on lab.

Hands-on lab: Pausing autoscaling when resources are constrained

In Chapter 4, you already practiced scaling a Kubernetes deployment using a RabbitMQ trigger. I'm going to continue using that lab, with the additional complexity I've been discussing, to help you explore scaling modifiers. Imagine the application persists some processed data to a MySQL database. You want to avoid bringing down the database if too many messages arrive in the queue and you need to scale up accordingly. To prevent that, we'll use an external metrics API to get the database write latency in milliseconds. If the write latency is 100 ms or higher, you need to pause autoscaling and resume it once the database is stable. Start by creating a RabbitMQ queue by running this command:

$ helm install rabbitmq --set auth.username=user \
--set auth.password=autoscaling bitnami/rabbitmq --wait

Next, deploy the consumer application that processes one message from the queue at a time. We won't explore the YAML manifest as it's the same as in the previous chapter. Run this command to deploy the consumer:

$ kubectl apply -f formula/consumer.yaml

For now, the consumer will have only one replica consuming messages. To make this lab easier to configure, I've created a dummy API that returns a static number representing the database write latency. It returns a JSON payload like this:

{
  "database": {
    "metrics": {
      "write_latency": 150
    }
  }
}

Deploy the dummy API by running this command:

$ kubectl apply -f formula/dummyapi.yaml

Confirm that the API is working by running this command:

$ kubectl port-forward svc/dummyapi 8080:80

Then, open http://localhost:8080/ in a new browser window. Notice that the write latency is 150, which means the database is not stable and the consumer shouldn't scale up when new messages arrive in the queue. To configure that autoscaling rule, let's deploy a ScaledObject with two triggers: one for the RabbitMQ queue, and one to query the dummy API, like this:

...
  triggers:
  - type: rabbitmq
    name: queue
    metadata:
      queueName: autoscaling
      mode: QueueLength
      value: "5"
    authenticationRef:
      name: rabbitmq-auth
  - type: metrics-api
    name: db_write_latency
    metadata:
      targetValue: "100"
      url: "http://dummyapi.default.svc.cluster.local/"
      valueLocation: 'database.metrics.write_latency'
...

Notice that each trigger now has a name set, which is required when using scalingModifiers, as you need a way to reference them. Also, we're using the metrics-api scaler, which scales based on a metric provided by an API you might own, like in this lab. Then, notice how these two trigger names are used in the formula field to configure the autoscaling rule we saw previously. It should look like this:

...
  advanced:
    scalingModifiers:
      target: "5"
      formula: "db_write_latency < 100 ? queue : 5"
...

Remember, the formula says that when the database write latency is 100 ms or higher, autoscaling is effectively paused to avoid creating new replicas. You don't need to write this autoscaling rule from scratch; to deploy it, simply run the following command:

$ kubectl apply -f formula/scalingmodifiers.yaml

Before you start testing this scenario, let's run a command to see that no new replicas are being created when you send messages to the queue, as the database isn't stable for now. Run this command in a new terminal:

$ watch kubectl get pods,scaledobject,hpa

Ignore any CrashLoopBackOff errors you might see; the pod will be restarted and will work again, as we discussed in Chapter 4. Notice that HPA has only one target instead of two: even though you have two triggers, KEDA created the HPA object using the composed metric you defined in the formula field. The output should look similar to this:

NAME  REFERENCE      TARGETS   MINPODS MAXPODS REPLICAS AGE
...   queue-consumer 5/5 (avg) 1       15      1        2m

Let's send a few messages to the queue by running the following command:

$ kubectl apply -f formula/producer.yaml

You might want to see how many messages you have in the queue, so run the following command to expose the RabbitMQ UI:

$ kubectl port-forward svc/rabbitmq 15672:15672

Then open http://localhost:15672/#/queues/%2F/autoscaling in a new browser window. When prompted for login, type user as the user and autoscaling as the password. You should see that messages are being consumed slowly, as you only have one replica running. If the consumer pod is in CrashLoopBackOff, simply delete it and let Kubernetes recreate it. Let's simulate that the database is now stable by changing the static value configuration in the dummy API deployment. You need to change the following environment variable:

        env:
        - name: DUMMY_VALUE
          value: "150"

To make this change, run the following command:

$ kubectl set env deployment/dummyapi DUMMY_VALUE=85

And this is where scaling modifiers shine. Notice that as soon as HPA gets the new value from the dummy API, new pods are created, as the database is now stable. Messages from the queue will now be consumed very rapidly. To clean up the resources created by this lab, run these commands:

$ kubectl delete -f formula
$ helm uninstall rabbitmq

In this hands-on lab, you learned how to use scaling modifiers to build conditional autoscaling logic in KEDA. By combining multiple triggers, one for message queue length and another for database write latency, you were able to pause scaling when a backend system was under pressure and resume it once the system stabilized. This is useful when building context-aware autoscaling strategies that respond not only to application demand, but also to the health of dependent systems. In the next section, we'll explore external scalers, which allow you to extend KEDA's functionality when the built-in scalers aren't enough.

Extending KEDA with external scalers

Another way to extend KEDA's capabilities is by using external scalers. KEDA offers many built-in scalers, which I'd recommend you explore before creating an external scaler. However, if the scaler you need doesn't exist yet, or your organization needs to own the code because you need to interact with private APIs, then you might consider creating an external scaler. Another reason might be that the scaler's complexity requires a separate controller to create additional resources, similar to what the HTTP add-on does when creating a resource to hold incoming requests while scaling from 0 to 1. This book won't cover how to create an external scaler, as that would require going deep into coding one and would need its own chapter; instead, we'll simply cover at a high level what you'd need to do to create one, and how to use it. KEDA's built-in scalers run within the KEDA operator, while external scalers run outside of KEDA, and KEDA communicates with them through remote procedure calls using gRPC (an open-source framework for communication between services, originally developed by Google). To build an external scaler, the endpoint needs to implement KEDA's scaler interface so KEDA knows whether the scaler is active and can query the metrics used to decide how many replicas are needed. To use an external scaler, you have either the external or external-push trigger types. The main difference is that with external-push, the scaler keeps a stream open to KEDA and pushes updates instead of waiting to be polled. The following is how you'd use an external scaler:

  triggers:
  - type: external
    metadata:
      scalerAddress: your-internal-scaler-endpoint:8080
      organization: eCommerce
      business_unit: Payment

The scalerAddress is the endpoint of the scaler API that implements the scaler interface, and the other two fields are custom fields used to pass parameters to the scaler; KEDA sends the entire metadata object to the external scaler endpoint. There are other fields like caCert, tlsCertFile, tlsClientCert, tlsClientKey, and unsafeSsl that are used to configure TLS authentication to the scaler endpoint. External scalers allow you to extend KEDA's scaling features to your needs, but you need to consider the added burden of operating and maintaining an external scaler. So, make sure you have a valid reason to build one, and explore the existing list of scalers first.

KEDA with Cloud Providers

Another very common use case of KEDA is integration with scalers for the different cloud providers. Instead of having to build all the machinery to use metrics provided by the vast number of services a cloud provider offers, you can simply make use of the scalers already available in KEDA. Without these scalers, you might need to use other tools like Prometheus to scrape metrics from cloud providers' services. This not only makes autoscaling for your applications more complex, but it also slows it down, as you'd be adding too many components in the middle before making these metrics available to HPA (assuming you wouldn't go and build your own autoscaler). In the following sections, we'll briefly explore how to use KEDA with cloud providers. This integration is what truly makes event-driven autoscaling a reality in Kubernetes. You'll see this in action in Chapter 11 and Chapter 12, where we'll be using Amazon EKS. In this chapter, however, we'll cover at a glance how KEDA can be used in Microsoft Azure and Google Cloud Platform. The most important aspect is authentication: to use KEDA in a cloud provider, you need to consider which permissions are needed, and in terms of security, the fewer privileges it has, the better. As each cloud provider has its own set of services, identity configurations, and best practices, we won't go into much depth, as that would require a separate chapter for each one of them. Instead, we'll focus on the most important aspect: security.

KEDA on Amazon EKS

To use KEDA on Amazon EKS, the recommendation is to first give KEDA's controller the capability to assume other IAM roles, as you don't want to grant too many permissions to KEDA's controller, for the following reasons:

  • For security purposes
  • To avoid restarting KEDA's controller every time you need to modify its permissions

So, when you install KEDA, you need to configure which IAM role, tied to a Kubernetes service account, the controller will use. Regardless of the identity model you use on EKS, IAM Roles for Service Accounts (IRSA) or EKS Pod Identity, a Kubernetes service account is tied to an IAM role. We'll skip these configuration steps and how they work for now, as we'll cover them in Chapter 11. Let's say the Kubernetes service account you'll use for KEDA's controller is named keda-operator. If you're installing KEDA through Helm, you need to make sure the service account is properly configured. By default, it looks like this:

serviceAccount:
  operator:
    create: true
    name: keda-operator
    automountServiceAccountToken: true
    annotations: {}
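If you're using IRSA instead of EKS Pod Identity, the association between the service account and the IAM role is made through an annotation. A hedged sketch of the Helm values for that case, assuming the role is also named keda-operator (replace ACCOUNT_NUMBER with your own):

serviceAccount:
  operator:
    create: true
    name: keda-operator
    automountServiceAccountToken: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_NUMBER:role/keda-operator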

Either way, this service account will be tied to an IAM role, also called keda-operator. If you're using EKS Pod Identity, the role's trust policy should look like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ],
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "ACCOUNT_NUMBER"
                },
                "ArnEquals": {
                    "aws:SourceArn": "EKS_CLUSTER_ARN"
                }
            }
        }
    ]
}

This configuration is what allows KEDA's controller to assume other IAM roles, which in this case will be the ones assigned to the workloads (pods) you want to scale with KEDA. We recommend this approach because the workload you want to scale already interacts with an AWS service, so it already has the permissions to talk to the AWS API. For instance, if the workload consumes messages from an AWS SQS queue, it should already have permissions to at least read messages, delete messages, and describe the queue to get its length, something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ConsumeSQSQueue",
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:ReceiveMessage",
                "sqs:SendMessage",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl"
            ],
            "Resource": "arn:aws:sqs:${AWS_REGION}:${ACCOUNT_ID}:${QUEUE_NAME}"
        }
    ]
}

Once KEDA's controller is able to assume other IAM roles, and the workloads are able to interact with the AWS APIs, the next step is the authentication configuration for the ScaledObject. We recommend using the authenticationRef property to delegate this configuration to a TriggerAuthentication object, which should have the following manifest:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
spec:
  podIdentity:
    provider: aws
    identityOwner: workload

Notice that you're saying it should use AWS as the provider and take the identity (identityOwner) from the workload. KEDA's controller will assume the IAM role the workload is using in order to collect metrics for the AWS scaler configured in the triggers section of the ScaledObject, like this:

...
  triggers:
  - type: aws-sqs-queue
    authenticationRef:
      name: aws-credentials
    metadata:
      queueURL: autoscaling
      queueLength: "5"
      awsRegion: "eu-west-1"

The advantage of using a TriggerAuthentication to configure the AWS credentials is that you can reuse it across other ScaledObjects. Moreover, because we recommend configuring authentication using the workload's identity, permissions management is simpler: KEDA assumes the workload's IAM role, so you only configure the permissions a scaler needs instead of giving KEDA's controller superpowers.

Note

At the time of writing this book, KEDA has five built-in scalers for AWS: AWS CloudWatch, AWS DynamoDB, AWS DynamoDB Streams, AWS Kinesis Stream, and AWS SQS Queue. Regarding authentication providers, KEDA supports AWS IRSA, plus AWS Secrets Manager for using secrets within KEDA's CRDs. AWS has an alternative, recommended approach called EKS Pod Identity, which we'll explore in more depth in Chapter 11.

KEDA on Azure Kubernetes Service

To use KEDA on Azure Kubernetes Service (AKS), you can simply enable the KEDA add-on that installs all the KEDA components integrated with AKS. You can enable it through an Azure Resource Manager (ARM) template by adding the following section:

"workloadAutoScalerProfile": {
        "keda": {
            "enabled": true
        }
    }

Or, you can also use the Azure CLI and run the following command:

$ az aks update --resource-group RESOURCE_GROUP_NAME \
--name AKS_CLUSTER_NAME --enable-keda

Notice that you basically add the --enable-keda parameter.

Note

As there are KEDA external scalers for AKS, like the one for Cosmos DB, these will need to be installed separately, as the KEDA add-on only installs KEDA's built-in scalers.

Once KEDA is enabled, you first need to enable the AKS cluster to interact with the Azure APIs your workloads will need to scale. For this book, we won't go in depth into the commands you need to run; we'll only cover at a glance what the process looks like. To continue with KEDA's configuration, the next step is to create an Azure Identity for the workloads and the KEDA operator. Then, to avoid having to specify secrets, you need to create a federated credential between the Azure Identity and the Kubernetes service account that the workload you'll scale is going to use. You also need to create a second federated credential to give permissions to KEDA's controller. Then, depending on the Azure scaler you'll use, you need to grant the corresponding permissions to the Azure Identity you created. For instance, if you're using the Azure Service Bus scaler, you need to give the Azure Identity permissions to that API. Now you need to create a TriggerAuthentication object to allow scalers to use the Azure APIs. The manifest should look like this:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-credentials
spec:
  podIdentity:
    provider: azure-workload
    identityId: $MI_CLIENT_ID

Notice that the provider is azure-workload, and that identityId is the ID of the Azure Identity you created before. With this configuration, KEDA will be able to use the workload identity to collect metrics from the scaler configured to scale the workload.

Note

At the time of writing this book, KEDA has nine built-in scalers for Azure: Azure Monitor, Azure Pipelines, Azure Event Hubs, Azure Storage Queue, Azure Application Insights, Azure Data Explorer, Azure Log Analytics, Azure Service Bus, and Azure Blob Storage. Additionally, there are two external scalers: KEDA External Scale for Azure Cosmos DB and Durable Task KEDA External Scaler. And regarding authentication providers, KEDA has support for Azure AD Workload Identity and Azure Key Vault Secret.

KEDA on Google Kubernetes Engine

Using KEDA on Google Kubernetes Engine (GKE) is pretty similar to what you've seen so far. We won't go in depth on how to configure the security aspects of using KEDA on GKE, but we'll explore at a glance what the process looks like. First, you need to deploy KEDA's components to GKE, and the Kubernetes service account that KEDA's controller uses is tied to a Google Service Account (GSA) that has the proper permissions to call the Google Cloud Platform (GCP) APIs. Then, depending on the scalers you'll use to scale your workloads, you need to grant the corresponding permissions to that GSA. Similarly to the other cloud providers, the recommended approach, if your workloads are running in GKE, is to use a TriggerAuthentication object, which for GCP looks like this:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: gcp-credentials
spec:
  podIdentity:
    provider: gcp

Notice that here you simply configure GCP as the provider. When your ScaledObject references this authentication object, KEDA's controller will impersonate the IAM service account used by the workloads you're scaling.

Note

At the time of writing this book, KEDA has four built-in scalers for GCP: Google Cloud Platform Pub/Sub, Google Cloud Platform Storage, Google Cloud Platform Cloud Tasks, and Google Cloud Operations. And regarding authentication providers, KEDA has support for GCP Secret Manager and GCP Workload Identity.

As you can see, the approach is the same across all three cloud providers; the difference lies in the service names each provider uses and how they implement the security configuration. The reason we dedicated a section to how KEDA works on cloud providers is to show you how well KEDA adapts to what each of these cloud providers needs security-wise. Avoid giving extra permissions to KEDA's controller, and instead reuse the permissions of the workloads you're scaling.

Summary

In this second part on KEDA, you explored the benefits and autoscaling capabilities it brings to your applications. While you might have already had an idea from the previous chapter, in this chapter I wanted to highlight very useful features that not many people know about. You might have heard that KEDA goes beyond CPU, memory, and external metrics, and that you can use KEDA as an event-driven autoscaler for your workloads. However, in this chapter, I wanted to focus on a few advanced features, along with additional scaling scenarios where KEDA fits very well. Moreover, I wanted to give you an introduction to how KEDA handles security when you need to use scalers from cloud providers like AWS, Azure, or GCP. I could dedicate a full chapter to each cloud provider, but as you can see, KEDA works similarly for many of the scalers, and that's the intention of the project. You explored several features in this chapter, but one I'd like to highlight is scaling modifiers. This feature gives you a lot of flexibility to adapt to almost any autoscaling logic you might need, and you can extend KEDA's behavior by pausing or even speeding up scaling based on custom conditions. Alright, that's it for now. Head over to the next chapter to close the workload autoscaling section of the book and learn how to troubleshoot KEDA, along with a few other tools to help you optimize your workloads in Kubernetes.

6 Workload Autoscaling Operations

Up until this point, we've been exploring multiple ways to make your workloads more efficient in Kubernetes. In other words, we've discussed how to ensure your workloads are not wasting the resources available in the cluster. You've learned that to get consistent results, you need to be intentional about the CPU and memory requests your pods have. By doing so, you can then make use of tools like HPA, VPA, and KEDA to adjust the number of replicas based on utilization or on events that impact the performance of your workloads. However, how do you know you're right-sizing your workloads properly? How do you keep doing it continuously, given it's not a one-time task? And how do you know why your workloads might not be scaling? Well, you need to learn how to do proper troubleshooting: read logs, interpret metrics, and watch for events in the tools you're using. In this chapter, we'll focus on the operational aspect of autoscaling your workloads, with an emphasis on efficiency. You'll learn how to troubleshoot the different workload autoscaling tools we've explored so far: HPA, VPA, and KEDA. After reading Chapter 4 and Chapter 5, you might have decided to use (or switch to) KEDA for autoscaling your workloads, even if you're not quite there yet on the event-driven aspect. Therefore, we'll show you how to set up a Grafana dashboard to get a quick overview of what could be happening with your KEDA scaling objects. Finally, we'll close this chapter, and the whole workload autoscaling section, with some best practices on how to keep your efficiency score in an optimal state. Learning how to monitor, troubleshoot, and optimize autoscaling using KEDA will help you ensure your applications remain responsive, cost-efficient, and resilient, especially as your workloads and infrastructure evolve. You'll be building confidence for production environments, where visibility and control are just as important as automation. We will be covering the following main topics:

  • Workload autoscaling operations
  • Troubleshooting workload autoscaling
  • Monitoring KEDA
  • Upgrading KEDA
  • Best practices for workload efficiency

Technical requirements

For this chapter, you'll continue using the Kubernetes cluster you created in Chapter 1. You can continue working with the local Kubernetes cluster for the first half of the chapter. If you turned down the cluster, make sure you bring it up for every hands-on lab in this chapter. You don't need to install any additional tools, as most of the commands are going to be run using kubectl and helm. You can find the YAML manifests for all resources you're going to create in this chapter in the chapter06 folder from the Book's GitHub repository you already cloned in Chapter 1.

Workload Autoscaling Operations

In an ideal world, systems work as expected all the time. But the reality is that most of the systems we interact with need maintenance, and things break all the time. This holds true for all the autoscaling tools and configurations we've explored so far. There might be times when you'll wonder why HPA, VPA, or KEDA isn't adding or removing replicas from your workloads. As these are Kubernetes components, troubleshooting them is not very different from usual, and you'll end up reusing most of the tools and techniques you already use to troubleshoot your workloads. Additionally, you've already learned, through the constant emphasis in previous chapters, that right-sizing your workloads is a very important task, especially as you configure scaling rules, and it will become even more crucial when we explore node autoscaling. So far, you've deployed a simple Grafana dashboard in Chapter 3 to get a first glimpse of how to adjust the resources you request for every workload. But you'll be doing this task continuously, as your workloads will be in constant change. Therefore, in this chapter, we'll spend some time exploring how to troubleshoot your workload autoscaling rules, and we'll show you a few options (out of many) you have to back up with data how efficient your workloads are with the resources available in the Kubernetes cluster. Let's start by looking at how and which logs to read.

Troubleshooting Workload Autoscaling

In previous chapters, you've been practicing how to configure autoscaling rules for your workloads. So far, all of those hands-on labs should have worked without any problems (if they didn't, please open a GitHub issue). These labs have been built to help you learn the concepts we've covered so far. However, when you apply those concepts to your own environment, things might not work as expected. The following sections are aimed at making the troubleshooting task a bit easier. Moreover, we'll include typical things to look at for common problems that you might not have experienced while doing the labs in this book. It's worth mentioning that we're assuming you already have a solid understanding of how these tools work, as we'll focus solely on troubleshooting.

Troubleshooting HPA

One of the most common classes of problems in HPA relates to retrieving metrics. You might have some misconfiguration in place, or the metrics provider might not be working. The most common problems are the following:

  • The Metrics Server component is not installed in the cluster or isn't working properly. Therefore, HPA won't be able to get the basic metrics like CPU and memory from the pod(s).
  • Custom or External metrics aren't available. For instance, you might have configured a rule to scale using metrics from Prometheus, but there's no adapter translating those metrics, or the query you've configured has an error.
  • Networking issues between the cluster and the metric sources.
  • Missing role-based access control (RBAC) permissions to interact with the Metrics API.

The solution to each of the problems listed here might be obvious (i.e., deploy the Metrics Server, restart the Metrics Server, fix the scaling rule configuration, grant proper RBAC permissions, and so on). But these might not be the only problems you face. So instead of giving you an extensive list of problems with their possible solutions, let's explore a few commands that can help you identify and confirm what could be wrong with your HPA scaling rules.

Review HPA conditions and events

Every HPA has a field named status.conditions that you can review. Kubernetes updates this field with details indicating whether or not HPA could scale the target object. For instance, you can get details about HPA's ability to collect the metrics needed to determine the number of replicas, whether the rule is active, or whether it's possible to add or remove replicas. To review the conditions of an HPA object, describe it by running the following command:

$ kubectl describe hpa montecarlo-pi

You might get an output like the following:

Conditions:
  Type         Status  Reason          Message
  ----         ------  ------          -------
  AbleToScale  False   FailedGetScale  the HPA controller was unable to get the target's current scale: deployments/scale.apps "montecarlo-pi" not found

In this case, the message indicates that the montecarlo-pi deployment doesn't exist; therefore, HPA is not able to scale. Once you've identified this issue, the next step is to check whether the deployment name is correct and ensure that the deployment has been created in the correct namespace. If it's missing, you should redeploy the application or update the HPA target reference accordingly. When you look at the Events section of the same describe output, you'll see something like this:

Events:
  Type     Reason          Age  From                       Message
  ----     ------          ---  ----                       -------
  Warning  FailedGetScale  5s    horizontal-pod-autoscaler  deployments/scale.apps "montecarlo-pi" not found

Similar to the conditions field, the events section also provides valuable information about any scaling action that HPA might or might not be taking.

Check Metrics Server logs

Remember that HPA relies on the Metrics Server component. When HPA can't fetch metrics for resources like CPU and memory from the target object, it won't be able to determine the number of replicas needed to meet the target. If you see an error message in the conditions or events section indicating that HPA couldn't fetch metrics for a resource, check the logs from the Metrics Server pod(s). To do so, run the following command:

$ kubectl logs -n kube-system -l app.kubernetes.io/name=metrics-server \
--all-containers=true --tail=20

With this command, you're getting logs from all the pods that have the label app.kubernetes.io/name=metrics-server, from all their containers, limited to the latest 20 log lines. From these logs, you might find the reason why metrics can't be fetched from a node, for instance, due to a networking issue. Another common reason is that the Metrics Server pod(s) are failing because they're hitting a memory limit, or some other application error is causing the pod(s) to fail. If this is happening, check the resource requests and limits for the metrics-server deployment; you may need to increase its memory allocation or investigate and resolve the underlying errors. Restarting the metrics-server pod can also help in cases where it becomes temporarily unresponsive. That's pretty much how you troubleshoot HPA. Typically, the information you get from the conditions and events fields is enough to understand why scaling is not working as expected. However, there might be times when you simply get an error saying HPA is not able to fetch metrics. If you're getting metrics from Prometheus, make sure the query doesn't have any errors and that you're getting the value in the format you need.
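Two quick checks can also confirm whether the Metrics API is being served at all before you dig into logs; these are standard kubectl commands, not specific to HPA:

$ kubectl get apiservice v1beta1.metrics.k8s.io

$ kubectl top nodes

If the APIService doesn't report Available as True, or kubectl top returns an error, HPA won't get CPU or memory metrics either.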

Troubleshooting VPA

As you already know, VPA doesn't come as a built-in feature in Kubernetes. To use VPA, you need to deploy its three components:

  • VPA updater
  • VPA recommender
  • VPA admission controller

Most of the troubleshooting will be done within the controller and the recommender. So, what do you need to do if VPA is not doing what you're expecting?

  1. Check that the VPA components are up and running
  2. Confirm that VPA's CRDs were deployed
  3. Validate that the Metrics Server component is up and running
  4. Review the events from the VPA objects
  5. Read the logs from VPA's components to find out what's wrong
  6. Verify why VPA can't apply recommendations

The first three items are pretty straightforward, and you already know how to do them after completing the labs from Chapter 3 (a quick check is sketched below). Let's focus on the remaining items, as they require you to understand what to look for and which commands to run.
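For the first two items, a couple of quick checks are usually enough. These assume VPA was installed into the kube-system namespace; adjust the namespace if your installation differs:

$ kubectl get pods -n kube-system | grep vpa

$ kubectl get crd | grep verticalpodautoscaler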

Review the events and status of the VPA objects

To review the events that Kubernetes emits for the VPA objects, run a command similar to the following to describe the VPA object:

$ kubectl describe vpa montecarlo-pi

You might not find anything in the events section, but you can get a better understanding of what's happening from the Conditions and Recommendation fields in the Status section. For instance, you could get an output like the following:

...
Status:
  Conditions:
    Last Transition Time:  2024-11-27T22:03:02Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  montecarlo-pi
...

From this output, you can see when VPA last provided a recommendation and what the recommended values were. However, if this information is still not enough, you can check the logs of VPA's components.

Read the logs from VPA's components

To check the logs from every component, you can run the following commands:

  • VPA recommender logs
$ kubectl logs -n kube-system -l app=vpa-recommender \
--all-containers=true --tail=20
  • VPA updater logs
$ kubectl logs -n kube-system -l app=vpa-updater \
--all-containers=true --tail=20
  • VPA admission controller logs
$ kubectl logs -n kube-system -l app=vpa-admission-controller \
--all-containers=true --tail=20

Notice that, similar to what you did with HPA, you're getting logs from all the pods and containers that carry each component's label, limited to the latest 20 log lines. You might find very useful hints about why VPA is not able to calculate or apply recommended resources for pods.

Verify why VPA can't apply recommendations

Let's talk about resources. Imagine your workload has a memory leak or the utilization is growing because it's processing a large payload. However, you might be wondering why VPA is not resizing the pods. This could be because:

  • You might have VPA maximum allowances or resource limits (at the pod or namespace level) that are not letting VPA apply recommendations.
  • The cluster can't allocate bigger pods as there are no resources available.
  • The pods you're expecting to be adjusted are not part of a replica controller.
  • VPA isn't configured with the Auto or Recreate mode.
  • The metrics data history length is too long or too short. By default, VPA looks 8 days back, but this can be adjusted using the history-length parameter. If you're using Prometheus, you might need to check the query to adjust the period length.
  • As discussed in Chapter 3, new recommendations are applied once pods have been running for long enough, which by default is 12 hours (this avoids too many pod rotations after the first recommendation is applied). You can adjust this behavior using the in-recommendation-bounds-eviction-lifetime-threshold parameter.
These are the most common problems you might encounter with VPA. As long as you keep in mind the role of each component, the vertical autoscaling workflow, and how to find out what's happening by reading logs and events, you should be able to spot problems quickly.
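For example, to quickly confirm the update mode and any resource policy constraints of a VPA object, you can query the fields defined in the VPA CRD. The following sketch reuses the montecarlo-pi example from earlier:

$ kubectl get vpa montecarlo-pi \
-o jsonpath='{.spec.updatePolicy.updateMode}{"\n"}{.spec.resourcePolicy}'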

Troubleshooting KEDA

Similar to VPA, KEDA doesn't come as a built-in feature in Kubernetes. Therefore, when you deploy KEDA to your cluster, you deploy a few pods (components) that you'll use to troubleshoot when scaling isn't working. Scaling might not be working mainly because KEDA can't collect metrics from the source, but there might be other reasons, which we'll explore in a moment. As a quick reminder, as you learned in Chapter 4, there are three KEDA components:

  • KEDA's operator
  • KEDA's metric server
  • KEDA's admission webhook

You'll likely spend most of your time reading the KEDA operator's logs, as the operator is KEDA's brain. So, what do you need to do if KEDA is not doing what you're expecting?

  1. Check that the KEDA components are up and running
  2. Confirm that KEDA's CRDs were deployed
  3. Validate that the Metrics Server is working if using CPU and memory scalers
  4. Validate that you have properly authenticated to the scaler API
  5. Confirm that there aren't any network policies denying traffic to KEDA
  6. Review the events from ScaledObject or ScaledJob objects
  7. Review the events from the HPA objects (if scaling is active)
  8. Read the logs from KEDA's components to find out what's wrong
The first couple of checks are quick one-liners (see the sketch below); after that, let's see how you can check events and review logs from KEDA's components.
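As with VPA, those first checks can be as simple as this (assuming KEDA was installed into the keda namespace, as in Chapter 4):

$ kubectl get pods -n keda
$ kubectl get crd | grep keda.sh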

Review the events from the ScaledObject or ScaledJob objects

After confirming that KEDA's components are up and running and that the CRDs are deployed in the cluster, the next step is to review the events from the KEDA scaling objects you've created. Suppose you deploy a ScaledObject but not the target workload it's supposed to scale. Deploy a sample ScaledObject by running the following command:

$ kubectl apply -f troubleshooting/scaledobject.yaml
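For reference, the ScaledObject used here is similar in spirit to the following sketch (the exact manifest in the repository may differ; the RabbitMQ host value is only a placeholder):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer
spec:
  scaleTargetRef:
    name: queue-consumer   # Deployment that hasn't been created yet
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: autoscaling
        mode: QueueLength
        value: "5"
        host: amqp://guest:guest@rabbitmq.default.svc:5672/   # placeholder connection string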

To review the events from the ScaledObject, run the following command:

$ kubectl describe scaledobject queue-consumer

You should get an output similar to the following within the Events section:

...
  Warning  ScaledObjectCheckFailed  1s (x12 over 11s)  keda-operator  
  Target resource doesn't exist
  Warning  ScaledObjectCheckFailed  1s (x12 over 11s)  keda-operator  
  ScaledObject doesn't have correct scaleTargetRef specification

The event shown is ScaledObjectCheckFailed, which appears when the validation check of the ScaledObject fails. Notice how KEDA's operator tried to create the HPA object, but the target workload to scale doesn't exist, resulting in the Target resource doesn't exist error. KEDA emits many different events, but the ones you might want to keep an eye on are the Warning event types. A few other examples are KEDAScalerFailed, emitted when a scaler fails to create or verify its event source, and KEDAScaleTargetDeactivationFailed, emitted when KEDA can't scale the target to 0. For the full list of events, check KEDA's docs site (https://keda.sh/docs/latest/reference/events/).

Note

If the KEDA link is not working by the time you're reading the book, you can try searching in KEDA's doc site for the list of Kubernetes events that KEDA emits.

Read the logs from KEDA's components

Reviewing events is a good start, but as you can see, they don't give you many details about what's wrong. Therefore, you need to read KEDA's component logs to get a more precise idea of the error. But before you take a look at that, let me give you the commands you need to keep handy. To check the logs from every KEDA component, you can run the following commands:

  • For KEDA operator logs
$ kubectl logs -n keda -l app=keda-operator --all-containers=true \
--tail=20
  • For KEDA metrics server logs
$ kubectl logs -n keda -l app=keda-operator-metrics-apiserver \
--all-containers=true --tail=20
  • For KEDA admission webhooks logs
$ kubectl logs -n keda -l app=keda-admission-webhooks \
--all-containers=true --tail=20

If you run the command to read the KEDA's operator logs, you should see an error describing in more detail what the problem is (which you already know is because we didn't create the target workload):

...ERROR Reconciler error{"controller": "scaledobject", \
"controllerGroup": "keda.sh", "controllerKind": "ScaledObject", \
"ScaledObject": {"name":"queue-consumer","namespace":"default"}, \
"namespace": "default", "name": "queue-consumer", \
"reconcileID": "ee2eb037-62eb-4386-8322-ae5378400ca0", \
"error": "deployments.apps \"queue-consumer\" not found"}

Notice how the error logs are more precise, saying that there's no Deployment object called queue-consumer. If you go back to Chapter 4 and complete the Hands-On: Scaling from/to Zero lab, you'll see how the error events and logs go away as you create the missing resources, and KEDA is then able to scale the application without problems.

Centralize logs and events

Before we move to the next section, I want to make it very clear that the commands you explored above come in handy when you need a quick overview of what the problem could be. However, reading events and logs in the command line doesn't scale once you have several scaling rules targeting different workloads. At some point, there will be so much noise that it will be hard to find out what's happening with the specific workload you're struggling to scale.

Therefore, the recommendation would be to use a centralized tool to store and review all the events and logs from the cluster, and have a single pane of glass to troubleshoot easily. Most likely, you already have a solution for this. But if not, tools like Grafana Loki (in combination with Grafana and Alloy) or Elasticsearch (in combination with Kibana and Logstash) are a few (of many) examples of tools you can use to centralize logs and events from multiple Kubernetes clusters.

So far, you've learned how to identify and resolve common issues in KEDA by inspecting events, checking component logs, and validating deployment configurations. These are very important techniques that will help you troubleshoot and fix problems in a production environment. Next, let's go one step further and look at how to monitor KEDA itself using Prometheus and Grafana to gain visibility into the health and behavior of KEDA's internal components.

Monitoring KEDA

Another important aspect of operating KEDA is monitoring, which effectively means observing metrics that define the status or health of a system. In this case, we're interested in observing metrics from KEDA's components. In Chapter 3, you already used Prometheus and Grafana to monitor the resource utilization of a workload with the purpose of right-sizing its CPU and memory requests. To monitor KEDA's components, we're going to continue using these tools and take advantage of what KEDA's components already provide.

KEDA's components already expose a number of metrics for Prometheus to scrape. As most of the work in KEDA is done in the operator, you'll find that most of the metrics come from this component. However, KEDA also exposes metrics for KEDA's metrics server and KEDA's admission webhook. All these metrics can be scraped by Prometheus on port 8080 at the /metrics endpoint.

KEDA will update each ScaledObject or ScaledJob object with the list of metrics that its triggers will use. If you'd like to get this information, you need to run a command like the following:

$ kubectl get scaledobject queue-consumer \
-o jsonpath={.status.externalMetricNames}

This will return the metrics the scaling object will use, like this:

["s0-rabbitmq-autoscaling"]

The metrics you get here depend on the scaler you use, and as you continue to deploy more scaling rules using different scalers, you'll start to see a few additional metrics available. For example, the following table shows a few metrics you could use to understand the existing status of your ScaledObject rules:

Name                         Description
keda_scaler_active           Indicates whether the scaler is active (1) or inactive (0).
keda_scaled_object_paused    Indicates whether a ScaledObject is paused (1) or unpaused (0).
keda_scaler_metrics_value    The current value of each scaler's metric, which HPA uses when computing the target (average) number of replicas.

Table 6.1 – Subset of basic metrics emitted by KEDA's operator

This subset of metrics will help you understand why KEDA might or might not be scaling your workloads. But if you'd like to understand why KEDA might be having issues scaling your workloads, Table 6.2 shows a subset of metrics you could use:

Name                                  Description
keda_scaler_metrics_latency_seconds   Latency of retrieving the current metric from each scaler. Useful to know if there are connectivity issues with the scaler.
keda_scaler_detail_errors_total       Number of errors for each scaler.
keda_scaled_object_errors_total       Number of errors for each ScaledObject.
keda_scaled_job_errors_total          Number of errors for each ScaledJob.
keda_trigger_registered_total         Number of triggers per trigger (scaler) type.

Table 6.2 – Subset of metrics used for troubleshooting KEDA

Table 6.2 includes a subset of metrics you can use not only to understand how big a problem is, but also to configure alarms that notify you when something goes beyond a certain threshold. For example, it might not be harmful to get a few errors from time to time, but if the error rate keeps increasing, you might want to get paged and fix any problems with your KEDA scaling rules as soon as possible. As I said before, these are not the only metrics KEDA emits. The full list of metrics is available on the KEDA docs site (https://keda.sh/docs/latest/operate/metrics-server/). Make sure you check the Prometheus integration to learn more and see other metrics we didn't include in this book but which might be important for your workload(s).

Note

If the KEDA link is not working by the time you're reading the book, you can try searching in KEDA's doc site for the list of metrics that KEDA exposes.
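For example, assuming you're running the kube-prometheus-stack used in earlier chapters, a minimal PrometheusRule sketch could alert you when scaler errors keep increasing. The threshold is illustrative, and the metric label names (such as scaledObject) can vary between KEDA versions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keda-scaler-errors
  namespace: monitoring
  labels:
    release: prometheus        # must match your Prometheus ruleSelector labels
spec:
  groups:
    - name: keda
      rules:
        - alert: KedaScalerErrorsIncreasing
          # error rate per ScaledObject over the last 5 minutes
          expr: sum(rate(keda_scaler_detail_errors_total[5m])) by (scaledObject) > 0.2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "KEDA scaler errors are increasing for {{ $labels.scaledObject }}"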

Similar to what I said before about events and logs, you can't operate KEDA with kubectl only, even if it provides the information you need. Luckily for us, a while ago KEDA released a Grafana dashboard that provides a single pane of glass for the metrics KEDA exposes. So, let's go back to the terminal and deploy the Grafana dashboard.

Hands-on lab: Deploying KEDA's Grafana dashboard

Before you deploy the Grafana dashboard, you need to enable KEDA's components to expose the metrics we explored above at the /metrics endpoint. By default, the Prometheus integration is disabled, and as we deployed KEDA using Helm, let's update its configuration using Helm as well. Open a new terminal and change the directory to the chapter06 folder of the GitHub repository. You'll find a monitoring/keda-values.yaml file with the additional settings you need to enable metrics exposure. The new Helm settings should look similar to this:

prometheus:
  operator:
    enabled: true
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus
...  # same settings for metricServer and webhooks

To update KEDA's Helm release, run the following command:

$ helm upgrade keda kedacore/keda --namespace keda -f \
monitoring/keda-values.yaml

KEDA's pods should be recreated with the new settings, and you should also see that new ServiceMonitor objects have been created to let Prometheus know that it should scrape the /metrics endpoint of each KEDA component. You can confirm this by running the following command:

$ kubectl get servicemonitor -n keda

You should see three Prometheus service monitors, one for each KEDA component. Now let's deploy one of the labs you did in Chapter 4 to generate some metrics, so that you can see some data once you deploy KEDA's dashboard. In your terminal, don't change directories; simply run the following command to deploy the application and all its dependencies to configure autoscaling:

$ kubectl apply -f ../chapter04/queue-deployment/consumer.yaml

Then, run the following command to generate a small load test:

$ kubectl apply -f ../chapter04/queue-deployment/producer.yaml

While you wait for the load test to finish, let's deploy the Grafana dashboard. To do so, open the Grafana UI by exposing the service locally by running the following command:

$ kubectl port-forward service/prometheus-grafana 3000:80 -n monitoring

In a browser, open the http://localhost:3000/ URL. For the username, type admin, and for the password, type prom-operator. If these credentials don't work, go back to Chapter 2 to retrieve the username and password from the Prometheus stack secrets. Next, import the Grafana dashboard by following these steps:

  1. Click Dashboards from the left menu, or open the /dashboards endpoint
  2. Click on the New button, and select Import in the drop-down menu
  3. Click on the Upload a dashboard JSON file section, and pick the file located at chapter06/monitoring/keda-dashboard.json
  4. Click on the Load button.

You should now see the KEDA's dashboard, like the one in Figure 6.1 as follows:

Figure 6.1 – KEDA's dashboard showing scaler and ScaledObject errors

The first part is the list of filters for the dashboard. You can filter by namespace, scaledObject, scaler, or metric. If you don't see any data, try switching the datasource field. Notice that the first three visualizations show the error rates from the scaler(s) and scaled objects. This is a good way to know when things aren't going well in KEDA. Next, you'll see the current scaler metric values to understand which values KEDA is using to scale. Moreover, you'll see a graph showing the maximum number of replicas and how KEDA has been adding and removing replicas while the load test was running. In this case, I ran the load test twice, which is why you see two changes in Figure 6.2:

Figure 6.2 – KEDA's dashboard showing number of replicas in a sample workload

If you continue scrolling down, you will see other visualizations that help you understand changes in the replica numbers for the workloads you're monitoring. Additionally, you'll see a graph that represents how close you're getting to the maximum number of replicas. The version of the dashboard we show in this book might differ from what you get when doing this lab. The reason is simple: KEDA occasionally updates the dashboard, and we may have had a different version of the JSON manifest at the time of writing this book. If you'd like to get the latest version of this dashboard, make sure you check KEDA's official repository.

In this lab, you set up KEDA's Grafana dashboard to visualize key metrics and error indicators across scalers and scaling objects. This provides insights into how your workloads are behaving and helps you quickly detect issues when autoscaling isn't working as expected. Now, let's look at how to keep your installation up to date. In the next section, we'll walk through different strategies for upgrading KEDA.

Upgrading KEDA

Another important topic related to operating KEDA is upgrades. We recommend always using a recent version of KEDA to make the most of it, especially as KEDA will continue adding new features, fixing issues, and adding native scalers. Depending on how you decide to deploy KEDA's components to production, the upgrade process might differ from the one you'll see in this book.

In Chapter 4, we used Helm to install KEDA in a very simple way. However, I've seen many organizations switching to a GitOps model using Argo CD and deploying KEDA by using YAML manifests. GitOps and Argo CD allow you to manage your Kubernetes applications using Git. Instead of manually applying changes, you just update a Git repository, and Argo CD makes sure your cluster matches what's in the code. Many companies like this model for applications like KEDA because it's easier to keep track of changes and fix things if they break.

Note

If you decide to go with the approach described in this section, and you've installed KEDA using Helm, make sure you uninstall KEDA with Helm first by running the following command: helm uninstall keda -n keda. Keep in mind that when you uninstall KEDA, all the ScaledObjects and/or ScaledJobs you have will be removed. Then, install KEDA using the commands discussed later, and use kubectl create instead of kubectl replace.

When you install or upgrade KEDA, it's important that you are specific about the version you want to install, so you can control exactly which version you get when upgrading KEDA. It's also important that you read the release notes to understand the impact of using a newer version. CRDs might change, and you'll need to update any existing KEDA objects. Additionally, the release notes are where you'll find recommendations on how to handle any breaking changes. Try to keep KEDA up to date and avoid falling too far behind the latest version. Regular updates help ensure you benefit from the latest features and security improvements while minimizing the complexity of future upgrades. To upgrade KEDA using only kubectl, you can run the following commands:

$ export KEDA_VERSION="2.16.0"
$ curl -L -o keda.yaml \
https://github.com/kedacore/keda/releases/download/v$KEDA_VERSION/keda-$KEDA_VERSION.yaml
$ kubectl replace -f keda.yaml

Note

If you get an error like "Error from server (NotFound): error when replacing "keda.yaml": customresourcedefinitions.apiextensions.k8s.io "XXX" not found", simply run kubectl create -f keda.yaml and ignore the errors saying that an object already exists. When you use tools like Argo CD, they have internal validations that avoid this problem of choosing between kubectl replace and kubectl create depending on the status of the objects.

Notice that we specified the KEDA version we want to install. Next time you need to upgrade, you'll use the specific version number you want to deploy. Then, it's a good practice to download the release YAML manifest, which includes every component you need. Finally, you simply use kubectl to replace the existing Kubernetes objects with the most recent ones. The benefit of downloading the YAML manifest is that you can version it in Git and keep track of the changes you'll apply next. This is key if you're going to use GitOps.

Regardless of the approach you use to upgrade KEDA, make sure you practice it in a pre-production environment, and be aware that scaling actions might be paused for a very short period of time. Upgrading KEDA regularly ensures you benefit from the latest features, improvements, and bug fixes, but more importantly, it allows you to evolve the autoscaling capabilities of your workloads.

Now, let's take a step back and reflect on the broader picture. In the next and final section of this chapter, we'll go over a set of best practices for workload efficiency. These principles will carry forward as we shift focus from application-level scaling to optimizing cluster-wide efficiency in the chapters ahead.

Best practices for workload efficiency

To close this chapter and the whole section about workload autoscaling, let me list the most important practices we've explored so far that will help you achieve a good efficiency score for your workloads. I'm not attempting to summarize everything you've read in only a few lines. The main reason I want to highlight these practices is that they'll be crucial for the following chapters, where we'll explore the efficiency of nodes in a Kubernetes cluster.

First, the most important task is to understand what influences the performance of your applications. Throughout the previous chapters, we didn't talk only about CPU or memory utilization. Typically, there are other aspects like latency, number of requests, error rate, and dozens of other events, such as messages in a queue or custom business metrics (for example, using scaling modifiers in KEDA). As applications continue to evolve, what affects performance today might not be the same tomorrow. So, it's key that you don't approach this as a one-time task, but as something you do continuously. You need to constantly test your applications to understand them better (as we did a few times in the previous chapters).

There's nothing wrong with starting with only CPU or memory utilization as the metric driving your workload autoscaling. In fact, these are the first set of attributes the kube-scheduler uses to schedule your pods. So, start by understanding the CPU and memory requirements of your applications. Once you know how much CPU and memory your applications need, it's imperative that you configure proper requests for all containers in a pod. Not only will you get a deterministic result when pods are scheduled, but the requests will also be used as the basis for determining how much capacity the cluster needs. We'll explore this topic in more depth in the upcoming chapters, and you'll understand why I kept pushing you to have a proper configuration in this regard.

If you're new to autoscaling, consider starting with KEDA. You can begin with CPU and memory, and then use other attributes as we've explored so far. And with KEDA, it's possible to scale to zero with almost all the scalers. Moreover, consider scaling with multiple metrics. There are times when the pod replicas don't all receive the same amount of traffic, and even if the average CPU utilization looks good, your users might be experiencing high latencies or application errors. Hence the importance of not treating autoscaling as a one-time process.

Depending on the application's needs, there will be times when you need to scale faster or slower. Take advantage of the scaleUp and scaleDown behaviors you can configure in HPA (see the sketch at the end of this section), either to control the rate (or speed) of replica changes or to prevent replica fluctuation (flapping). Some applications can, in fact, have problems if you scale too fast, and you'd like to spin up Kubernetes nodes only when they're really needed.

Always keep these practices in mind as a minimum set of actions you need to take. As you progress in your workload efficiency journey, you might discover another set of practices you'll consider "best". This will be especially true once you finish the next four chapters about autoscaling the nodes you have in the cluster.
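To make the scaleUp/scaleDown point concrete, here's a hedged sketch of HPA scaling behavior. The names and numbers are illustrative (it reuses the montecarlo-pi Deployment from the labs): it scales up quickly but removes at most one pod per minute after a five-minute stabilization window.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: montecarlo-pi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: montecarlo-pi
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react to spikes immediately
      policies:
        - type: Percent
          value: 100                   # at most double the replica count...
          periodSeconds: 60            # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down (prevents flapping)
      policies:
        - type: Pods
          value: 1                     # remove at most one pod...
          periodSeconds: 60            # ...per minute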

Summary

In this chapter, we delved into the operational aspects of workload autoscaling in Kubernetes, focusing on troubleshooting and monitoring techniques for HPA, VPA, and KEDA. We explored common issues that can arise when implementing autoscaling and provided practical approaches to diagnose and resolve these problems.

We began by examining how to troubleshoot HPA, emphasizing the importance of reviewing conditions and events. Similarly, for VPA, we discussed the process of verifying component status, reviewing events, and analyzing logs from all the VPA components. Then we focused on KEDA. We covered how to review events from ScaledObject and ScaledJob resources and how to interpret logs from KEDA's components. We also stressed the importance of centralizing logs and events for more effective troubleshooting in larger environments.

Beyond troubleshooting, we explored monitoring strategies for KEDA, introducing key metrics that can be used to understand the health and performance of KEDA's components and autoscaling rules. We provided a hands-on guide to deploying a Grafana dashboard for visualizing KEDA metrics, offering a broader view of autoscaling activities when using KEDA.

Another operational aspect we covered was upgrading KEDA. We described some of the best practices and considerations for maintaining an up-to-date KEDA deployment. We discussed different approaches to upgrading, including the use of kubectl and potential integration with GitOps workflows.

Finally, we concluded with a set of best practices for achieving workload efficiency. These practices emphasized the importance of understanding application performance drivers, properly configuring resource requests, and leveraging KEDA's capabilities for advanced autoscaling scenarios. We underscored the need for continuous monitoring and adjustment of autoscaling configurations to ensure optimal performance and resource utilization. Now, you're ready for the next adventure: the journey to have efficient nodes.

7 Data Plane Autoscaling Overview

We're halfway through the journey of building automatically efficient Kubernetes clusters for your workloads. In Chapter 1, we provided a brief introduction to autoscaling in the Kubernetes ecosystem. In short, autoscaling in Kubernetes is done in two ways: the first is scaling your application pods, and the second is scaling the underlying infrastructure needed to run your application pods. We've spent quite some time diving into the details of how to scale your application pods efficiently. Now, for the next four chapters, we'll focus on how to automatically scale the compute capacity of your Kubernetes clusters.

First, we'll cover what it means to scale the data plane of a Kubernetes cluster. You'll learn why you need to take action and dive into the details to gain a complete view of autoscaling in Kubernetes. Even though the data plane autoscaling concept is simple, you will then learn about the different options you have to autoscale a data plane with projects like Cluster Autoscaler (CAS) and Karpenter. These are currently the two main projects that take responsibility for adding compute capacity when your applications need it. We'll then briefly explore other relevant scaler projects, such as Descheduler and Proportional Autoscaler. You'll learn about the problems all of these projects solve so that you know your options. After finishing this chapter, though, we'll focus solely on Karpenter.

This chapter will start by diving deep into CAS, the first project of its kind and a long-runner in the ecosystem. If you're an existing user of CAS, you might learn new things about how to optimize it for greater efficiency. Moreover, we'll use this section to develop a solid understanding of why Karpenter was created and why its approach differs from CAS. Learning how CAS works is a foundational step toward understanding why Karpenter was created, and it will help you implement autoscaling best practices for CAS while you decide whether to transition to Karpenter.

In this chapter, we are going to cover the following main headings:

  • What is data plane autoscaling?
  • Data Plane Autoscalers
  • Cluster Autoscaler on AWS
  • Relevant Autoscalers

Let's get into it and start warming up your console, as you're going to learn by doing.

Technical requirements

For this chapter, you'll continue using the Kubernetes cluster you created in Chapter 1. However, from this chapter onwards, you won't be able to use the local Kubernetes cluster you've been using (unless you've opted for the cloud option). Also, keep in mind that for this book, we're using AWS EKS. So, if you haven't created the AWS EKS cluster yet, go back to Chapter 1 to learn how to do it. If you've been using this option already, just make sure you bring the cluster up for every hands-on lab in this chapter. You don't need to install any additional tools, as most of the commands will be run using kubectl and helm. You can find the YAML manifests for all resources you're going to create in this chapter in the chapter07 folder of the book's GitHub repository, which you already cloned in Chapter 1.

What is data plane autoscaling?

Before we start exploring data plane autoscaling in depth, let's recap how autoscaling works in Kubernetes and what the scaling flow is. As you already know, scaling in Kubernetes is divided into two categories. Look at Figure 7.1. The first category illustrated is about scaling application workloads. As demand grows (in this case, traffic to the application increased five times), the application needs more pod replicas to fulfill that demand.

Figure 7.1 – Pod replicas increase based on demand

The second category, shown in Figure 7.2, is about the Kubernetes worker nodes, better known nowadays as the data plane, which scales in proportion to the application workloads. When more pods are needed and the existing capacity can't host them, those pods remain unscheduled, so more nodes are needed, and controllers like CAS or Karpenter launch new nodes. And when the application demand shrinks, requiring a smaller number of pods, typically fewer nodes are needed.

Figure 7.2 – Nodes are launched based on unscheduled pods

From the two previous figures, and based on what you've learned in the previous chapters, you could say that data plane autoscaling does for nodes what workload autoscaling does for pods, with two goals in mind:

  • Provision the necessary nodes to the cluster so all pods can be scheduled
  • Remove underutilized nodes to reduce waste

Let's be more precise with a simple example to understand the implementation details. Look at Figure 7.3 representing an application that only needs 6 pods running to support the existing traffic (1). At a certain moment in time, the traffic increases and the existing 6 pods can't handle the load, causing your application to degrade (2). To prevent that degradation from happening, you've configured KEDA to add more pod replicas when the throughput increases. KEDA will update the relevant metrics, and then HPA will update the replica number, let's say to 9 replicas. Therefore, you need 3 extra pods. Kubernetes will try to schedule 3 more pods in the cluster. But as you can see in Figure 7.3, the cluster capacity is very limited (3).

Figure 7.3 – Application workload scaling based on throughput using KEDA

At this moment, as per Figure 7.3, the cluster has no capacity to run 3 additional pods. Therefore, your application will end up having 3 unscheduled pods, which will stay in the Pending state until there's enough capacity in the cluster. KEDA and HPA help you determine how many pod replicas are needed. However, how many nodes would you need to add? Data plane autoscalers need to aggregate what the unscheduled pods are requesting and, through binpacking, determine the number of nodes needed. Binpacking in this context refers to the efficient allocation of pods to nodes, maximizing resource utilization by fitting as many pods as possible onto each node without overloading it, similar to packing items into containers optimally.

Kubernetes doesn't have a native feature to automatically create new nodes, similar to what HPA does when more pods are needed. Why? Well, there are different tools and providers to create a Kubernetes cluster, and the implementation details of creating nodes vary a lot. For instance, you'll need to use different tools to create a cluster on-premises, on AWS, Azure, GCP, or any other infrastructure provider. Each of these providers has its own set of APIs, services, and configuration properties. Moreover, each provider knows best how to use its own APIs.

Therefore, to address the challenge of automatically adding the capacity you need to the cluster, you need to use open-source projects like CAS or Karpenter, which may or may not have support for the infrastructure provider you're using. But that's a different discussion. In the meantime, let's start exploring these two projects. We'll dive a bit into Cluster Autoscaler in this chapter, and Karpenter in the following chapters. We're starting with CAS because it will help you understand what led to Karpenter, and it will help you implement autoscaling best practices in CAS so you can start having efficient clusters right away, in case the transition to Karpenter takes some time.

Data Plane Autoscalers

Data plane autoscalers operate independently of application-level scaling mechanisms like HPA or KEDA, focusing solely on cluster-level resource management. This approach helps maintain a balance between having enough capacity for applications to run efficiently and avoiding over-provisioning of resources, regardless of whether the cluster is running in the cloud or on-premises. Moreover, you might have different applications, with different needs, running on the same node. Scaling nodes based on metrics like CPU utilization, or even on other attributes like latency and throughput, might create conflicts or race conditions with the scaling rules you've configured for the workloads with KEDA or HPA, especially if both are using the same metrics or events. Therefore, the projects we'll explore next only react when there are unscheduled pods in the cluster. Let me re-emphasize what I just said, as it's very important.

Data plane autoscalers are triggered by unscheduled pods only, not by any other metric or event like CPU utilization or a message queue's depth. This is where most newcomers looking to implement Kubernetes autoscaling get confused. Projects like CAS and Karpenter don't add capacity based on resource utilization; they only add capacity when there are unscheduled pods. Therefore, workload autoscalers and data plane autoscalers complement each other, and you need both to fully implement Kubernetes autoscaling.

Another very common misconception is about scheduling. Data plane autoscalers are solely responsible for providing capacity to the cluster, not for scheduling pods. The kube-scheduler is still responsible for scheduling pods. Let's proceed with the next section to examine the approach CAS and Karpenter take to scale nodes.

Cluster Autoscaler

Cluster Autoscaler (CAS) was one of the first projects to address the challenge of automatically scaling the data plane of a Kubernetes cluster, and for a long time, it was the only one. The primary function of CAS is to ensure that there are always sufficient nodes available in the cluster to run all scheduled pods. When there are pods that cannot be scheduled due to insufficient cluster capacity, CAS will automatically provision new nodes to accommodate these unscheduled pods. Conversely, when nodes are empty, CAS removes these unnecessary nodes for cluster efficiency.

In cloud environments, CAS integrates with cloud provider APIs to manage node groups or auto-scaling groups. For example, in AWS, it can work with EC2 Auto Scaling Groups to add or remove instances as needed. In GCP, it can manage node pools, and in Azure, it works with Virtual Machine Scale Sets. This integration allows CAS to leverage cloud-native scaling capabilities, making your Kubernetes cluster efficient and cost-effective in cloud deployments.

For on-premises environments, CAS can also be configured to work with various infrastructure provisioning tools. However, the process is often more complex and may require additional setup and integration work. On-premises setups might involve integrating CAS with tools like OpenStack, VMware vSphere, or bare-metal provisioning systems to manage the lifecycle of physical or virtual machines.

CAS works with various cloud providers and on-premises setups, making it a valuable tool for managing Kubernetes cluster capacity across different environments, though the ease of implementation can vary significantly between cloud and on-premises deployments.

Karpenter

For a very long time, since April 2017, CAS was the only project available for autoscaling the data plane in Kubernetes. However, the landscape changed with the introduction of Karpenter, a Kubernetes cluster autoscaler developed by AWS as an open-source project. Karpenter was announced as ready for production in November 2021, and its v1 version was released in August 2024.

Karpenter takes a dynamic and fine-grained approach to node provisioning and manages nodes directly, without any abstraction layer like Auto Scaling groups in AWS. Like CAS, Karpenter provisions new nodes in response to unschedulable pods, and it attempts to optimize node selection to maximize resource utilization based on pods' resource needs, considering scheduling requirements like affinities or tolerations. When a node is not needed, Karpenter can remove it or replace it with a smaller one; this process is also known as consolidation. We'll explore all the details about how this feature works in the next chapter.

Additionally, it's worth mentioning that Karpenter works in tandem with the kube-scheduler. This means that Karpenter won't schedule pods in the cluster; it will only provision nodes. It's very important that you keep this in mind to make the most of Karpenter. When we look at the implementation details in the next chapter, this will be clearer, and you'll understand why I'm emphasizing this now.

Karpenter uses Kubernetes custom resources for configuration, making it easier to manage and integrate with existing Kubernetes workflows. While initially designed for AWS, Karpenter's design principles aim to make it adaptable to other Kubernetes environments in the future. At the time of writing this book, besides the EKS provider, Microsoft has released a Karpenter provider for Azure Kubernetes Service (AKS). In the next chapter, we'll dive deeper into how the Karpenter project is structured. In the meantime, let's explore how Cluster Autoscaler works in AWS.

Cluster Autoscaler on AWS

Note

Before we begin this section, as mentioned earlier, CAS is supported on different platforms and cloud providers. We've decided to focus on AWS in this book for two reasons: first, to maintain consistency across all hands-on labs, and second, to help you better understand why and how Karpenter's implementation differs from CAS. Moreover, if you're not currently using CAS and are planning to start with Karpenter, you may skip this section and move to the Relevant Autoscalers or Summary section in this chapter. However, if you're currently using CAS and not planning to migrate to Karpenter in the near future, this section will help you understand which best practices need to be in place for an efficient data plane.

To use CAS in AWS EKS, you typically deploy it as a Deployment in the cluster. Once CAS is running, it ensures that the Auto Scaling groups dynamically adjust cluster capacity based on workload demands.

Look at Figure 7.4. It's hard to represent how CAS works with a static image, but as we mentioned earlier when introducing the data plane autoscalers concept, CAS continuously monitors the cluster's state and reacts when there are unscheduled pods. The purpose of CAS is to ensure that there is enough compute capacity in the cluster to run your workloads. When the kube-scheduler notices that there are pods that cannot be allocated due to insufficient resources, CAS initiates the process of expanding the relevant Auto Scaling groups (adding +N to the desired capacity, without changing the minimum and maximum values). Conversely, as pods are removed and some nodes remain underutilized for an extended period, CAS orchestrates their removal from the cluster.

Figure 7.4 – CAS adjusts the compute capacity of relevant Auto Scaling groups

As you're going to deploy CAS as a Deployment within the cluster, you need to ensure CAS has the necessary IAM permissions to describe and adjust the Auto Scaling groups on your behalf.

In terms of how CAS manages the nodes, there's an Auto Discovery mode, which is the preferred method, allowing CAS to automatically identify and manage Auto Scaling groups based on predefined tags. This mode also enables CAS to consider Kubernetes scheduling constraints, including node selectors, taints, and tolerations. With all this information from the workloads and nodes, CAS can make informed decisions about scaling, especially when scaling from zero nodes, ensuring that new nodes are compatible with the scheduling requirements of unscheduled pods.

Another important aspect we wanted to cover is how you configure node groups in EKS. There are two ways of doing it:

  • Managed node groups: With this method, EKS abstracts the complexity of manually provisioning the Auto Scaling group, and follows the best practices on how to configure them, along with additional features like node version upgrades and graceful termination.
  • Self-managed node groups: With this method, you're responsible for provisioning the Auto Scaling group with the proper scripts so nodes register to the cluster when they're created.

Both options are an abstraction layer in EKS on top of an Auto Scaling group. However, the recommended approach for node groups is to use managed node groups.

At first glance, that's how CAS works on AWS. There are other implementation details about how CAS works, but they're out of scope for this book. We will, however, explore a few best practices for CAS, simply because they'll become relevant in the next chapter. So for now, let's see CAS in action.
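One quick reference before we jump into the lab: in the Auto Discovery mode mentioned earlier, CAS finds the Auto Scaling groups it can manage through tags, which you pass to the controller as command-line flags. A minimal sketch might look like this (the cluster name my-cluster is a placeholder, and the expander choice is just an example):

--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
--balance-similar-node-groups=true
--expander=least-waste

If you install CAS through the Terraform setup used in this book, these flags are typically configured for you.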

Hands-on lab: Cluster Autoscaler on AWS

Note

This hands-on lab is going to use Terraform to create or update (if you created it already in Chapter 1) an EKS cluster. We recommend that you use this method as it will help you to get started quickly and focus on learning about how CAS works.

The setup we've been using so far for the Kubernetes cluster has been static. This means that if you deployed too many pods in the cluster, you'd see some pods stay unscheduled, as there was no capacity left in the cluster. If you've been using the EKS cluster from Chapter 1, you've been working with a managed node group of only two nodes. During this lab, you'll deploy a dozen pod replicas to force CAS to react when there are unscheduled pods and see how new nodes are added automatically.

Deploy CAS

Go back to your terminal and make sure you're in the /chapter07 folder of the book's GitHub repository. In here, you'll find a terraform folder that includes the scripts you'll use to create an EKS cluster with CAS and other add-ons. If this is the first time you're creating the EKS cluster, you don't need to make any changes to these scripts. However, if you already created the EKS cluster following the instructions from Chapter 1, then you simply need to remove the # character on line 147 of the /chapter01/terraform/main.tf file so that Terraform deploys CAS. This will not only deploy the CAS controller, but also take care of configuring it with the proper IAM permissions. The Terraform template should look like this:

...
  enable_cluster_autoscaler = true
...

To either create the EKS cluster or apply the changes needed to install CAS, run the following command:

$ sh bootstrap.sh

If you're creating the EKS cluster, wait around 15 minutes while the control plane is created and becomes ready to register worker nodes; if you're simply updating it with CAS, wait around 5 minutes. Once the previous command has completed, run the following command to confirm that CAS is up and running:

$ kubectl get pods -n kube-system -l app.kubernetes.io/instance=cluster-autoscaler

The CAS pod should be in a Running status, and you should be able to proceed. As mentioned before, the EKS cluster already has a static managed node group with only two nodes. We're going to force CAS to add more capacity by deploying a workload with several pods.

Remove VPA rules that could resize the sample application

If you didn't create the cluster from scratch in this chapter, you might still have some VPA rules configured from previous hands-on labs. To keep this lab consistent, let's remove any VPA rule that could resize the sample application. You could delete only the rule that targets it, but if you want to remove all VPA rules, simply run this command:

$ kubectl delete vpa --all

Nothing else should be resizing the pods, but we'll confirm it later on anyway.

Deploy a sample application to see how CAS adds new nodes

Let's continue using the sample application we've been using in the previous chapters. It doesn't have anything special, but note that its requests section looks like this:

...
        resources:
          requests:
            cpu: 900m
            memory: 512Mi
...

Notice that it follows the recommendation of intentionally requesting a specific amount of CPU and memory. This is the information CAS will use to decide how many nodes need to be added when there are unscheduled pods, which is something we're going to force in a moment. For now, let's deploy the application:

$ kubectl apply -f montecarlopi.yaml

You should see only one montecarlo-pi-* pod running. Let's make sure that the pod's requests weren't modified. Run this command to verify:

$ kubectl describe pods | grep -i requests -A 2

You should see that the CPU requests remain at 900m. Whereas previously you only added a few replicas, let's be a bit more aggressive this time and scale the deployment to 12 replicas. This will require CAS to add a few additional nodes. We're going to keep it simple and scale the application manually, so run the following command:

$ kubectl scale deployment montecarlo-pi --replicas=12

As there was enough capacity in the cluster to host most of the pods, some pods should be in a Running status, but others will be in a Pending status. Wait around 30 seconds, and run the following command to see how many unscheduled pods there are:

$ kubectl get pods --field-selector=status.phase=Pending

You should see a few unscheduled pods, causing CAS to add new capacity to the cluster by modifying the desired capacity of the Auto Scaling group. Wait around two to three minutes, and you should see all replicas running and new nodes added to the cluster. To confirm it, run this command:

$ kubectl get nodes

The number of nodes you see will depend on how many pods you had running before manually scaling out the sample application. If you didn't have any other applications running, you might see only one extra node.

Remove the sample application to see how CAS removes unnecessary nodes

The idea of using CAS is to not only make it easier to add more capacity when it's needed, but also to remove capacity when it's not needed. So, let's remove the sample application by running the following command:

$ kubectl delete -f montecarlopi.yaml

By default, CAS will wait 10 minutes before a node is removed. So, give it time, and while you wait, you could review the CAS logs to confirm that empty nodes will be removed after 10 minutes. To review CAS logs, run this command:

$ kubectl logs -n kube-system -l app.kubernetes.io/instance=cluster-autoscaler --all-containers=true -f --tail=20

Once you see in the logs that the nodes were removed, you could run the previous command you ran to get the list of the nodes to confirm it. Don't proceed to the next step until the extra nodes are removed.

Uninstall CAS

In the following chapters, we won't be using CAS anymore, so let's remove it now. If you used the Terraform installation method, simply set the value on line 147 of the /terraform/main.tf file to false, like this:

...
  enable_cluster_autoscaler = false
...

Then run the following script again to apply the changes to remove CAS (along with all the other resources Terraform created):

$ sh bootstrap.sh

Wait around five minutes, and confirm that CAS is no longer installed:

$ kubectl get all -n kube-system | grep autoscaler

If CAS was uninstalled successfully, the above command shouldn't return any results. You've come to the end of this lab. In this hands-on lab, you learned how to deploy CAS into an existing EKS cluster using Terraform. You simply have to activate a flag, and the Terraform module creates and configures all the resources needed to run CAS in EKS. You then saw CAS in action, adding new nodes when there were unscheduled pods, and then removing the extra nodes 10 minutes after the application was removed. Now that you have the CAS basics, let's move to the next section to learn about some of the CAS best practices on EKS.

Cluster Autoscaler Best Practices

Even though this book doesn't go in depth on CAS, you might be in a transition phase to Karpenter, and it would be wise to know the minimum set of configurations and considerations you need when using CAS on EKS to configure efficient Kubernetes clusters. Moreover, it will help you understand why some implementation details in Karpenter are different and why the following recommendations will impact your efficiency score.

Use homogeneous instance types

This practice might be the most important one in terms of compute efficiency. To configure a node group, one of the required parameters is the instance type that will be used to launch EC2 instances. A best practice in AWS regarding instance types is to configure not just one type but as many types as possible (the more, the merrier). The reason for this recommendation is to secure the capacity you'll need (although, to truly reserve capacity, there are other mechanisms like On-Demand Capacity Reservations (ODCR), but that's out of scope for now). When EC2 can't launch instances of a specific instance type (the cloud is not infinite), the Auto Scaling group can launch an EC2 instance of another type from the multiple ones you've configured that you're flexible to use. This is known as instance type diversification.

If your node group is using EC2 Spot instances, then diversification becomes even more important. EC2 Spot instances are spare On-Demand capacity, and when On-Demand needs that capacity back, you receive a Spot interruption. This means that the EC2 Spot instance will be terminated in two minutes. However, the good news is that the Auto Scaling group can launch an EC2 Spot instance of another type where there's more spare capacity available, and it will pick the cheapest one when there are instance types with similar spare capacity.

But how is all of this relevant to CAS and using homogeneous instance types? Well, when you use multiple instance types, CAS will consider the first type in the list to calculate how many extra nodes are required, based on the specs of that first instance type. If the list has types with different specs, and the Auto Scaling group ends up launching instances with different specs than what CAS used to calculate the number of extra nodes, you might end up either wasting capacity (nodes with more resources than the first instance type) or constraining the cluster with smaller nodes (nodes with fewer resources than the first instance type). Of course, at some point, CAS will either remove unnecessary nodes or launch new nodes if there are still unscheduled pods. But while that happens, you'll be paying for resources you don't really need or experiencing degradation in your workloads.

Therefore, the recommendation is to use homogeneous instance types, that is, instance types with the same CPU and memory specs. If CAS uses the first instance type to decide how many extra nodes to launch, it won't matter which instance type the Auto Scaling group ends up launching, because it will have the same specs CAS considered initially.
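To illustrate, a Spot node group made of homogeneous 4 vCPU / 16 GiB instance types could look like the following sketch, written in the style of the terraform-aws-modules/eks module (the group name and values are illustrative and may differ from the configuration in the book's repository):

eks_managed_node_groups = {
  spot_4vcpu_16gb = {
    capacity_type = "SPOT"
    # All of these instance types have 4 vCPUs and 16 GiB of memory
    instance_types = ["m5.xlarge", "m5a.xlarge", "m5d.xlarge", "m6i.xlarge"]
    min_size     = 0
    max_size     = 10
    desired_size = 2
  }
}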

Limit the number of node groups

While the previous recommendation was to use as many homogeneous instance types as possible in one node group, you also need to limit the number of node groups in the cluster. The reason lies in how CAS works. In order to optimize pod scheduling, CAS scans the cluster every 10 seconds (by default) and loads into memory information about the cluster, like pods, nodes, and EKS node groups, to simulate scheduling and make decisions. For example, if there are multiple node groups, CAS must decide which group to scale out. Therefore, the larger the number of node groups you have, the slower CAS might be in scaling out your cluster. So, keep the number of node groups to a minimum.

You might be wondering why someone would need multiple node groups in a cluster. In the previous section, you learned that to optimize efficiency in a cluster, it's recommended to use homogeneous instance types to mitigate any compute capacity constraints you might face when launching nodes, especially when you are using Spot instances. Another reason could be that organizations use multiple node groups to isolate workloads in a cluster. Instead, if possible, isolate workloads at the Kubernetes namespace level.

Configure a sane scaling speed

While it might be tempting to configure a very small scan interval to achieve faster scaling, this approach can lead to unintended consequences. Although a shorter interval could theoretically make CAS more responsive with spiky workloads, or in cases where speed matters a lot, it's important to consider that launching new EC2 instances takes time, often several minutes. Setting an overly aggressive scan interval might result in CAS making unnecessary API calls before the previously requested nodes are even ready. Moreover, frequent API calls can quickly exhaust rate limits imposed by AWS, potentially leading to API throttling or even service disruptions.

A good practice is to set the scan interval to a value that balances responsiveness with API efficiency. For example, setting the interval to 1 minute instead of the default 10 seconds can significantly reduce API call volume while only marginally increasing the scale-up time. Don't modify the default configuration unless you've run proper testing and understand which values work for your use case. You can adjust the scan interval using the --scan-interval flag. Additionally, you can fine-tune scale-down behavior using flags like --scale-down-delay-after-add, --scale-down-delay-after-delete, and --scale-down-delay-after-failure. These allow you to control how long CAS waits before considering scale-down operations after various cluster events.
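For reference, these flags are passed as command-line arguments to the CAS container; the values below are only illustrative, not recommendations:

--scan-interval=1m
--scale-down-delay-after-add=10m
--scale-down-delay-after-delete=10s
--scale-down-delay-after-failure=3m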

Use a Priority Expander with Multiple Node Groups

Cluster Autoscaler uses expanders to decide which node group to scale when multiple options are available. The default expander is random, which, as the name suggests, chooses a node group at random. However, when working with multiple node groups, especially in large and complex environments, you might want more control over the scaling decisions. This is where the priority expander becomes valuable (not many people know this configuration exists and why it's useful). By using the --expander=priority flag, you can define a specific order in which CAS should consider node groups for scaling. This is particularly useful when you have a preferred set of instance types or when using Spot instances.

Let's say you're using Spot instances with different size tiers because, as you've learned before, it's a good practice to use node groups with homogeneous instance types. Or maybe you'd like to give priority to instances from the latest generations. With a priority expander, you can define a priority list of which node groups CAS should use, like this:

  • A node group with only large Spot instances
  • A node group with xlarge Spot instances
  • A node group with 2xlarge Spot instances

CAS will try to scale the first group, and only if it can't (due to capacity constraints or other issues) will it move to the second, and so on. You can define this priority list in a ConfigMap like the following one (note that, with the priority expander, higher numbers mean higher priority):

apiVersion: v1
kind: ConfigMap
metadata:
  name: spot-priority-expander
  namespace: kube-system
data:
  priorities: |-
    30:
      - .*-spot-large
    20:
      - .*-spot-xlarge
    10:
      - .*-spot-2xlarge

While this approach provides fine-grained control, it's important to note that it's a sequential process. If capacity isn't available in the higher-priority groups, it may take longer to eventually scale up using lower-priority options, even if you've configured a short scan interval.

Use a single Availability Zone for persistent volumes

When working with persistent volumes, particularly Amazon EBS volumes, it's crucial to consider Availability Zone (AZ) placement. EBS volumes are AZ-specific, meaning a pod using an EBS volume must be scheduled in the same AZ as the volume. If you're using multi-AZ node groups with persistent volumes, you risk scenarios where pods can't be scheduled because they're assigned to nodes in a different AZ from their volumes. This can lead to data unavailability or, in worst-case scenarios, data loss. This consideration is particularly important for applications that rely on data sharding or replication, such as Elasticsearch. In these cases, improper AZ management could impair the application's ability to maintain data integrity and availability.

To mitigate these risks, consider using single-AZ node groups for workloads that depend on persistent volumes. This ensures that pods are always scheduled in the same AZ as their volumes. Alternatively, if you need multi-AZ resilience, consider using storage solutions that span multiple AZs, such as Amazon EFS or Amazon FSx for Lustre. An additional benefit of this approach is the reduction of inter-AZ network traffic and latency within the cluster, which can help optimize costs and improve application performance. When data and compute resources are co-located in the same AZ, you avoid the additional latency and potential data transfer costs associated with cross-AZ communication.

Now that you've seen the key practices to follow when using CAS, it's time to expand your view of the autoscaling landscape. While CAS and Karpenter handle most of the heavy lifting for compute scaling, the Kubernetes ecosystem has additional tools that solve complementary problems, like balancing workloads more effectively or adjusting control plane components dynamically. In the next section, we'll take a brief look at some of these tools so that you can better understand your full range of options and when each might be appropriate to use.

Relevant Autoscalers

While CAS and Karpenter are the primary tools for managing node-level scaling in Kubernetes clusters, there are other autoscaling projects that address specific needs within the Kubernetes ecosystem. These tools complement the core autoscalers and work in harmony with the projects we've explored so far.

In this section, we'll briefly introduce three such projects: Descheduler, Cluster Proportional Autoscaler (CPA), and Cluster Proportional Vertical Autoscaler (CPVA). The aim is to help you get familiar with the autoscaling landscape in Kubernetes and identify when and where these tools might be useful in your own clusters. Each of these autoscalers serves a unique purpose:

  • Descheduler focuses on optimizing pod placement after initial scheduling.
  • CPA helps scale auxiliary services proportionally to the cluster size.
  • CPVA adjusts resource requests vertically, in proportion to the cluster's size.

Later, in Chapters 11 and 12, we'll see some of these tools in action to help you understand how and when they can be integrated into your autoscaling strategy.

Descheduler

While the initial scheduling of pods is handled by the Kubernetes scheduler, over time, the distribution of workloads can become uneven. This is where the Descheduler comes into play. The Descheduler is an open-source project that aims to improve cluster resource utilization by moving pods from over-utilized nodes to under-utilized ones. It runs as a Kubernetes add-on and uses various informers to monitor cluster resources and make decisions based on predefined policies.

Imagine a scenario where some nodes in your cluster are running at 90% capacity while others are barely breaking 30%. The Descheduler can identify this imbalance and evict pods from the highly utilized nodes, allowing them to be rescheduled on the under-utilized ones. This rebalancing not only improves overall cluster performance but can also lead to cost savings by optimizing resource usage. To use the Descheduler, you'll need to deploy its controller in your cluster and configure a Policy ConfigMap with your desired strategies. Common strategies include LowNodeUtilization and RemoveDuplicates. Once set up, the Descheduler will automatically analyze and rebalance your cluster based on these policies.

When using the Descheduler, it's important to adopt a cautious approach to minimize potential risks. For instance, it's recommended to use Pod Disruption Budgets (PDBs) to ensure a minimum number of pods remain available during descheduling operations. Moreover, carefully configure the Descheduler's strategies to match your cluster's specific needs, and consider using the dry-run (simulation) mode to test configurations without affecting live workloads. Implement a gradual rollout, starting with a small subset of your cluster and expanding as you gain confidence. Leverage node affinity and anti-affinity rules to guide pod placement and prevent critical workloads from being disrupted. And at this point in the book, you should have guessed already that proper resource requests and limits for pods will help the Descheduler make more informed decisions.
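
As a rough illustration, here is a minimal sketch of such a policy using the Descheduler's older v1alpha1 format, with the two strategies mentioned above enabled. The utilization thresholds are arbitrary example values, and newer Descheduler releases use a profile-based v1alpha2 schema instead, so check the documentation for the version you deploy:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:            # nodes below all of these are considered under-utilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:      # nodes above any of these are candidates for eviction
          "cpu": 50
          "memory": 50
          "pods": 50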

Cluster Proportional Autoscaler

CPA is an open-source project that scales workloads based on the size of the cluster itself, rather than on metrics such as CPU or memory usage. The primary purpose of CPA is to maintain an appropriate number of replicas for cluster-wide services as your cluster grows or shrinks. This makes it particularly useful for managing system components such as the cluster DNS or monitoring services, which typically need to scale in proportion to the overall cluster size. CPA is currently in beta, but I've seen many customers using it to scale workloads without problems.

CPA is a controller that continuously monitors the number of schedulable nodes (that is, nodes available to run new pods) and the number of CPU cores in your cluster. Based on this information, it adjusts the number of replicas for the target resource using either linear or ladder scaling methods. To use CPA, you'll need to deploy it as a separate controller in your cluster and configure a ConfigMap with your desired scaling parameters. You'll then set up the target workload (such as a Deployment) to be managed by CPA. It's important to note that CPA may not be the best choice for application-specific workloads that need to scale based on their own unique metrics or usage patterns.
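
To make the linear method more concrete, here is a minimal sketch of a CPA ConfigMap with illustrative numbers; the ConfigMap name is a placeholder, and CPA is pointed at it (and at its target workload) through its command-line flags:

apiVersion: v1
kind: ConfigMap
metadata:
  name: dns-autoscaler        # placeholder; must match the ConfigMap CPA is configured to watch
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 2,
      "max": 10,
      "preventSinglePointFailure": true
    }

With the linear method, CPA roughly computes the replica count as max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)), clamped between min and max, so larger clusters automatically get more replicas of the target workload.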

Cluster Proportional Vertical Autoscaler

While CPA focuses on horizontal scaling, CPVA addresses the vertical scaling needs of cluster-wide services. CPVA's primary function is to automatically adjust resource requests for workloads based on the size of your cluster. This makes it a nice fit for system components or workloads that need to scale vertically as your cluster grows or shrinks. It's worth noting that CPVA is also in beta.

Similar to CPA, CPVA continuously monitors the number of schedulable nodes and cores in your cluster, using this information to adjust the CPU and memory requests of target workloads. CPVA supports linear scaling and allows you to set configurable minimum and maximum resource limits. To use CPVA in your cluster, you need to deploy the controller and configure a ConfigMap with your desired scaling parameters. You then target the workloads you want CPVA to manage, which can include Deployments, DaemonSets, or ReplicaSets.

Before closing this section on relevant autoscalers, it's important to remember that our aim wasn't to provide an exhaustive deep dive into each tool. Rather, we sought to broaden your awareness of the autoscaling tools related to the Kubernetes data plane. The Descheduler, CPA, and CPVA each offer unique approaches to optimizing cluster resources, and knowing when and how to leverage them can significantly enhance your cluster's efficiency. However, the key takeaway is not the tools themselves, but the overarching goal they serve. Regardless of which autoscalers you choose to implement, the ultimate objective remains the same: to create and maintain an efficient Kubernetes cluster that optimally utilizes resources, scales smoothly with demand, and provides a robust platform for your applications.

Summary

In this chapter, you learned about the role of data plane autoscaling in Kubernetes, with a focus on two primary goals: ensuring sufficient nodes are available for all pods to be scheduled, and removing underutilized resources to reduce waste, thereby optimizing costs and increasing efficiency.

We began by examining CAS, a tool for managing node-level scaling in Kubernetes clusters. We delved into its operation within AWS EKS, discussing best practices for implementation and the challenges it addresses. We then introduced Karpenter, another node-level scaling solution, highlighting its differences from CAS and the different approach it brings to the table. We'll dedicate the following three chapters to Karpenter.

It's very important that you don't forget that data plane autoscalers like CAS and Karpenter are responsible for cluster-level resource management only. They work in conjunction with application-level scaling solutions such as HPA, VPA, or KEDA. This complementary relationship ensures that both the infrastructure and the applications running on it can scale efficiently to meet demand. It also means that data plane autoscalers are not responsible for scheduling pods: their role is to provide or remove the underlying infrastructure resources, while the Kubernetes scheduler remains in charge of placing pods on available nodes.

We finished by exploring other relevant autoscalers, including the Descheduler, CPA, and CPVA. While these tools serve specific purposes and can enhance cluster efficiency in certain scenarios, they are not replacements for the core data plane autoscalers.

As we move forward, remember that the ultimate goal is to maintain an efficient Kubernetes cluster. Whether you choose CAS, Karpenter, or a combination of different autoscaling tools, the focus should always be on optimizing resource utilization, reducing costs, and ensuring your applications have the resources they need to perform optimally. In the next chapters, we'll focus on Karpenter to close the Kubernetes autoscaling loop. So let's get into it.