February 28, 2024

Simplify Kafka cluster management: Your Step-by-Step Guide to Cruise Control

This blog delves into how cruise control can help with the management of Apache Kafka clusters. We show you how to deploy it using Strimzi and how to configure this solution.

link-icon
Linkedin icon
X icon
Facebook icon

On this page

Managing Apache Kafka is a big task, especially when using the platform at scale. Enter cruise control, one of the first tools to automate workload and self-repair management. Due to the nature of Kafka, managing these clusters can be quite challenging. It often requires a lot of manual work. Cruise Control makes this process easier and less time-consuming. Enabling users to do less manual work with clusters running more efficiently and reliably.

In this blog post, we’ll explore Cruise Control and show how it changes the way you use Kafka. Whether you’re already using Kafka or just starting, this guide will help you understand how to use Cruise Control to improve your Kafka setup.

You will read about:

  • What Cruise Control is
  • How to get and deploy Cruise Control
  • How to implement rules for auto re-balancing
  • How to implement rules for self-healing capabilities

What is cruise control

Cruise Control has been developed by LinkedIn. At this company, they created and use Apache Kafka as their messaging platform to power various geographically-distributed applications at scale. Given the popularity of their platform, it is no surprise that Kafka usage has grown exponentially.

To help with managing their clusters, LinkedIn has created various tools to help with managing everything. Cruise control is one of them and the company has open-sourced this project in 2017. You could view Cruise control as an autopilot for Kafka clusters. Cruise Control constantly analyses, fine-tunes and steers deployed cluster towards optimal performance with manual actions. To summarize, Cruise Control does the job of automating workload rebalancing and self-healing, enabling optimal cluster performance.

Why do we need a cruise control

When you have an Apache Kafka cluster running, sometimes new partitions and replicas need to be created. This process uses rack aware round-robin scheduling that doesn’t take into account broker load, partition size and load. In addition, new partitions are not assigned to existing brokers. This means that load issues on existing partitions are not resolved. Because of this, smoothly scaling your Kafka cluster becomes hard due to you needing to manually balance everything.

That is where Cruise Control comes into play. Cruise control monitors and manages Kafka resources to make sure everything is up to standard. If not, Cruise Control will automatically try to revert resources back to meet the specified requirements.

How does Cruise Control work?

Cruise Control uses recent load data from replicas to improve the cluster. It regularly gathers resource usage info from brokers and partitions. This helps understand each partition’s traffic. Using this data, it figures out how each partition affects the brokers. Cruise Control creates a workload model to simulate the Kafka cluster’s workload. The goal optimizer then suggests optimization proposals based on user-defined goals.

Additionally, Cruise Control keeps an eye on broker health. If a broker fails, it automatically shifts replicas from the failed one to healthy ones, preventing redundancy loss.

How to get started with Cruise Control

Before we dive into cluster rebalancing with Cruise Control, we have to get Cruise Control. At Axual we work with Strimzi that’s why we first go through the steps meant for configuring Cruise Control with Strimzi. If you want to configure Cruise control with OS Apache Kafka. You can find the docs here.

Instructions for deploying through Strimzi

When you use Strimzi to deploy and manage Apache Kafka on Kubernetes, you need to deploy Cruise Control as a separate Kafka resource within your cluster.

You start with getting the metrics reporters installed in the brokers of your cluster and get the Cruise Control server deployment created. To do this, add the following .yaml file to the Kafka resources of your cluster.

# cruise-control.yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
 name: <your_cluster_name>
spec:
 cruiseControl:
   <cruise_control_specs>

This will use all the default Cruise Control settings to deploy the Cruise Control server. If you are adding Cruise Control to an existing cluster (instead of starting a fresh cluster) then you will see all your Kafka broker pods roll so that the metrics reporter can be installed.

For Cruise Control specs provide your broker capacity config under brokerCapacity and cruise control configuration under config
For more information about the cruise control configuration, you can check the following link.

Deploy cruise-control.yaml using kubectl:
kubectl apply -f cruise-control.yaml -nTo verify Cruise control deployment:
kubectl get pods -n | grep cruise-control

Cruise Control requires time to read raw Kafka metrics from the cluster. It may take a few minutes for the metrics of a newly started broker to stabilize. Cruise Control filters out inconsistent metrics, like when topic bytes-in is higher than broker bytes-in. As a result, the initial windows might not have sufficient valid partitions.

Cruise control configuration

Configuring Cruise Control involves setting up properties in the cruisecontrol.properties file.

Broker Configuration

Define the basic properties of your Kafka brokers, including their IDs, hostnames, ports, and any specific configurations like log directories or security settings.

Example:

Configure Kafka bootstrap endpoint:
`bootstrap.servers=localhost:9092`

In the example above, we have provided the bootstrap hostname and port associated with the Kafka cluster, which will be managed by cruise control.

Broker Security config

If your Kafka cluster is secured, you need to configure additional security properties such as SSL/TLS settings. This will allow Cruise Control to connect to the Kafka cluster securely.

Configure security protocol
security.protocol=SSL
ssl.truststore.location=/path/to/truststore.jks
ssl.truststore.password=truststore_password
ssl.keystore.location=/path/to/keystore.jks
ssl.keystore.password=keystore_password

In the example above, we configure a secure Cruise Control connection to the Kafka cluster using SSL. To do this you need to provide the truststore and keystore locations and passwords. In addition, make sure to add Kafka broker listener CA to your truststore.

Broker capacity configuration

This configuration helps Cruise Control make informed decisions about resource allocation and cluster scaling. It can be configured using capacity.json file, then provide the location of the capacity configuration in cruisecontrol.properties: capacity.config.file=/path/to/capacity.json

Example of capacity configuration in capacity.json:
{
 "brokerCapacities":[
   {
     "brokerId": "-1",
     "capacity": {
       "DISK": "100000",
       "CPU": "100",
       "NW_IN": "10000",
       "NW_OUT": "10000"
     },
     "doc": "This is the default capacity. Capacity unit used for disk is in MB, cpu is in percentage, network throughput is in KB."
   },
   {
     "brokerId": "0",
     "capacity": {
       "DISK": "500000",
       "CPU": "100",
       "NW_IN": "50000",
       "NW_OUT": "50000"
     },
     "doc": "This overrides the capacity for broker 0."
   }
 ]
}

In the above example, you can find broker capacity config. The first part of this config is the default capacity for the cluster brokers, as the brokerId is `-1`. If you provide another capacity config for a certain broker ID it will overwrite the default configuration configured for brokerId: -1, in the example configuration for brokerId `0`. For other brokers, the capacity config will be the default capacity configuration, which configured for brokerId: -1

Configurable capacity:

  • DISK: is the capacity threshold of disk allocated for the broker, which is “100000” MB for default config and “500000” MB for the broker 0
  • CPU: is the capacity threshold of the CPUs available for the broker which is 100%, which means 1 CPU

NW_IN/NW_OUT: this configuration is for inbound and outbound network capacity, in the example it configured to be 10000 KB for all the brokers except broker 0 which has network capacity of 50000 KB

Hard Goal configuration

The hard goal configuration specifies the hard goals that Cruise Control should optimize for. Goals could include balancing leader and follower replicas evenly across brokers, minimizing disk usage, or maximizing network throughput. By defining priorities, you inform Cruise Control about which goals are most important for your specific use case.

For example to add RackAwareGoal and ReplicaCapacityGoal:
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal

You will find more info about our hard goals configuration in the following section.

For more info check cruise control configuration documentation.

Apache Kafka cluster rebalancing

The first thing we are going to focus on is the rebalancing of Kafka clusters. A cluster can be unbalanced in different ways. Network utilization, RAM and CPU utilization and disk utilization can all be unbalanced. All these things can be unbalanced due to the partition distribution among Kafka brokers. We don’t mean the number of leaders and replicas within the brokers, but the CPU, memory and network usage of the different brokers. In addition, the Kafka cluster can be unbalanced in terms of storage use due to some partitions growing in size.

Before Cruise Control, rebalancing needed to be done manually by configuring the kafka-reassign-partitions.sh file. This required a lot of manual labour, thus being quite time-consuming.

Cruise Control helps with these previously mentioned manual tasks. It does this by specifying hard goals and soft goals. A hard goal is one that must be satisfied. Soft goals don’t have to be satisfied if that allows a hard goal to be satisfied.

Hard goals

  • Replicas that belong to the same partitions have to be in different racks
  • The utilization of each resource should be below certain threshold
  • All hard goals should be met every time Cruise Control re-balances a cluster

The current default goals we have in our cruise control config is:

RackAwareGoal: This goal aims to ensure that replicas of a partition are distributed across different racks within a Kafka cluster. The goal is to enhance fault tolerance by minimizing the risk of losing all replicas of a partition in case an entire rack goes down.

ReplicaCapacityGoal: This goal aims to ensure that the capacity of Kafka brokers is effectively utilized for storing replicas. It helps in achieving a balanced distribution of replicas across the available broker resources, such as disk space. For example:

  • A value of 5% means that Cruise Control will attempt to limit the movement or rebalancing of replicas to 5% of the total replicas per broker.
  • For example, if a broker hosts 100 replicas, this throttle setting will allow Cruise Control to move a maximum of 5 replicas (5% of 100) to or from that broker during any optimization cycle.

DiskCapacityGoal: This goal focuses on balancing the distribution of replicas across Kafka brokers based on their available disk capacity. This goal is particularly useful for preventing uneven disk space usage among brokers and ensuring efficient utilization of the available storage resources.

NetworkInboundCapacityGoal: Ensures that inbound network utilization of each broker is below the threshold

NetworkOutboundCapacityGoal: Ensures that outbound network utilization of each broker is below the threshold

CpuCapacityGoal: This goal is refer to the total amount of computational resources a Kafka broker

LeaderBytesInDistributionGoal: Attempts to equalize the leader bytes in rate on each host.

NetworkInboundUsageDistributionGoal: Attempts to keep the inbound network utilization variance among brokers within a certain range relative to the average inbound network utilization

NetworkOutboundUsageDistributionGoal: Attempts to keep the outbound network utilization variance among brokers within a certain range relative to the average outbound network utilization.

CpuUsageDistributionGoal: Attempts to keep the CPU usage variance among brokers within a certain range relative to the average CPU utilization.

Capacity unit used for disk is in MiB, cpu is in number of cores, network throughput is in KiB.

Soft goals

  • Resource utilization for each broker should not differ more from each other than a certain percentage.
  • The partition of each topic (and of the all topic) should be distributed as evenly as possible among the different brokers.

Cruise control tries to reduce load on Kafka brokers by moving the leader from one broker to another, also moving the partitions themselves to reduce disk usage.

Kafka cluster self-healing

In addition to re-balancing, Cruise Control also helps to automate self-healing of Kafka clusters. Within Cruise Control self-healing means optimizing the cluster to meet hard- or soft goals. The self-healing capabilities of Kafka Cruise Control typically include:

  1. Goal-Based Optimization: Cruise Control continuously monitors the Kafka cluster to ensure that predefined goals (such as even distribution of partitions across brokers or workload balancing) are met. If it detects any deviations from these goals, it takes actions to bring the cluster back to the desired state.
  1. Anomaly Detection: Cruise Control identifies anomalies or issues within a Kafka cluster by analysing metrics like broker CPU usage, disk utilization, partition leadership distribution, etc. Anomalies might include broker failures, underperforming partitions, or imbalanced loads.
  1. Automated Rebalancing: When Cruise Control detects an issue, it can automatically trigger rebalancing operations. For instance, if it detects an uneven distribution of partitions across brokers, it will initiate the migration of partitions to achieve a more balanced state.
  1. Resource Optimization: It can adjust resource allocations, like moving replicas to different brokers to alleviate hotspots, optimize disk usage, or reconfigure topics for better performance based on observed patterns.
  1. Cluster Scaling: Cruise Control can recommend and execute scaling actions, such as adding or removing broker instances, to accommodate changes in workload or to prevent resource constraints.

Cruise control UI

Cruise Control has different tabs to help and analyse Kafka Clusters. We will go through the different tabs to explain their purpose and what you can see when opened.

Cluster state

The cluster state tab shows you, as the name might suggest, the status of your Kafka cluster. This means that you can view the number of brokers, partitions, average replication factor and the partition status.

Cluster load

The cluster load tab allows you to view things associated with the current workload of a cluster.

this tab shows the number of replicas per broker, number of leader partitions, disk and CPU usage per broker, and network rate

You can find below description of network rate values:

  1. Leader In: “Leader In” typically represents the rate at which a broker is receiving data from leaders of other partitions. It’s a measure of how much data is being ingested by the broker from leaders.
  2. Follower In: “Follower In” is the rate at which a broker is receiving data from other followers. Followers replicate data from leaders to stay up-to-date. This metric shows the data flow to followers.
  3. Network Out: This metric shows the rate at which data is sent from a broker to other brokers or consumers. It measures the network traffic going out of the broker.
  4. Potential Out: “Potential Out” is a metric indicating the rate at which a broker could send data if needed. It’s often used to assess the capacity of a broker to handle additional load.
  5. LF Ratio (Leader to Follower Ratio): The LF Ratio is the ratio of leaders to followers on a broker. A high LF Ratio can indicate that a broker is overloaded with leadership responsibilities, potentially impacting performance.
  6. IO Ratio (Input/Output Ratio): The IO Ratio represents the ratio of input (e.g., data ingested) to output (e.g., data sent). It’s used to assess how efficiently a broker is handling data flow in relation to its processing and network capabilities.

Partition load

in this tab we can list partition load regarding CPU, disk, and network in/out, as example you can list the disk load for partitions sorted by max load, check the screenshot below:

Cruise control state

in this tab we can see the executor state and in progress tasks, Monitor state, Analyser state shows the goals status, Anomaly detector state to check the cluster health status, as the screenshot below shows, currently the cluster self healing is disabled, and we can see the recent goals violation.

Cruise control proposals

in this tab we can check what the proposed replicas and leader movements proposed by cruise control to get to the optimal state of kafka cluster and also we can check the goals status if we have any goals violation.

we can apply this the replicas and leader movements using Kafka cluster administration tab with Rebalance cluster option


Get in touch to discuss your specific use case with our Kafka architects. Or start you Kafka journey with Axual Platform. With Kafka solution, organizations can scale their development teams around a central Kafka. Our platform comes with built-in self-service, data governance, and security functionalities to unlock the full potential of Kafka for your development teams. The graphical user interface makes it easy for teams to control their clusters, topics, applications, and schemas from one, central overview.

Table name
Lorem ipsum
Lorem ipsum
Lorem ipsum

Answers to your questions about Axual’s All-in-one Kafka Platform

Are you curious about our All-in-one Kafka platform? Dive into our FAQs
for all the details you need, and find the answers to your burning questions.

What is Apache Kafka Cruise Control?

Apache Kafka Cruise Control is a tool designed to automate the management and optimization of Apache Kafka clusters. It provides features for cluster balancing, resource utilization, and performance monitoring, allowing operators to manage Kafka efficiently without manual intervention. Cruise Control helps ensure that Kafka clusters maintain optimal performance and stability by automatically redistributing partitions and managing broker loads.

How does Cruise Control improve Kafka cluster performance?

Cruise Control enhances Kafka cluster performance by continuously monitoring broker metrics and automatically rebalancing partitions based on predefined goals, such as optimizing load distribution and minimizing latency. It identifies underutilized or overutilized brokers and redistributes partitions to ensure a balanced workload, improving resource utilization and overall cluster efficiency.

What are the key features of Apache Kafka Cruise Control?

Key features of Apache Kafka Cruise Control include automatic partition rebalancing, real-time monitoring of broker metrics, a user-friendly REST API for management, and customizable goals for performance optimization. Additionally, it offers a simulation feature to preview the impact of proposed changes before applying them, ensuring safer and more informed decision-making in cluster management.

Emad Barsa
Emad Barsa
Software Engineer

Related blogs

View all
Rachel van Egmond
Rachel van Egmond
January 20, 2025
What is Kafka Software Used For? Real-Time Use Cases Explained
What is Kafka Software Used For? Real-Time Use Cases Explained

Explore what Kafka software is used for, from enabling real-time data streaming to powering event-driven applications. Learn how it transforms industries with seamless data handling.

Apache Kafka
Apache Kafka
Rachel van Egmond
Rachel van Egmond
January 17, 2025
Kafka Software explained
Kafka Software explained

Apache Kafka software has revolutionized how businesses manage and process real-time data. Developed by the Apache Software Foundation, Kafka serves as a distributed event store and stream-processing platform, powering everything from financial transactions and logistics tracking to IoT data analysis and customer interaction monitoring. In this blog, we'll explore how Apache Kafka enables event streaming, its versatile use cases, and why it’s the backbone of modern, data-driven applications. Whether you're new to Kafka or looking to deepen your understanding, this guide offers insights to help you harness the power of real-time data.

Apache Kafka
Apache Kafka
Rachel van Egmond
Rachel van Egmond
January 10, 2025
From Vendor Lock-In to Open Source: Alliander’s Success Story
From Vendor Lock-In to Open Source: Alliander’s Success Story

Alliander’s move to open-source Kafka highlights the power of independence, innovation, and adaptability. Explore their journey and key lessons for overcoming vendor lock-in challenges.

Apache Kafka for Business
Apache Kafka for Business