Check out:

Release blog 2025.2 - The Summer Release

February 28, 2024

Simplify Kafka cluster management: Your Step-by-Step Guide to Cruise Control

This blog delves into how cruise control can help with the management of Apache Kafka clusters. We show you how to deploy it using Strimzi and how to configure this solution.

Blog

›

Managing Apache Kafka is a big task, especially when using the platform at scale. Enter cruise control, one of the first tools to automate workload and self-repair management. Due to the nature of Kafka, managing these clusters can be quite challenging. It often requires a lot of manual work. Cruise Control makes this process easier and less time-consuming. Enabling users to do less manual work with clusters running more efficiently and reliably.

In this blog post, we’ll explore Cruise Control and show how it changes the way you use Kafka. Whether you’re already using Kafka or just starting, this guide will help you understand how to use Cruise Control to improve your Kafka setup.

You will read about:

What Cruise Control is
How to get and deploy Cruise Control
How to implement rules for auto re-balancing
How to implement rules for self-healing capabilities

What is cruise control

Cruise Control has been developed by LinkedIn. At this company, they created and use Apache Kafka as their messaging platform to power various geographically-distributed applications at scale. Given the popularity of their platform, it is no surprise that Kafka usage has grown exponentially.

To help with managing their clusters, LinkedIn has created various tools to help with managing everything. Cruise control is one of them and the company has open-sourced this project in 2017. You could view Cruise control as an autopilot for Kafka clusters. Cruise Control constantly analyses, fine-tunes and steers deployed cluster towards optimal performance with manual actions. To summarize, Cruise Control does the job of automating workload rebalancing and self-healing, enabling optimal cluster performance.

Why do we need a cruise control

When you have an Apache Kafka cluster running, sometimes new partitions and replicas need to be created. This process uses rack aware round-robin scheduling that doesn’t take into account broker load, partition size and load. In addition, new partitions are not assigned to existing brokers. This means that load issues on existing partitions are not resolved. Because of this, smoothly scaling your Kafka cluster becomes hard due to you needing to manually balance everything.

That is where Cruise Control comes into play. Cruise control monitors and manages Kafka resources to make sure everything is up to standard. If not, Cruise Control will automatically try to revert resources back to meet the specified requirements.

How does Cruise Control work?

Cruise Control uses recent load data from replicas to improve the cluster. It regularly gathers resource usage info from brokers and partitions. This helps understand each partition’s traffic. Using this data, it figures out how each partition affects the brokers. Cruise Control creates a workload model to simulate the Kafka cluster’s workload. The goal optimizer then suggests optimization proposals based on user-defined goals.

Additionally, Cruise Control keeps an eye on broker health. If a broker fails, it automatically shifts replicas from the failed one to healthy ones, preventing redundancy loss.

How to get started with Cruise Control

Before we dive into cluster rebalancing with Cruise Control, we have to get Cruise Control. At Axual we work with Strimzi that’s why we first go through the steps meant for configuring Cruise Control with Strimzi. If you want to configure Cruise control with OS Apache Kafka. You can find the docs here.

Instructions for deploying through Strimzi

When you use Strimzi to deploy and manage Apache Kafka on Kubernetes, you need to deploy Cruise Control as a separate Kafka resource within your cluster.

You start with getting the metrics reporters installed in the brokers of your cluster and get the Cruise Control server deployment created. To do this, add the following .yaml file to the Kafka resources of your cluster.

# cruise-control.yaml apiVersion: kafka.strimzi.io/v1beta1 kind: Kafka metadata: name: <your_cluster_name> spec: cruiseControl: <cruise_control_specs>

This will use all the default Cruise Control settings to deploy the Cruise Control server. If you are adding Cruise Control to an existing cluster (instead of starting a fresh cluster) then you will see all your Kafka broker pods roll so that the metrics reporter can be installed.

For Cruise Control specs provide your broker capacity config under brokerCapacity and cruise control configuration under config
For more information about the cruise control configuration, you can check the following link.

Deploy cruise-control.yaml using kubectl: kubectl apply -f cruise-control.yaml -nTo verify Cruise control deployment: kubectl get pods -n | grep cruise-control

Cruise Control requires time to read raw Kafka metrics from the cluster. It may take a few minutes for the metrics of a newly started broker to stabilize. Cruise Control filters out inconsistent metrics, like when topic bytes-in is higher than broker bytes-in. As a result, the initial windows might not have sufficient valid partitions.

Cruise control configuration

Configuring Cruise Control involves setting up properties in the cruisecontrol.properties file.

Broker Configuration

Define the basic properties of your Kafka brokers, including their IDs, hostnames, ports, and any specific configurations like log directories or security settings.

Example:

Configure Kafka bootstrap endpoint: `bootstrap.servers=localhost:9092`

In the example above, we have provided the bootstrap hostname and port associated with the Kafka cluster, which will be managed by cruise control.

Broker Security config

If your Kafka cluster is secured, you need to configure additional security properties such as SSL/TLS settings. This will allow Cruise Control to connect to the Kafka cluster securely.

Configure security protocol security.protocol=SSL ssl.truststore.location=/path/to/truststore.jks ssl.truststore.password=truststore_password ssl.keystore.location=/path/to/keystore.jks ssl.keystore.password=keystore_password

In the example above, we configure a secure Cruise Control connection to the Kafka cluster using SSL. To do this you need to provide the truststore and keystore locations and passwords. In addition, make sure to add Kafka broker listener CA to your truststore.

Broker capacity configuration

This configuration helps Cruise Control make informed decisions about resource allocation and cluster scaling. It can be configured using capacity.json file, then provide the location of the capacity configuration in cruisecontrol.properties: capacity.config.file=/path/to/capacity.json

Example of capacity configuration in capacity.json: { "brokerCapacities":[ { "brokerId": "-1", "capacity": { "DISK": "100000", "CPU": "100", "NW_IN": "10000", "NW_OUT": "10000" }, "doc": "This is the default capacity. Capacity unit used for disk is in MB, cpu is in percentage, network throughput is in KB." }, { "brokerId": "0", "capacity": { "DISK": "500000", "CPU": "100", "NW_IN": "50000", "NW_OUT": "50000" }, "doc": "This overrides the capacity for broker 0." } ] }

In the above example, you can find broker capacity config. The first part of this config is the default capacity for the cluster brokers, as the brokerId is `-1`. If you provide another capacity config for a certain broker ID it will overwrite the default configuration configured for brokerId: -1, in the example configuration for brokerId `0`. For other brokers, the capacity config will be the default capacity configuration, which configured for brokerId: -1

Configurable capacity:

DISK: is the capacity threshold of disk allocated for the broker, which is “100000” MB for default config and “500000” MB for the broker 0
CPU: is the capacity threshold of the CPUs available for the broker which is 100%, which means 1 CPU

NW_IN/NW_OUT: this configuration is for inbound and outbound network capacity, in the example it configured to be 10000 KB for all the brokers except broker 0 which has network capacity of 50000 KB

Hard Goal configuration

The hard goal configuration specifies the hard goals that Cruise Control should optimize for. Goals could include balancing leader and follower replicas evenly across brokers, minimizing disk usage, or maximizing network throughput. By defining priorities, you inform Cruise Control about which goals are most important for your specific use case.

For example to add RackAwareGoal and ReplicaCapacityGoal: hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal, com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal

You will find more info about our hard goals configuration in the following section.

For more info check cruise control configuration documentation.

Apache Kafka cluster rebalancing

The first thing we are going to focus on is the rebalancing of Kafka clusters. A cluster can be unbalanced in different ways. Network utilization, RAM and CPU utilization and disk utilization can all be unbalanced. All these things can be unbalanced due to the partition distribution among Kafka brokers. We don’t mean the number of leaders and replicas within the brokers, but the CPU, memory and network usage of the different brokers. In addition, the Kafka cluster can be unbalanced in terms of storage use due to some partitions growing in size.

Before Cruise Control, rebalancing needed to be done manually by configuring the kafka-reassign-partitions.sh file. This required a lot of manual labour, thus being quite time-consuming.

Cruise Control helps with these previously mentioned manual tasks. It does this by specifying hard goals and soft goals. A hard goal is one that must be satisfied. Soft goals don’t have to be satisfied if that allows a hard goal to be satisfied.

Hard goals

Replicas that belong to the same partitions have to be in different racks
The utilization of each resource should be below certain threshold
All hard goals should be met every time Cruise Control re-balances a cluster

The current default goals we have in our cruise control config is:

RackAwareGoal: This goal aims to ensure that replicas of a partition are distributed across different racks within a Kafka cluster. The goal is to enhance fault tolerance by minimizing the risk of losing all replicas of a partition in case an entire rack goes down.

ReplicaCapacityGoal: This goal aims to ensure that the capacity of Kafka brokers is effectively utilized for storing replicas. It helps in achieving a balanced distribution of replicas across the available broker resources, such as disk space. For example:

A value of 5% means that Cruise Control will attempt to limit the movement or rebalancing of replicas to 5% of the total replicas per broker.
For example, if a broker hosts 100 replicas, this throttle setting will allow Cruise Control to move a maximum of 5 replicas (5% of 100) to or from that broker during any optimization cycle.

DiskCapacityGoal: This goal focuses on balancing the distribution of replicas across Kafka brokers based on their available disk capacity. This goal is particularly useful for preventing uneven disk space usage among brokers and ensuring efficient utilization of the available storage resources.

NetworkInboundCapacityGoal: Ensures that inbound network utilization of each broker is below the threshold

NetworkOutboundCapacityGoal: Ensures that outbound network utilization of each broker is below the threshold

CpuCapacityGoal: This goal is refer to the total amount of computational resources a Kafka broker

LeaderBytesInDistributionGoal: Attempts to equalize the leader bytes in rate on each host.

NetworkInboundUsageDistributionGoal: Attempts to keep the inbound network utilization variance among brokers within a certain range relative to the average inbound network utilization

NetworkOutboundUsageDistributionGoal: Attempts to keep the outbound network utilization variance among brokers within a certain range relative to the average outbound network utilization.

CpuUsageDistributionGoal: Attempts to keep the CPU usage variance among brokers within a certain range relative to the average CPU utilization.

Capacity unit used for disk is in MiB, cpu is in number of cores, network throughput is in KiB.

Soft goals

Resource utilization for each broker should not differ more from each other than a certain percentage.
The partition of each topic (and of the all topic) should be distributed as evenly as possible among the different brokers.

Cruise control tries to reduce load on Kafka brokers by moving the leader from one broker to another, also moving the partitions themselves to reduce disk usage.

Kafka cluster self-healing

In addition to re-balancing, Cruise Control also helps to automate self-healing of Kafka clusters. Within Cruise Control self-healing means optimizing the cluster to meet hard- or soft goals. The self-healing capabilities of Kafka Cruise Control typically include:

Goal-Based Optimization: Cruise Control continuously monitors the Kafka cluster to ensure that predefined goals (such as even distribution of partitions across brokers or workload balancing) are met. If it detects any deviations from these goals, it takes actions to bring the cluster back to the desired state.

Anomaly Detection: Cruise Control identifies anomalies or issues within a Kafka cluster by analysing metrics like broker CPU usage, disk utilization, partition leadership distribution, etc. Anomalies might include broker failures, underperforming partitions, or imbalanced loads.

Automated Rebalancing: When Cruise Control detects an issue, it can automatically trigger rebalancing operations. For instance, if it detects an uneven distribution of partitions across brokers, it will initiate the migration of partitions to achieve a more balanced state.

Resource Optimization: It can adjust resource allocations, like moving replicas to different brokers to alleviate hotspots, optimize disk usage, or reconfigure topics for better performance based on observed patterns.

Cluster Scaling: Cruise Control can recommend and execute scaling actions, such as adding or removing broker instances, to accommodate changes in workload or to prevent resource constraints.

Cruise control UI

Cruise Control has different tabs to help and analyse Kafka Clusters. We will go through the different tabs to explain their purpose and what you can see when opened.

Cluster state

The cluster state tab shows you, as the name might suggest, the status of your Kafka cluster. This means that you can view the number of brokers, partitions, average replication factor and the partition status.

Cluster load

The cluster load tab allows you to view things associated with the current workload of a cluster.

this tab shows the number of replicas per broker, number of leader partitions, disk and CPU usage per broker, and network rate

You can find below description of network rate values:

Leader In: “Leader In” typically represents the rate at which a broker is receiving data from leaders of other partitions. It’s a measure of how much data is being ingested by the broker from leaders.
Follower In: “Follower In” is the rate at which a broker is receiving data from other followers. Followers replicate data from leaders to stay up-to-date. This metric shows the data flow to followers.
Network Out: This metric shows the rate at which data is sent from a broker to other brokers or consumers. It measures the network traffic going out of the broker.
Potential Out: “Potential Out” is a metric indicating the rate at which a broker could send data if needed. It’s often used to assess the capacity of a broker to handle additional load.
LF Ratio (Leader to Follower Ratio): The LF Ratio is the ratio of leaders to followers on a broker. A high LF Ratio can indicate that a broker is overloaded with leadership responsibilities, potentially impacting performance.
IO Ratio (Input/Output Ratio): The IO Ratio represents the ratio of input (e.g., data ingested) to output (e.g., data sent). It’s used to assess how efficiently a broker is handling data flow in relation to its processing and network capabilities.

Partition load

in this tab we can list partition load regarding CPU, disk, and network in/out, as example you can list the disk load for partitions sorted by max load, check the screenshot below:

Cruise control state

in this tab we can see the executor state and in progress tasks, Monitor state, Analyser state shows the goals status, Anomaly detector state to check the cluster health status, as the screenshot below shows, currently the cluster self healing is disabled, and we can see the recent goals violation.

Cruise control proposals

in this tab we can check what the proposed replicas and leader movements proposed by cruise control to get to the optimal state of kafka cluster and also we can check the goals status if we have any goals violation.

we can apply this the replicas and leader movements using Kafka cluster administration tab with Rebalance cluster option

Get in touch to discuss your specific use case with our Kafka architects. Or start you Kafka journey with Axual Platform. With Kafka solution, organizations can scale their development teams around a central Kafka. Our platform comes with built-in self-service, data governance, and security functionalities to unlock the full potential of Kafka for your development teams. The graphical user interface makes it easy for teams to control their clusters, topics, applications, and schemas from one, central overview.

‍

Download the Use Case

Download for free; no credentials are needed

Table name

Lorem ipsum

Answers to your questions about Axual’s All-in-one Kafka Platform

Are you curious about our All-in-one Kafka platform? Dive into our FAQs
for all the details you need, and find the answers to your burning questions.

What is Apache Kafka Cruise Control?

Apache Kafka Cruise Control is a tool designed to automate the management and optimization of Apache Kafka clusters. It provides features for cluster balancing, resource utilization, and performance monitoring, allowing operators to manage Kafka efficiently without manual intervention. Cruise Control helps ensure that Kafka clusters maintain optimal performance and stability by automatically redistributing partitions and managing broker loads.

How does Cruise Control improve Kafka cluster performance?

Cruise Control enhances Kafka cluster performance by continuously monitoring broker metrics and automatically rebalancing partitions based on predefined goals, such as optimizing load distribution and minimizing latency. It identifies underutilized or overutilized brokers and redistributes partitions to ensure a balanced workload, improving resource utilization and overall cluster efficiency.

What are the key features of Apache Kafka Cruise Control?

Key features of Apache Kafka Cruise Control include automatic partition rebalancing, real-time monitoring of broker metrics, a user-friendly REST API for management, and customizable goals for performance optimization. Additionally, it offers a simulation feature to preview the impact of proposed changes before applying them, ensuring safer and more informed decision-making in cluster management.

Related blogs

View all

Jeroen van Disseldorp

July 1, 2025

Release blog 2025.2 - The Summer Release

The Axual 2025.2 summer release delivers targeted improvements for enterprise-grade Kafka deployments. In this post, we walk through the latest updates—from enhanced audit tracking and OAuth support in the REST Proxy to smarter stream processing controls in KSML. These features are designed to solve the real-world governance, security, and operational challenges enterprises face when scaling Kafka across teams and systems.

Axual Product

Jeroen van Disseldorp

April 4, 2025

Release blog 2025.1 - The Spring Release

Axual 2025.1 is here with exciting new features and updates. Whether you're strengthening security, improving observability, or bridging old legacy systems with modern event systems, like Kafka, Axual 2025.1 is built to keep you, your fellow developers, and engineers ahead of the game.

Axual Product

February 21, 2025

Kafka Consumer Groups and Offsets: What You Need to Know

Consumer group offsets are essential components in Apache Kafka, a leading platform for handling real-time event streaming. By allowing organizations to scale efficiently, manage data consumption, and track progress in data processing, Kafka’s consumer groups and offsets ensure reliability and performance. In this blog post, we'll dive deep into these concepts, explain how consumer groups and offsets work, and answer key questions about their functionality. We'll also explore several practical use cases that show how Kafka’s consumer groups and offsets drive real business value, from real-time analytics to machine learning pipelines.

Apache Kafka

Simplify Kafka cluster management: Your Step-by-Step Guide to Cruise Control

On this page

What is cruise control

Why do we need a cruise control

How does Cruise Control work?

How to get started with Cruise Control

Instructions for deploying through Strimzi

Cruise control configuration

Broker Configuration

Broker Security config

Broker capacity configuration

Hard Goal configuration

Apache Kafka cluster rebalancing

Hard goals

Soft goals

Kafka cluster self-healing

Cruise control UI

Cluster state

Cluster load

Partition load

Cruise control state

Cruise control proposals

Download the Use Case

Answers to your questions about Axual’s All-in-one Kafka Platform

Related blogs