Introduction: from Zookeeper to KRaft
For years, Zookeeper has been integral to Kafka deployments as a reliable metadata management system. However, Zookeeper has its limitations, and with Kafka’s evolution through KIP-500 the shift to KRaft, a self-managed metadata quorum, marks a new era. This transition is becoming pressing as Zookeeper’s deprecation accelerates, with its removal planned for Kafka 4.0. Adapting now ensures your Kafka clusters remain future-ready and efficient.
Zookeeper has been a cornerstone of any Kafka deployment for as long as anybody can remember. It has proven to be a reliable part of the Kafka landscape as a dedicated, centralized metadata management system. However, Zookeeper is not without its downsides, which mainly come down to limiting the scalability of the Kafka cluster itself. In comes the famous KIP-500, which proposes a different approach to metadata management for Kafka: a self-managed metadata quorum. With the implementation of this improvement, Zookeeper has been marked as deprecated since the Kafka 3.5.0 release on June 15, 2023. At the time of writing, Kafka 3.9 has just been released, and the removal of Zookeeper in Kafka 4.0 is just around the corner. This calls for action on our existing Kafka cluster; the migration from Zookeeper to KRaft is in order!
Simplify Kubernetes Management with the Strimzi Operator
Deploying Kafka can be done in a number of ways. A very popular one nowadays is deploying your Kafka clusters in Kubernetes. Of course, this brings a set of challenges that need to be considered. How will security be arranged, how will certificates be managed, which kinds of listeners will be exposed and in what way... and several other unknowns. Enter the Strimzi Operator (https://strimzi.io/). This tool will make managing one or more Kafka clusters much easier. It has grown into a CNCF incubating project, has proven to be reliable, and shows great promise for managing your Kafka deployments in the future.
Effortless Kafka Migration
Combine the need to migrate your Kafka cluster from Zookeeper to KRaft with the use of the Strimzi Operator, and you will notice that the process is quite a breeze. It turns out that the Strimzi Operator takes away a lot of the difficulty of reconfiguring your Kafka nodes. How Strimzi takes care of the migration while you watch has already been documented quite well: an explanation of how the migration works can be found in the Strimzi blog post on the subject, and the exact steps to perform in the Strimzi documentation. With those covered, this blog post focuses on what happens to the Kafka nodes themselves: which configuration changes, how this affects the Kafka cluster, and when nodes have to restart.
The starting position: Zookeeper-based deployment
The initial state of the cluster is relatively simple: there are 3 Kafka brokers and 3 Zookeeper nodes. The Kafka version used is 3.8.0, deployed using the Strimzi Operator version 0.44.0. Alongside the Kafka cluster, there is a Kafka exporter deployed; this component exposes relevant metrics regarding the health of the Kafka cluster. There is Cruise Control, and there is the Strimzi Entity Operator (topic and user) for the creation and modification of topics, users, and ACLs.
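For reference, a minimal sketch of what such a Kafka resource might look like. This is not the actual cluster definition: the cluster name, namespace, listener, and storage settings are illustrative.

```sh
# Illustrative starting point: a Zookeeper-based Kafka resource managed by
# Strimzi (names and sizes are assumptions, not taken from the real cluster).
kubectl apply -f - <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.8.0
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
  entityOperator:          # topic and user operators for topics, users, ACLs
    topicOperator: {}
    userOperator: {}
  cruiseControl: {}
  kafkaExporter: {}        # exposes cluster health metrics
EOF
```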
Starting the migration
Following the steps outlined in the Strimzi blog regarding the migration, the first step is to create a new KafkaNodePool resource, which defines the Kafka controllers to be created. Once this resource is present, the Kafka resource itself is annotated with the `strimzi.io/kraft="migration"` annotation. This annotation triggers the Strimzi Operator to do a few things. It will deploy a set of Kafka controllers, configured with Zookeeper connection details, an inter-broker listener, and the KRaft configuration itself, such as the quorum voters and the dedicated controller listener. And do not forget the migration flag: `zookeeper.metadata.migration.enable=true`. Once these controllers are up and running, the brokers are restarted, specifically to enable migration. The notable changes in the broker configuration are the KRaft connection details (quorum voters and listeners) and the migration flag. Once all the brokers have restarted, the metadata migration is performed. A sketch of these first steps is shown below.
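A hedged sketch of this first step, assuming a cluster named my-cluster in the kafka namespace (both illustrative): a controller-role KafkaNodePool, followed by the annotation that kicks off the migration.

```sh
# Create the controller node pool. This assumes the cluster's brokers are
# already managed through node pools, as the Strimzi migration requires.
kubectl apply -f - <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controller
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster   # must match the Kafka resource name
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: persistent-claim
    size: 20Gi
EOF

# Kick off the migration by annotating the Kafka resource.
kubectl annotate kafka my-cluster -n kafka strimzi.io/kraft="migration" --overwrite
```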
Intermezzo: keeping an eye out
It is quite obvious that Strimzi takes a lot of the manual work out of the picture by allowing the user (the human operator) to simply set a single annotation to kick-start the migration process. However, due diligence is a thing, and it is wise to keep an eye on your cluster while the migration is happening. The first thing to watch is the Kafka resource in Kubernetes itself. Its status contains a metadata state, which starts as "ZooKeeper" and moves through different phases during the migration. When the migration has started, this state changes to "KRaftMigration". During the initial phase of the migration it goes through "KRaftDualWriting" and ends up at "KRaftPostMigration". This is the point where the user is allowed to either finalize the migration or perform a rollback. It should be noted that this state is Strimzi-specific; technically, the cluster is still in dual-writing mode, with metadata in the controllers as well as copies in Zookeeper.
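A simple way to follow these phases, assuming the same illustrative cluster name, is to read the metadata state field from the Kafka resource status:

```sh
# Print the current migration phase (ZooKeeper -> KRaftMigration ->
# KRaftDualWriting -> KRaftPostMigration -> PreKRaft -> KRaft).
kubectl get kafka my-cluster -n kafka \
  -o jsonpath='{.status.kafkaMetadataState}{"\n"}'

# Or block until the post-migration state is reached:
kubectl wait kafka/my-cluster -n kafka \
  --for=jsonpath='{.status.kafkaMetadataState}'=KRaftPostMigration \
  --timeout=30m
```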
Strimzi makes it easy to monitor the cluster properly by exposing metrics in several components, like the Kafka cluster, the Kafka exporter, and the operators themselves. Using ServiceMonitors/PodMonitors, these can be scraped by Prometheus and then visualized using the dashboards that Strimzi provides. This gives the user an easy overview of the health of the Kafka cluster during the migration.
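As a sketch, a PodMonitor along the lines of the Strimzi metrics examples might look as follows. It assumes metrics have been enabled on the Kafka resource via metricsConfig, and that the Prometheus Operator runs in a monitoring namespace; both are assumptions.

```sh
# Let Prometheus scrape the Strimzi-managed Kafka pods.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-resources-metrics
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - kafka
  selector:
    matchExpressions:
      - key: strimzi.io/kind
        operator: In
        values: ["Kafka"]
  podMetricsEndpoints:
    - path: /metrics
      port: tcp-prometheus   # port name used in the Strimzi metrics examples
EOF
```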
Kafka Migration Monitoring with Real-Time Application Metrics
To have an even better grasp on things during the migration, it is advisable to also have one or more Kafka applications running against the cluster. For example, at Axual there is a producer application running that measures the latency of produce requests to the Kafka cluster. This application exposes the latency, as well as any error counts, as metrics, and shows these in a dashboard. The application and the topic it produces to are set up in such a way that every single broker in the cluster is checked; this prevents a false "perfect health" status while a single broker is not functioning properly (although that should also become clear quite quickly through the other dashboards being monitored).
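The Axual application itself is not public, but a rough stand-in can be put together with stock Kafka tooling: a topic with one partition per broker, so that partition leaders are normally spread across all brokers, plus the producer performance test tool, which reports produce latencies. The topic name and bootstrap address below are illustrative.

```sh
# A topic with one partition per broker (3 brokers in this cluster).
kafka-topics.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 \
  --create --topic latency-check --partitions 3 --replication-factor 3

# A steady low-rate producer that prints latency metrics; run this
# throughout the migration and watch for latency spikes or errors.
kafka-producer-perf-test.sh --topic latency-check \
  --num-records 100000 --record-size 512 --throughput 100 \
  --producer-props bootstrap.servers=my-cluster-kafka-bootstrap:9092 \
  --print-metrics
```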
Another tool used at Axual is an application capable of quickly comparing two complete sets of ACLs from a Kafka cluster and indicating whether they are identical or, if not, what the differences are. This could prove useful, especially in a metadata migration scenario such as this.
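A simple approximation of such a comparison, using only the stock kafka-acls.sh tool; the bootstrap address and the admin client configuration file are assumptions.

```sh
# Snapshot the full ACL set before the migration...
kafka-acls.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 \
  --command-config admin.properties --list | sort > acls-before.txt

# ...and again afterwards, then compare the two snapshots.
kafka-acls.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 \
  --command-config admin.properties --list | sort > acls-after.txt

diff acls-before.txt acls-after.txt && echo "ACL sets are identical"
```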
Concluding phase one of the migration
Once the metadata migration is complete, the Kafka brokers are once again reconfigured and another rolling restart of the brokers is triggered. Most notably, this reconfiguration removes the Zookeeper-specific connection details, updates the ACL authorizer to the "StandardAuthorizer", and of course removes the migration flag.
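One way to verify these changes is to inspect the configuration Strimzi rendered for a broker pod. The pod name depends on your node pool names, and the path of the rendered properties file is an assumption that may differ per Strimzi version.

```sh
# Spot-check the regenerated broker configuration after the rolling restart.
kubectl exec -n kafka my-cluster-broker-0 -c kafka -- \
  grep -E 'zookeeper|authorizer.class.name|migration' /tmp/strimzi.properties
# Expected: no zookeeper.connect, no migration flag, and
# authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
```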
Point of no return
As mentioned before, entering the "KRaftPostMigration" phase allows the user to decide between finalizing and rolling back. This should be a conscious choice, since once the migration is finalized and all connections to Zookeeper are cut, there is no (easy) way back to the Zookeeper-managed metadata state. A possible check to see whether metadata management works as expected is to delete the KafkaUser used by the producer application. This should instantly update the Kafka metadata, and the application should start logging errors. Another check that can be done at this point is to compare the full list of ACLs before and after the deletion, which should show that the ACL on the producer topic has changed: the user should have been removed from it. When it has been checked and concluded that the Kafka cluster functions properly, the migration can be finalized. To do this, the same annotation is used with a different value: `strimzi.io/kraft="enabled"`. This triggers a restart of the controllers to get rid of the migration flag and the Zookeeper connection details, among other things. Once they have restarted, Strimzi takes care of removing the Zookeeper nodes. As this happens, the Kafka metadata state moves from "KRaftPostMigration" through "PreKRaft" to the "KRaft" state.
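In kubectl terms, the check and the finalization could look as follows; the KafkaUser name is illustrative.

```sh
# Verify that metadata updates propagate: delete the test producer's user
# (name is an assumption) and watch the application start logging
# authorization errors.
kubectl delete kafkauser latency-producer -n kafka

# When satisfied, finalize the migration; this is the point of no return.
kubectl annotate kafka my-cluster -n kafka strimzi.io/kraft="enabled" --overwrite
```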
Cleanup
Once the Kafka resource has reached the "KRaft" state, it may still show some warnings. These are probably related to the inter.broker.protocol.version that is still present, or to the Zookeeper section in the Kafka resource itself, which is of course no longer supported at this point. It is the user's task to clean these up, potentially triggering another rolling restart.
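One possible way to do this cleanup in a single step, assuming the protocol version was set under spec.kafka.config, is a JSON patch that removes both entries:

```sh
# Drop the now-unused protocol setting and the whole zookeeper section from
# the Kafka resource; expect this to trigger another rolling restart.
kubectl patch kafka my-cluster -n kafka --type=json -p='[
  {"op": "remove", "path": "/spec/kafka/config/inter.broker.protocol.version"},
  {"op": "remove", "path": "/spec/zookeeper"}
]'
```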
Conclusion
Migrating a Kafka cluster to KRaft using the Strimzi Operator is a task that is mostly automated and taken care of by the Operator itself. The process has been clearly documented in the Strimzi blogs mentioned before. Combined with this blog, that should give the user a rather complete overview of what happens in each step, the high-level changes, and some of the technical details.
On-demand webinar: migrating from Zookeeper to KRaft
If you’re curious to see a live demonstration of migrating a Kafka cluster from Zookeeper to KRaft, you can watch the on-demand webinar hosted by Daniel. Daniel walks through the entire process in this session, showcasing how the migration unfolds in real time. It’s an excellent opportunity to gain practical insights and see the Strimzi Operator in action as it simplifies this complex transition. Watch the webinar to deepen your understanding and gain confidence in managing your Kafka migrations!
Answers to your questions about Axual’s All-in-one Kafka Platform
Are you curious about our All-in-one Kafka platform? Dive into our FAQs for all the details you need, and find the answers to your burning questions.
What is the main difference between KRaft and Zookeeper?
The main difference is that KRaft uses the Kafka protocol itself for metadata management, storing metadata in a quorum of controllers instead of in an external system, which makes metadata handling significantly faster and more scalable than with Zookeeper.

What happens to Zookeeper after the migration?
When metadata management has been moved from Zookeeper to KRaft, Zookeeper is completely out of the picture. Moreover, starting with Kafka 4.0, Zookeeper support will be removed entirely.

What happens when Zookeeper goes down in a Zookeeper-based cluster?
When Zookeeper goes down, the metadata is no longer available to the Kafka brokers. Any metadata update request will fail, and your Kafka cluster will end up in a failed state.