About the Finnish TSO
This Finnish transmission system operator manages the national power grid, ensuring reliable electricity distribution across Finland. Operating critical infrastructure that powers millions of homes and businesses, the organization processes real-time data from thousands of grid sensors, smart meters, and energy trading systems. As part of the Nordic synchronized grid system, they maintain interconnections with neighboring countries while adhering to stringent ENTSO-E regulations.
Goals & context
The Finnish TSO operated a Confluent-based Kafka infrastructure processing millions of events daily from grid sensors, energy trading systems, and smart meters. While the platform itself was stable, vendor lock-in increasingly constrained architectural flexibility and created budgetary pressure. The organization sought to regain control by migrating to Strimzi, the open-source Kubernetes-native Kafka operator.
However, this Kafka infrastructure formed the backbone of national grid management. Any service interruption could destabilize energy distribution, impact industrial operations, and potentially cascade across the Nordic synchronized grid. The TSO's operational requirements were absolute: zero downtime, zero data loss, and sustained sub-millisecond latency for grid balancing operations.
Initial migration attempts using MirrorMaker 2 exposed a critical limitation. The tool created duplicate topics with modified naming conventions, breaking compatibility with existing streaming applications. The TSO faced an impossible choice: accept the operational risk of parallel topic structures during migration, or invest months rewriting 40+ streaming applications. For infrastructure where grid frequency deviations must be corrected within seconds, neither path was viable.
Strategic approach
- Hypothesis: If we leverage purpose-built replication technology designed for enterprise Kafka migrations, we can achieve zero-downtime cutover without application modifications
- Principles: Operational continuity over speed; data integrity over convenience; architectural simplicity over complex workarounds
- Operating Model: Phased migration with continuous validation, real-time lag monitoring, and instant rollback capability
- 100% uptime maintained
- 80 principals moved to Strimzi
- 6-week migration
Key initiative: Zero-downtime Kafka migration
Problem → Insight
The initial MirrorMaker 2 deployment exposed a critical architectural mismatch. While MirrorMaker 2's topic prefixing could be disabled to maintain original names, this configuration only supported one-way replication. Any data produced on the Strimzi cluster wouldn't flow back to Confluent, making gradual application migration impossible. The TSO would need to migrate all 40+ applications simultaneously or risk data inconsistency between clusters.
Enabling bidirectional replication meant accepting MirrorMaker 2's default behavior: prefixed topic names that clearly identify data origin. Applications consuming from "grid.frequency.readings" would need modification to also consume from "confluent.grid.frequency.readings" and "strimzi.grid.frequency.readings." This design makes perfect sense for geo-distributed deployments where applications choose nearby clusters and understand the replication topology. But for platform migration, it meant rewriting every streaming application to handle multiple topic names for the same logical data stream.
The insight was recognizing that MirrorMaker 2 solved a different problem than the TSO faced. Multi-cluster synchronization and platform migration have fundamentally different requirements. The former assumes applications understand and adapt to cluster topology; the latter requires complete transparency. The TSO needed replication technology that treated migration as a first-class use case, maintaining exact topic structures while enabling bidirectional data flow during the transition period.
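To make the application impact concrete, the sketch below shows what a consuming application would have faced under MirrorMaker 2's default prefixing. It is a minimal, illustrative Java consumer, not the TSO's code (the class name, bootstrap address, and group id are placeholders): instead of subscribing to the single topic it was written for, it has to subscribe to a pattern covering the local topic plus every cluster-prefixed replica, and then reason about which copy of the stream each record came from.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FrequencyReadingsConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "grid-balancing");          // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Before migration: one logical stream, one topic name.
            // consumer.subscribe(List.of("grid.frequency.readings"));

            // With MirrorMaker 2's default prefixing, the same logical stream also
            // appears under cluster-prefixed names, so the subscription (and any
            // downstream logic keyed on topic name) has to change:
            consumer.subscribe(Pattern.compile("(confluent\\.|strimzi\\.)?grid\\.frequency\\.readings"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("topic=%s partition=%d offset=%d%n",
                            record.topic(), record.partition(), record.offset());
                }
            }
        }
    }
}
```

Multiply that change, plus the deduplication logic it implies, across 40+ streaming applications and the scale of the rewrite becomes clear.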
Method & framework
The solution required implementing an Active-Active replication pattern that treated both clusters as equivalent peers rather than source and target. This meant preserving not just data but the complete Kafka ecosystem: exact topic names, partition counts, consumer group offsets, and even transactional IDs. The migration would follow a deliberate progression: first replicate all data streams bidirectionally, then validate data integrity through parallel processing, run shadow workloads to verify application behavior, execute the cutover during a controlled maintenance window, and finally decommission the Confluent cluster.
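The validation step in that progression can be pictured as a parity check between the two clusters. The following is a hedged sketch rather than the TSO's actual tooling: it uses the standard Kafka Admin API to confirm that every topic on the source cluster exists on the target with the same partition count (bootstrap addresses are placeholders).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicParityCheck {

    public static void main(String[] args) throws Exception {
        try (Admin confluent = Admin.create(adminProps("confluent-bootstrap:9092")); // placeholder
             Admin strimzi = Admin.create(adminProps("strimzi-bootstrap:9092"))) {   // placeholder

            Set<String> sourceTopics = confluent.listTopics().names().get();
            Set<String> targetTopics = strimzi.listTopics().names().get();

            // Topics that exist on the source but are missing on the target.
            Set<String> missing = new HashSet<>(sourceTopics);
            missing.removeAll(targetTopics);
            missing.forEach(t -> System.out.println("MISSING on Strimzi: " + t));

            // For topics present on both clusters, partition counts must match exactly.
            Set<String> common = new HashSet<>(sourceTopics);
            common.retainAll(targetTopics);

            Map<String, TopicDescription> src = confluent.describeTopics(common).allTopicNames().get();
            Map<String, TopicDescription> dst = strimzi.describeTopics(common).allTopicNames().get();

            for (String topic : common) {
                int srcPartitions = src.get(topic).partitions().size();
                int dstPartitions = dst.get(topic).partitions().size();
                if (srcPartitions != dstPartitions) {
                    System.out.printf("PARTITION MISMATCH %s: %d vs %d%n",
                            topic, srcPartitions, dstPartitions);
                }
            }
        }
    }

    private static Properties adminProps(String bootstrap) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        return props;
    }
}
```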
Execution
Axual Distributor replaced MirrorMaker 2 as the replication layer, bringing purpose-built migration capabilities designed for enterprise requirements. Unlike MirrorMaker's namespace separation approach, the Distributor maintained exact topic structures across both clusters without requiring application awareness of the replication topology. Every topic on Confluent existed identically on Strimzi: same name, same partitions, same replication factor. When consumer groups switched clusters, the Distributor calculated appropriate offset positions in the target cluster, ensuring applications resumed processing without gaps or duplicates.
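Axual does not publish the Distributor's internals here, so the following is only a sketch of the general idea behind cross-cluster offset translation, not Axual's implementation: take the timestamp of the last record a consumer group processed on the source cluster, then resume from the first offset at or after that timestamp on the target, accepting a few possible duplicates rather than any gap. Topic, group, and broker names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetTranslationSketch {

    /**
     * Given the timestamp of the last record committed on the source cluster,
     * find the first offset at or after that timestamp on the target cluster
     * for the same topic-partition. Resuming there trades a few possible
     * duplicates for a guarantee of no gaps.
     */
    public static long translateOffset(Consumer<byte[], byte[]> targetConsumer,
                                       TopicPartition partition,
                                       long lastCommittedTimestampMs) {
        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(partition, lastCommittedTimestampMs);

        Map<TopicPartition, OffsetAndTimestamp> result = targetConsumer.offsetsForTimes(query);
        OffsetAndTimestamp match = result.get(partition);

        // If no record on the target is newer than the timestamp, resume at the log end.
        if (match == null) {
            return targetConsumer.endOffsets(Set.of(partition)).get(partition);
        }
        return match.offset();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "strimzi-bootstrap:9092"); // placeholder
        props.put("group.id", "grid-balancing");                  // placeholder
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("grid.frequency.readings", 0);
            long resumeAt = translateOffset(consumer, tp, System.currentTimeMillis() - 60_000);
            System.out.println("Resume " + tp + " at offset " + resumeAt);
        }
    }
}
```

Because both clusters carried identical topic structures, this kind of lookup can be done per partition without any name mapping in between.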
The migration was not only a matter of technology but also of expertise. Axual provided 24/7 incident support throughout the transition, with Strimzi experts on standby for any issues. This combination of technology and human expertise proved critical when dealing with infrastructure where minutes of downtime could destabilize the national grid. The TSO's team could focus on validating application behavior while Axual handled the complexities of cross-cluster replication and Strimzi optimization.
The validation phase ran for several weeks, with both clusters processing identical workloads while the operations team verified message integrity and system behavior. Grid management applications remained connected to Confluent while test instances validated performance on Strimzi. The Axual team monitored replication lag, adjusted configurations for optimal throughput, and provided immediate response to any anomalies. This partnership approach meant the TSO never faced migration challenges alone, having both the technology and expertise to ensure success.
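Replication lag of the kind monitored during this phase can also be approximated from the outside. The sketch below is illustrative (placeholder addresses and topic, and lag expressed in messages rather than the seconds quoted in the results): it compares log-end offsets per partition across the two clusters, which is only meaningful because topic names and partition counts are identical on both.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class ReplicationLagCheck {

    public static void main(String[] args) throws Exception {
        String topic = "grid.frequency.readings"; // one latency-critical stream as an example

        try (Admin confluent = Admin.create(props("confluent-bootstrap:9092")); // placeholder
             Admin strimzi = Admin.create(props("strimzi-bootstrap:9092"))) {   // placeholder

            Map<TopicPartition, Long> sourceEnds = endOffsets(confluent, topic);
            Map<TopicPartition, Long> targetEnds = endOffsets(strimzi, topic);

            // With identical topic names and partition counts on both clusters,
            // lag per partition is simply the difference of the log-end offsets.
            sourceEnds.forEach((tp, srcEnd) -> {
                long lag = srcEnd - targetEnds.getOrDefault(tp, 0L);
                System.out.printf("%s lag=%d messages%n", tp, Math.max(lag, 0));
            });
        }
    }

    private static Map<TopicPartition, Long> endOffsets(Admin admin, String topic) throws Exception {
        TopicDescription description =
                admin.describeTopics(Set.of(topic)).allTopicNames().get().get(topic);

        Map<TopicPartition, OffsetSpec> request = new HashMap<>();
        description.partitions().forEach(p ->
                request.put(new TopicPartition(topic, p.partition()), OffsetSpec.latest()));

        Map<TopicPartition, Long> ends = new HashMap<>();
        ListOffsetsResult result = admin.listOffsets(request);
        for (TopicPartition tp : request.keySet()) {
            ends.put(tp, result.partitionResult(tp).get().offset());
        }
        return ends;
    }

    private static Properties props(String bootstrap) {
        Properties p = new Properties();
        p.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        return p;
    }
}
```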
Evidence
The migration's success was measurable in what didn't happen. Not a single application required code changes. Consumer groups switched clusters and continued processing without message gaps that would impact operations. The 40+ streaming applications managing everything from frequency regulation to cross-border energy trading continued operating without their development teams even knowing a migration occurred. Grid stability metrics remained within normal parameters throughout the transition. Transactional guarantees held, maintaining the exactly-once semantics critical for energy trading settlements where every message represents financial obligations between market participants.
Results
- System Availability: 100% uptime maintained for both clusters throughout migration
- Streaming Applications Migrated: 80 principals successfully moved to Strimzi
- Applications Requiring Code Changes: Zero (only client library updates for legacy systems)
- Replication Lag: Single-digit seconds for latency-critical workloads, minutes for batch processes
- Daily Message Volume: Millions of events processed continuously
- Migration Duration: 6 weeks total (1 month parallel running + 2-week extension)
- Operational Benefit: Broker version upgrades now manageable through Strimzi automation
Closing thoughts
Migrating mission-critical Kafka infrastructure shouldn't require accepting downtime or rewriting applications. While MirrorMaker 2 provides enterprise-grade replication for multi-cluster deployments, most enterprise applications aren't built to handle multiple topic namespaces for the same logical data stream. This TSO's experience highlights a common gap: the mismatch between tools designed for multi-cluster operations and applications built for single-cluster simplicity. The Axual Distributor bridges this gap by understanding that migration is a temporary state requiring different guarantees than permanent multi-cluster architectures. Combined with 24/7 expert support, what could have been months of application rewrites became a controlled, predictable operation completed in weeks.