About the Finnish TSO
This Finnish transmission system operator manages the national power grid, ensuring reliable electricity distribution across Finland. Operating critical infrastructure that powers millions of homes and businesses, the organization processes real-time data from thousands of grid sensors, smart meters, and energy trading systems. As part of the Nordic synchronized grid system, they maintain interconnections with neighboring countries while adhering to stringent ENTSO-E regulations.
Goals & context
The Finnish TSO operated a Confluent-based Kafka infrastructure processing millions of events daily from grid sensors, energy trading systems, and smart meters. While the platform itself was stable, vendor lock-in increasingly constrained architectural flexibility and created budgetary pressure. The organization sought to regain control by migrating to Strimzi, the open-source Kubernetes-native Kafka operator.
However, this Kafka infrastructure formed the backbone of national grid management. Any service interruption could destabilize energy distribution, impact industrial operations, and potentially cascade across the Nordic synchronized grid. The TSO's operational requirements were absolute: zero downtime, zero data loss, and sustained sub-millisecond latency for grid balancing operations.
Initial migration attempts using MirrorMaker 2 exposed a critical limitation. The tool created duplicate topics with modified naming conventions, breaking compatibility with existing streaming applications. The TSO faced an impossible choice: accept the operational risk of parallel topic structures during migration, or invest months rewriting 40+ streaming applications. For infrastructure where grid frequency deviations must be corrected within seconds, neither path was viable.
Strategic approach
- Hypothesis: If we leverage purpose-built replication technology designed for enterprise Kafka migrations, we can achieve zero-downtime cutover without application modifications
- Principles: Operational continuity over speed; data integrity over convenience; architectural simplicity over complex workarounds
- Operating Model: Phased migration with continuous validation, real-time lag monitoring, and instant rollback capability
- 100% uptime maintained
- 80 principals moved to Strimzi
- 6-week migration
Key initiative: Zero-downtime Kafka migration
Problem → Insight
The initial MirrorMaker 2 deployment exposed a critical architectural mismatch. While MirrorMaker 2's topic prefixing could be disabled to maintain original names, this configuration only supported one-way replication. Any data produced on the Strimzi cluster wouldn't flow back to Confluent, making gradual application migration impossible. The TSO would need to migrate all 40+ applications simultaneously or risk data inconsistency between clusters.
Enabling bidirectional replication meant accepting MirrorMaker 2's default behavior: prefixed topic names that clearly identify data origin. Applications consuming from "grid.frequency.readings" would need modification to also consume from "confluent.grid.frequency.readings" and "strimzi.grid.frequency.readings." This design makes perfect sense for geo-distributed deployments where applications choose nearby clusters and understand the replication topology. But for platform migration, it meant rewriting every streaming application to handle multiple topic names for the same logical data stream.
The insight was recognizing that MirrorMaker 2 solved a different problem than the TSO faced. Multi-cluster synchronization and platform migration have fundamentally different requirements. The former assumes applications understand and adapt to cluster topology; the latter requires complete transparency. The TSO needed replication technology that treated migration as a first-class use case, maintaining exact topic structures while enabling bidirectional data flow during the transition period.
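To make the application impact concrete, the sketch below shows what a consuming application would have faced under MirrorMaker 2's default prefixing. It is a minimal, illustrative Java consumer, not the TSO's code (the class name, bootstrap address, and group id are placeholders): instead of subscribing to the single topic it was written for, it has to subscribe to a pattern covering the local topic plus every cluster-prefixed replica, and then reason about which copy of the stream each record came from.

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FrequencyReadingsConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "grid-balancing");          // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Before migration: one logical stream, one topic name.
            // consumer.subscribe(List.of("grid.frequency.readings"));

            // With MirrorMaker 2's default prefixing, the same logical stream also
            // appears under cluster-prefixed names, so the subscription (and any
            // downstream logic keyed on topic name) has to change:
            consumer.subscribe(Pattern.compile("(confluent\\.|strimzi\\.)?grid\\.frequency\\.readings"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("topic=%s partition=%d offset=%d%n",
                            record.topic(), record.partition(), record.offset());
                }
            }
        }
    }
}
```

Multiply that change, plus the deduplication logic it implies, across 40+ streaming applications and the scale of the rewrite becomes clear.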
Method & framework
The solution required implementing an Active-Active replication pattern that treated both clusters as equivalent peers rather than source and target. This meant preserving not just data but the complete Kafka ecosystem: exact topic names, partition counts, consumer group offsets, and even transactional IDs. The migration would follow a deliberate progression: first replicate all data streams bidirectionally, then validate data integrity through parallel processing, run shadow workloads to verify application behavior, execute the cutover during a controlled maintenance window, and finally decommission the Confluent cluster.
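The validation step in that progression can be pictured as a parity check between the two clusters. The following is a hedged sketch rather than the TSO's actual tooling: it uses the standard Kafka Admin API to confirm that every topic on the source cluster exists on the target with the same partition count (bootstrap addresses are placeholders).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class TopicParityCheck {

    public static void main(String[] args) throws Exception {
        try (Admin confluent = Admin.create(adminProps("confluent-bootstrap:9092")); // placeholder
             Admin strimzi = Admin.create(adminProps("strimzi-bootstrap:9092"))) {   // placeholder

            Set<String> sourceTopics = confluent.listTopics().names().get();
            Set<String> targetTopics = strimzi.listTopics().names().get();

            // Topics that exist on the source but are missing on the target.
            Set<String> missing = new HashSet<>(sourceTopics);
            missing.removeAll(targetTopics);
            missing.forEach(t -> System.out.println("MISSING on Strimzi: " + t));

            // For topics present on both clusters, partition counts must match exactly.
            Set<String> common = new HashSet<>(sourceTopics);
            common.retainAll(targetTopics);

            Map<String, TopicDescription> src = confluent.describeTopics(common).allTopicNames().get();
            Map<String, TopicDescription> dst = strimzi.describeTopics(common).allTopicNames().get();

            for (String topic : common) {
                int srcPartitions = src.get(topic).partitions().size();
                int dstPartitions = dst.get(topic).partitions().size();
                if (srcPartitions != dstPartitions) {
                    System.out.printf("PARTITION MISMATCH %s: %d vs %d%n",
                            topic, srcPartitions, dstPartitions);
                }
            }
        }
    }

    private static Properties adminProps(String bootstrap) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        return props;
    }
}
```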
Execution
Axual Distributor replaced MirrorMaker 2 as the replication layer, bringing purpose-built migration capabilities designed for enterprise requirements. Unlike MirrorMaker's namespace separation approach, the Distributor maintained exact topic structures across both clusters without requiring application awareness of the replication topology. Every topic on Confluent existed identically on Strimzi: same name, same partitions, same replication factor. When consumer groups switched clusters, the Distributor calculated appropriate offset positions in the target cluster, ensuring applications resumed processing without gaps or duplicates.
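Axual does not publish the Distributor's internals here, so the following is only a sketch of the general idea behind cross-cluster offset translation, not Axual's implementation: take the timestamp of the last record a consumer group processed on the source cluster, then resume from the first offset at or after that timestamp on the target, accepting a few possible duplicates rather than any gap. Topic, group, and broker names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class OffsetTranslationSketch {

    /**
     * Given the timestamp of the last record committed on the source cluster,
     * find the first offset at or after that timestamp on the target cluster
     * for the same topic-partition. Resuming there trades a few possible
     * duplicates for a guarantee of no gaps.
     */
    public static long translateOffset(Consumer<byte[], byte[]> targetConsumer,
                                       TopicPartition partition,
                                       long lastCommittedTimestampMs) {
        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(partition, lastCommittedTimestampMs);

        Map<TopicPartition, OffsetAndTimestamp> result = targetConsumer.offsetsForTimes(query);
        OffsetAndTimestamp match = result.get(partition);

        // If no record on the target is newer than the timestamp, resume at the log end.
        if (match == null) {
            return targetConsumer.endOffsets(Set.of(partition)).get(partition);
        }
        return match.offset();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "strimzi-bootstrap:9092"); // placeholder
        props.put("group.id", "grid-balancing");                  // placeholder
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("grid.frequency.readings", 0);
            long resumeAt = translateOffset(consumer, tp, System.currentTimeMillis() - 60_000);
            System.out.println("Resume " + tp + " at offset " + resumeAt);
        }
    }
}
```

Because both clusters carried identical topic structures, this kind of lookup can be done per partition without any name mapping in between.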
The migration was not only a matter of technology but also of expertise. Axual provided 24/7 incident support throughout the transition, with Strimzi experts on standby for any issues. This combination of technology and human expertise proved critical when dealing with infrastructure where minutes of downtime could destabilize the national grid. The TSO's team could focus on validating application behavior while Axual handled the complexities of cross-cluster replication and Strimzi optimization.
The validation phase ran for several weeks, with both clusters processing identical workloads while the operations team verified message integrity and system behavior. Grid management applications remained connected to Confluent while test instances validated performance on Strimzi. The Axual team monitored replication lag, adjusted configurations for optimal throughput, and provided immediate response to any anomalies. This partnership approach meant the TSO never faced migration challenges alone, having both the technology and expertise to ensure success.
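Replication lag of the kind monitored during this phase can also be approximated from the outside. The sketch below is illustrative (placeholder addresses and topic, and lag expressed in messages rather than the seconds quoted in the results): it compares log-end offsets per partition across the two clusters, which is only meaningful because topic names and partition counts are identical on both.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class ReplicationLagCheck {

    public static void main(String[] args) throws Exception {
        String topic = "grid.frequency.readings"; // one latency-critical stream as an example

        try (Admin confluent = Admin.create(props("confluent-bootstrap:9092")); // placeholder
             Admin strimzi = Admin.create(props("strimzi-bootstrap:9092"))) {   // placeholder

            Map<TopicPartition, Long> sourceEnds = endOffsets(confluent, topic);
            Map<TopicPartition, Long> targetEnds = endOffsets(strimzi, topic);

            // With identical topic names and partition counts on both clusters,
            // lag per partition is simply the difference of the log-end offsets.
            sourceEnds.forEach((tp, srcEnd) -> {
                long lag = srcEnd - targetEnds.getOrDefault(tp, 0L);
                System.out.printf("%s lag=%d messages%n", tp, Math.max(lag, 0));
            });
        }
    }

    private static Map<TopicPartition, Long> endOffsets(Admin admin, String topic) throws Exception {
        TopicDescription description =
                admin.describeTopics(Set.of(topic)).allTopicNames().get().get(topic);

        Map<TopicPartition, OffsetSpec> request = new HashMap<>();
        description.partitions().forEach(p ->
                request.put(new TopicPartition(topic, p.partition()), OffsetSpec.latest()));

        Map<TopicPartition, Long> ends = new HashMap<>();
        ListOffsetsResult result = admin.listOffsets(request);
        for (TopicPartition tp : request.keySet()) {
            ends.put(tp, result.partitionResult(tp).get().offset());
        }
        return ends;
    }

    private static Properties props(String bootstrap) {
        Properties p = new Properties();
        p.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        return p;
    }
}
```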
Evidence
The migration's success was measurable in what didn't happen. Not a single application required code changes. Consumer groups switched clusters and continued processing without message gaps that would impact operations. The 40+ streaming applications managing everything from frequency regulation to cross-border energy trading continued operating without their development teams even knowing a migration occurred. Grid stability metrics remained within normal parameters throughout the transition. Transactional guarantees held, maintaining the exactly-once semantics critical for energy trading settlements where every message represents financial obligations between market participants.
Results
- System Availability: 100% uptime maintained for both clusters throughout migration
- Streaming Applications Migrated: 80 principals successfully moved to Strimzi
- Applications Requiring Code Changes: Zero (only client library updates for legacy systems)
- Replication Lag: Single-digit seconds for latency-critical workloads, minutes for batch processes
- Daily Message Volume: Millions of events processed continuously
- Migration Duration: 6 weeks total (1 month parallel running + 2-week extension)
- Operational Benefit: Broker version upgrades now manageable through Strimzi automation
Closing thoughts
Migrating mission-critical Kafka infrastructure shouldn't require accepting downtime or rewriting applications. While MirrorMaker 2 provides enterprise-grade replication for multi-cluster deployments, most enterprise applications aren't built to handle multiple topic namespaces for the same logical data stream. This TSO's experience highlights a common gap: the mismatch between tools designed for multi-cluster operations and applications built for single-cluster simplicity. The Axual Distributor bridges this gap by understanding that migration is a temporary state requiring different guarantees than permanent multi-cluster architectures. Combined with 24/7 expert support, what could have been months of application rewrites became a controlled, predictable operation completed in weeks.