Today I’d like to talk about one of the big names in open source, Apache Kafka, and tell you why the fact that it’s open source doesn’t mean that it’s free.
So what do I mean when I say that using Apache Kafka is not free, even when it’s open source?
Basically, open source in Kafka's case means that we're free to download the Kafka source code. We're allowed to make changes to it, and to compile and redistribute it. And finally, we can install and run Kafka anywhere we want, for any reason. We're not limited in any way from a license point of view. But a big portion of the cost of running a platform like Kafka comes after installation, when you're actually using it.
Kafka is easy to install and configure initially, but after that, we might need to optimize for additional use cases, or when the load on the cluster increases. We need to keep an eye on CPU, memory, and storage usage, and manage them in such a way that the throughput and latency of producers and consumers are not negatively affected.
Knowing what to watch in Kafka, and how to configure it for specific performance requirements, takes a lot of specialized knowledge. Kafka provides a large number of configuration options on both the server and the client side, and these options can affect each other.
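As one illustration of how server and client options interact, consider the maximum record size, which is governed by settings on the broker, the producer, and the consumer. The values below are illustrative, not recommendations:

```properties
# Broker (server.properties): largest record batch the broker will accept
message.max.bytes=5242880

# Producer: largest request the producer will send;
# raising the broker limit alone does nothing if this stays at its default
max.request.size=5242880

# Consumer: maximum data returned per partition per fetch;
# very old client versions can stall on records larger than this
max.partition.fetch.bytes=5242880
```

Change one of these without the others and you can end up with producers rejecting records the broker would accept, which is exactly the kind of cross-cutting knowledge a Kafka administrator needs.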
This is one of the prime reasons why it takes considerable time to learn to perform Kafka administration properly. So choosing an open-source solution like Kafka still requires hiring and training administrators.
Making new development tooling available
Another place where we might not have expected a cost is on the development side. By introducing a new platform into the infrastructure, we also need to think about making new development tooling available. Developers need environments where they can experiment, develop, and test their applications, and often multiple environments are needed for acceptance tests. Development teams might also need automated test tooling that can create and clean up resources in Kafka.
It takes time for teams to make all of this available to their developers.
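To give a feel for what such test tooling looks like, here is a minimal sketch of a helper that prepares the setup and teardown commands for a uniquely named test topic. The helper name and defaults are ours; the flags follow the standard `kafka-topics.sh` CLI:

```python
import uuid

def test_topic_commands(prefix="test", bootstrap="localhost:9092",
                        partitions=3, replication=1):
    """Build the CLI commands to create and later delete a test topic.

    A unique topic name per test run keeps parallel test suites from
    stepping on each other's data.
    """
    topic = f"{prefix}-{uuid.uuid4().hex[:8]}"  # unique name per test run
    create = (f"kafka-topics.sh --bootstrap-server {bootstrap} "
              f"--create --topic {topic} "
              f"--partitions {partitions} --replication-factor {replication}")
    delete = (f"kafka-topics.sh --bootstrap-server {bootstrap} "
              f"--delete --topic {topic}")
    return topic, create, delete

topic, create_cmd, delete_cmd = test_topic_commands()
print(create_cmd)
print(delete_cmd)
```

A test framework would run the create command in its setup phase and the delete command in teardown, so every test starts from a clean cluster state.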
Suboptimal performance of the Kafka cluster
One of the costs that is most difficult to determine comes from suboptimal performance of the Kafka cluster or of the connecting client applications. The many configuration options available on the server can have an unexpected impact on client applications in terms of produce and consume latency. Scaling can also be limited by certain consumer or topic configuration options. The operator of the application will often need to work with a Kafka administrator to determine the cause of the performance problems and find a way to fix them.
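One concrete example of a topic setting that limits scaling: within a consumer group, each partition is assigned to at most one consumer, so adding more consumers than the topic has partitions buys nothing. A minimal sketch of that rule (the function is ours, for illustration):

```python
def consumer_utilization(partitions: int, consumers: int):
    """Return (active, idle) consumer counts for one consumer group.

    Parallelism in a consumer group is capped by the partition count:
    each partition goes to at most one consumer, and any surplus
    consumers receive no assignment at all.
    """
    active = min(partitions, consumers)    # consumers actually receiving data
    idle = max(0, consumers - partitions)  # consumers sitting idle
    return active, idle

print(consumer_utilization(6, 4))   # -> (4, 0): all consumers busy
print(consumer_utilization(6, 10))  # -> (6, 4): four instances do nothing
```

This is why the partition count, chosen at topic creation time, quietly determines how far an application can scale later.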
The cost of cluster, topic and data ownership
And finally, one of the most underestimated costs is actually about the organization and the responsibilities of the parties involved. Usually the cluster is maintained by a dedicated team that monitors and controls the machines, storage, network, and the setup of the cluster.
But who is responsible for the creation and configuration of the topics?
Some configurations can have an impact on the cluster performance or even the availability of a topic. The application developers who need the topic might need to get the approval of the Kafka administrators before they can create the topic.
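To make the availability impact concrete: with producer `acks=all`, a partition only accepts writes while its number of in-sync replicas is at least `min.insync.replicas`. A topic created with `min.insync.replicas` equal to its replication factor therefore becomes unwritable the moment a single broker goes down. A rough sketch of that rule (the function name is ours, not a Kafka API):

```python
def accepts_writes(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """With acks=all, a write succeeds only if enough replicas are in sync."""
    return in_sync_replicas >= min_insync_replicas

# replication.factor=3, min.insync.replicas=3: one broker down blocks writes
print(accepts_writes(2, 3))  # False
# replication.factor=3, min.insync.replicas=2: one broker down is tolerated
print(accepts_writes(2, 2))  # True
```

A seemingly harmless per-topic setting like this is exactly why administrators often want to review topic configurations before they reach the cluster.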
There’s also the matter of data ownership.
This can be technical ownership: which schemas are used, and who is allowed to modify them.
But for certain kinds of data, there might be confidentiality rules that apply, and you need approval from a different department before producing and consuming data on the cluster.
A lot of time can be lost just determining who is responsible for what before actual development can even begin. But usually, this improves over time, as the parties involved determine new processes to enable faster development.
Now, I may have scared you with these points, but please take a good look at them.
Almost all of them can be prevented by thinking these issues through and determining how you want to approach these situations as a company.
And you can always look for a Kafka-based platform (like Axual) that takes care of these issues or allows a standardized development and operations approach out of the box.
If you would like to discuss your Kafka project, feel free to get in touch with our Kafka experts.
Download the Whitepaper
Answers to your questions about Axual’s All-in-one Kafka Platform
Are you curious about our All-in-one Kafka platform? Dive into our FAQs
for all the details you need, and find the answers to your burning questions.
While Apache Kafka is open source and can be freely downloaded and modified, the costs arise primarily after installation. Running Kafka requires ongoing administration, optimization, and monitoring to ensure performance and efficiency. This often necessitates hiring skilled administrators, implementing development tooling, and managing complex configurations, all of which contribute to operational costs.
Effective Kafka administration requires specialized knowledge in configuration management, performance monitoring, and troubleshooting. Understanding the various configuration options for both server and client sides is essential, as they can significantly impact throughput, latency, and overall cluster performance. Organizations may need to invest time and resources in hiring and training staff to ensure optimal operation of their Kafka environments.
Hidden costs can include the need for development environments, automated testing tools, and the management of topic and data ownership. These costs may stem from the complexity of coordinating between development teams and Kafka administrators, ensuring proper configurations, and navigating responsibilities related to data confidentiality and schema management. Establishing clear processes and utilizing Kafka-based platforms can help mitigate these costs and improve development efficiency.
Related blogs
Apache Kafka has become a central component of modern data architectures, enabling real-time data streaming and integration across distributed systems. Within Kafka’s ecosystem, Kafka Connect plays a crucial role as a powerful framework designed for seamlessly moving data between Kafka and external systems. Kafka Connect provides a standardized, scalable approach to data integration, removing the need for complex custom scripts or applications. For architects, product owners, and senior engineers, Kafka Connect is essential to understand because it simplifies data pipelines and supports low-latency, fault-tolerant data flow across platforms. But what exactly is Kafka Connect, and how can it benefit your architecture?
Apache Kafka is a powerful platform for handling real-time data streaming, often used in systems that follow the Publish-Subscribe (Pub-Sub) model. In Pub-Sub, producers send messages (data) that consumers receive, enabling asynchronous communication between services. Kafka’s Pub-Sub model is designed for high throughput, reliability, and scalability, making it a preferred choice for applications needing to process massive volumes of data efficiently. Central to this functionality are topics and partitions—essential elements that organize and distribute messages across Kafka. But what exactly are topics and partitions, and why are they so important?
Strimzi Kafka offers an efficient solution for deploying and managing Apache Kafka on Kubernetes, making it easier to handle Kafka clusters within a Kubernetes environment. In this article, we'll guide you through opening a shell on a Kafka broker pod in Kubernetes and listing all the topics in your Kafka cluster using an SSL-based connection.