By Jimmy Kusters


Kafka Monitoring: Using Metrics to monitor your Empire

Earlier articles cover what Kafka is in more detail, but suffice it to say that Apache Kafka is a “fault-tolerant event streaming platform”. It aims for reliability, scalability and high throughput, and keeps performing well while tolerating outages to a certain extent.

Being open source, the software itself can be used free of charge, and it is maintained by enthusiasts and professionals alike on a global scale.


Why should you want to monitor your environment?

Everyone wants to be ahead of the curve, and monitoring is how you get there. It makes sure you know that something is happening when it happens (or at least shortly after), rather than learning about it through a signal “from the outside”, which costs you credibility. It is safe to say that monitoring adds business value for the following reasons:

  • You spend less time checking for “optimum performance”. You set the boundaries at the building/testing stage, and set up alerts to notify you when those boundaries are exceeded in production. This gives you and your colleagues more time to create new value, instead of maintaining the status quo. Standing still equals going backwards.
  • You detect incidents when they occur, maybe even before there is customer impact, so that mitigation is in place, or components are even replaced, before customers notice.
  • You reduce the resolution time of an incident when (not “if”) it does occur. Knowing almost instantly which piece of the puzzle is the root cause will prevent, or at least drastically reduce, loss of reputation and revenue.

What should be monitored?

At its core, Kafka is all about input and output. Compare the mechanics to, say, a shipping company: you receive a parcel through transit, store it in a distribution center while keeping redundant copies to prevent loss, maybe fly it to a different country, store it again with redundant copies, make it available for pick-up, then transport it some more. As a result, monitoring looks much the same throughout “the stack”: it’s all about I/O speeds, queues and buffers. All the while you keep track of how far the various readers (i.e. the ones picking up the parcels) have got with opening the parcels they picked up at your distribution center, so you can quickly hand over the correct next parcel when they come back. Plus, to keep the internals in order, (copies of) old packages no one wants anymore have to be cleaned up once the agreed retention time has expired!

Every one of these components has to perform. For instance, if the first distribution center does not handle parcels quickly enough, it might run out of storage room, workers at the second distribution center will sit idle without generating revenue, and angry customers will be waiting for their promised parcels.

Translated into technical terms, entry-level monitoring covers all the components:

  • Resource-utilization of hosts running the technical platform (storage space, traffic on access roads and available workers)
  • Resource-utilization of components making up the functional cluster (are procedures followed and parcels handled correctly)
  • Cluster-statuses, -performance and data-availability on functional cluster (can customers come in to drop parcels or do they need to divert to another one, are distribution centers on par with the expected performance, are duplicates available)
  • Message-throughput, -distribution and metadata of topics inside the functional cluster (are the right parcels going to the right destination, at the right time)

Metrics provided

Here are some examples of metrics that are provided out of the box:

Resource-utilization platform

  • CPU-, Network-, Memory-utilization
  • Number of disk I/O operations, and time spent on them
  • Open file descriptors


Resource-utilization cluster-components

  • CPU-, Network-, Memory-utilization per component
  • Write-to-disk lag
  • Various message-, offset- and schema-distribution statistics


Cluster-statuses, -performance and data-availability

  • Message rate and size in/out
  • Expand/shrink occurrences in the cluster
  • Under-replicated partitions
  • Various message-, offset- and schema-distribution statistics

This can be extended further with a limited set of producer and consumer statistics as seen from the cluster, such as the “consumer lag”. Please note that the specific details of producers and consumers have to be monitored on those producers and consumers by yourself, in any way you see fit. It is advisable to do this in a similar way, so that you can correlate the information provided by the various sources.
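To make “consumer lag” a bit more tangible: per partition it is the distance between the newest offset written to the log and the offset a consumer group has committed. The sketch below only shows that arithmetic. The function name and the sample offsets are made up for illustration, and treating a partition without a committed offset as completely unread (counting from offset 0) is a simplifying assumption, not how any particular tool reports it.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Return the lag per partition: log end offset minus committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: no committed offset means nothing has been read yet,
        # so we count from offset 0.
        done = committed.get(partition, 0)
        # Clamp at zero so a stale end offset never yields a negative lag.
        lag[partition] = max(end - done, 0)
    return lag


# Hypothetical sample data: partition number -> offset
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200}

lag = consumer_lag(end_offsets, committed)
total = sum(lag.values())
print(lag, total)
```

In parcel terms: partition 0 has been fully picked up, partition 1 still has 280 parcels waiting, and nobody has shown up yet for partition 2. A steadily growing total is usually the signal worth alerting on, more so than any single snapshot.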


What is supplied by Axual?

In the Axual basic setup we provide a monitoring solution out of the box. It consists of multiple pieces of software that hook into the components in use, a time-series database (TSDB) named Prometheus, the accompanying Alertmanager module to trigger alerts, and a presentation/graphing tool called Grafana to distill raw data into sensible information.

Our deployment framework keeps track of the various components for you, and tells Prometheus to collect their performance metrics as soon as they’re started. If you don’t like it, you are free to replace it with your own solution!
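If you want to read the collected metrics yourself rather than through Grafana, Prometheus exposes an HTTP API with an instant-query endpoint at /api/v1/query. The sketch below builds such a query URL and parses the JSON that comes back. The host name, the metric name kafka_under_replicated_partitions and the instance labels are hypothetical: the actual metric names depend on how the exporters in your setup are configured. The sample response is hard-coded so the example runs without a Prometheus server.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def build_query_url(base_url: str, expr: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(payload: str) -> Dict[str, float]:
    """Map each 'instance' label to its sample value."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    values = {}
    for sample in body["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        # Each sample is a [timestamp, "value"] pair; the value is a string.
        _timestamp, value = sample["value"]
        values[instance] = float(value)
    return values


# Hypothetical host and metric name, for illustration only.
url = build_query_url("http://prometheus.example:9090/", "kafka_under_replicated_partitions")

# Hard-coded response in the shape Prometheus uses for instant vectors.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "broker-1:9404"}, "value": [1700000000, "0"]},
            {"metric": {"instance": "broker-2:9404"}, "value": [1700000000, "3"]},
        ],
    },
})

per_broker = parse_instant_vector(SAMPLE_RESPONSE)
# Brokers that report under-replicated partitions deserve a closer look.
unhealthy = [name for name, count in per_broker.items() if count > 0]
print(url)
print(unhealthy)
```

In a real setup you would fetch the URL with an HTTP client of your choice and feed the response body to the parser. For anything you want to be woken up for, an alerting rule evaluated by Prometheus itself is likely the better fit than polling from a script.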

