Technology 25 Jul 2022

Change Data Capture with PostgreSQL, Debezium and Axual Part 1 – Preparing PostgreSQL

Most companies rely heavily on the data stored in their databases. If that data isn’t stored correctly, business processes can be disrupted and you might make decisions based on outdated or incorrect information. This makes the data, and the changes made to it, an interesting source of events. But getting data from a database into a streaming platform has its challenges. This blog will explain how to prepare PostgreSQL for Change Data Capture with Debezium on the Axual Platform.

PostgreSQL, or Postgres, is an open-source relational database management system and one of the most popular databases in use. Its popularity comes from the fact that it can easily be set up on a single machine, but can also be scaled up for massive workloads and concurrent access by many users.
PostgreSQL can be found in companies big and small because of its features, its performance, and the fact that most cloud providers have a PostgreSQL Database as a Service (DBaaS) offering.
But getting data from PostgreSQL and loading it into the Axual Platform, or another Kafka-based streaming platform, can be challenging.
Change Data Capture is an approach that is often considered too complex, but it has some distinct advantages compared to other solutions.

Assessing the possible solutions

There are several solutions available to architects and developers to get data from a database into a streaming platform, each with its own advantages and disadvantages.

Solution 1 – Change Data Capture

Change Data Capture is a pattern where a process identifies and captures changes made to data. It then sends the captured changes to another process or system that can process them.
CDC solutions use the internal change tracking features of databases, allowing them to capture and produce record deletion events as well as record insertions and updates. Many databases also allow transactions and changes to table definitions to be captured.
Debezium is a CDC tool for Kafka based systems that can connect and process changes from a relatively wide range of databases and is packaged as a Kafka Connect source connector.
An advantage of using CDC is that it can capture inserts, updates and deletes of records, as well as tracking transactions and, in some cases, data model changes.
A disadvantage is that CDC tools often support only a limited number of database vendors, as specific internal knowledge is required to connect and use the internal change tracking features of a database.

Solution 2 – Kafka Connect with JDBC Source Connectors

Kafka Connect is of course one of the default solutions to investigate for getting data from an external system into Kafka and back.
JDBC, or Java Database Connectivity, is an API that allows applications to access a database in a generic way, without requiring knowledge of the native protocols used by the database.
Database vendors often provide JDBC Drivers, which translate the JDBC API calls to the database specific format to handle connections and queries.
There are JDBC Source connectors available for Kafka Connect that read from tables by regularly performing SQL queries to determine whether there are changes. These changes are then produced to the topics by Kafka Connect.
A major advantage of these connectors is the fact that they are generic. The solution works on most databases, because most database vendors supply JDBC Drivers.
A disadvantage is that the change capture capabilities are limited by the use of SQL statements. Tables need a special column to determine for each record if and when it was changed, and you cannot capture the deletion of a record in a table without additional logic in the database, such as triggers.
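As an illustration, the statement below is a minimal sketch of the kind of polling query a JDBC source connector could run; the orders table and its last_modified column are purely hypothetical.

  -- Hypothetical polling query for a JDBC source connector.
  -- Only rows changed since the previous poll are returned,
  -- based on a dedicated last_modified timestamp column.
  SELECT *
  FROM orders
  WHERE last_modified > :last_poll_time
  ORDER BY last_modified ASC;

A row that is deleted simply stops appearing in this result set, which is why the connector cannot emit a deletion event without extra help from the database.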

Solution 3 – Developing a custom application

A custom application can be developed that connects to the database, scans the schemas, tables and data inside the tables, and writes the results to one or more Kafka topics.
The greatest advantage of this solution is that, since the application is custom made, it can be optimised for a specific database and use any database-specific feature.
A disadvantage of developing a custom application is that it can become very complex and hard to maintain, as it is also very hard to design and implement such an application to be reusable and scalable.
Another disadvantage of this approach is that it is almost impossible to determine if a record was deleted from a table, unless the database offers special features for that.

Example use case with Change Data Capture

Most organisations have a process for reporting and handling incidents. In our use case the incident reports are created and updated by an application that stores the data in a PostgreSQL database.
Change Data Capture will be used to capture the creation, updates and deletion of reports in the database and publish them on the Axual Platform to allow other systems to consume and process the events.

PostgreSQL and Axual

The table IncidentReports from the Incidents schema will be read by the Debezium connector in Kafka Connect, which will publish any new, updated or deleted records to the appropriate topic on the Kafka cluster.
The incident report changes should go to the topic cdc-incident-reports.
The transaction metadata events that are captured from the database should be sent to the topic cdc-transactions.

PostgreSQL connected with Debezium and Axual
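To give an idea of what will appear on these topics, the snippet below is a simplified sketch of a Debezium change event for an update of an incident report; the exact structure depends on the connector configuration, and the values shown here are assumptions.

  {
    "before": { "id": 42, "status": "OPEN" },
    "after":  { "id": 42, "status": "CLOSED" },
    "source": { "schema": "Incidents", "table": "IncidentReports", "txId": 601 },
    "op": "u",
    "ts_ms": 1658750000000
  }

The cdc-transactions topic will receive matching begin and end markers for each database transaction, so consumers can correlate the change events with the transaction they belong to.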

Preparing the PostgreSQL database

The Debezium connector for PostgreSQL can capture changes in tables by using the PostgreSQL Logical Decoding feature. This feature makes it possible to extract data changes from the transaction log and process these changes with an output plugin.
PostgreSQL versions 10 and later already have a default plugin installed, called pgoutput.
An alternative approach is to use the decoderbufs or wal2json plugins.
The database administrator/operator should be involved in the selection and installation of these plugins, as they can have an operational impact.

For this guide the default pgoutput plugin will be used.

Prerequisites

The following resources are needed to prepare the database for the Change Data Capture example:

- A running PostgreSQL server, version 10 or later
- A SQL client; the examples in this guide use psql on the database server
- A user account with enough privileges to create roles, schemas and tables and to change ownership, for example a superuser account
- Access to the PostgreSQL configuration files postgresql.conf and pg_hba.conf

Creating the database schema and table

For these steps the SQL client is required. The example code will use the psql client on the database server.
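The statements below are a minimal sketch of creating the schema and table; the database name and the columns of IncidentReports are assumptions for this example and should be adapted to your own incident report data.

  -- Create the example database (the name is an assumption) and connect to it.
  CREATE DATABASE incidentdb;
  \c incidentdb

  -- Create the schema and the table that will be captured.
  -- The column layout is illustrative only.
  CREATE SCHEMA "Incidents";

  CREATE TABLE "Incidents"."IncidentReports" (
      id          SERIAL PRIMARY KEY,
      title       TEXT NOT NULL,
      description TEXT,
      status      TEXT NOT NULL DEFAULT 'OPEN',
      created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
      updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
  );

Note that the mixed-case schema and table names have to be quoted in PostgreSQL, otherwise they are folded to lower case.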

Creating the replication user and setting ownership

PostgreSQL has specific requirements for the grants, or permissions, needed by a user to perform replication the way Debezium uses it.
These grants are:

- The LOGIN and REPLICATION attributes, so the user can connect to the database and start logical replication
- The CREATE privilege on the database, so the connector can create the publication used by the pgoutput plugin
- Ownership of the captured tables, so they can be added to that publication and read during the initial snapshot

Most of these requirements can be met by simply granting the privileges to the user, but changing the ownership of a schema or table can disrupt other activities on the database. To prevent that, a group can be created that owns those resources, with both the new replication user and the original owner as members of that group. This ensures that there will be no unintended disruptions for other systems.
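A minimal sketch of these steps is shown below; the role names, the password and the name of the original owner (incident_app) are assumptions.

  -- Dedicated replication user for Debezium.
  CREATE ROLE debezium WITH LOGIN REPLICATION PASSWORD 'change-me';

  -- Allow the connector to create its publication in the database.
  GRANT CREATE ON DATABASE incidentdb TO debezium;

  -- Group role that will own the schema and table.
  CREATE ROLE incident_owners;
  GRANT incident_owners TO incident_app;  -- the original owner of the table
  GRANT incident_owners TO debezium;      -- the new replication user

  -- Transfer ownership of the schema and table to the group.
  ALTER SCHEMA "Incidents" OWNER TO incident_owners;
  ALTER TABLE "Incidents"."IncidentReports" OWNER TO incident_owners;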

Enabling Logical Decoding and allowing connections

Several changes are required to the PostgreSQL configuration files to enable the Logical Decoding feature and to allow replication connections.
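The snippets below sketch the typical changes; the network range, user name and database name are assumptions, and changing wal_level requires a restart of the PostgreSQL server.

  # postgresql.conf
  wal_level = logical              # enable the Logical Decoding feature
  max_wal_senders = 4              # enough WAL sender processes for replication
  max_replication_slots = 4        # enough replication slots for the connector

  # pg_hba.conf
  # Allow the replication user to connect to the database and to replicate.
  host    incidentdb      debezium    10.0.0.0/24    scram-sha-256
  host    replication     debezium    10.0.0.0/24    scram-sha-256

After the restart, running SHOW wal_level; in psql should return logical.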

Next steps

After completing the steps in this article the database is prepared for connections from Debezium.
The next article will focus on using Debezium for PostgreSQL in the Axual Platform.

