Error Handling in Kafka Producer
Kafka is one of the popular open-source data streaming tool that is used for multiple use-cases. The main idea is to implement the event-driven architecture to process and distribute the data in a scalable and real-time way. Since you are working with data, it is important to implement a consistent data integration, and from this point of view, I will mention how to tackle unexpected situations on streaming projects like invalid data, exceptions etc
Recap for Kafka Producer
Kafka producer is an API that allow to send data to Kafka. Producer writes data to the Kafka and Kafka holds this message for some time. Please see how publisher writes data to messaging layer;
As you see, there are three components that are used in the flow;
- Source system
- Kafka producer
- Messaging Layer ( Kafka Broker)
The following code represents how to publish data to Kafka broker via java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("acks", "all");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 100; i++)
producer.send(new ProducerRecord<String, String>("my-topic", Integer.toString(i), Integer.toString(i)));
producer.close();
Ref : KafkaProducer
Let’s deep dive how to tackle unexpected situations in Kafka Producer
ERROR HANDLING IN PRODUCER
Error handling is very important topic when it comes to play data accuracy. Let’s list down all the problem that you can face while publishing your data to Kafka broker
- Invalid (bad) data is received from source system
- Kafka broker is not available while publishing your data
- Kafka Producer could be interrupted
- Kafka Topic couldn’t be found
- Any exception can be happened during the data processing
For every item, I will deep dive to the problem and explain how you can design your implementation
Invalid (bad) data
The data is coming with wrong format/content what producer is expected. In that case, the recommendation would be to skip this data, but it would be better to log a specific queue to be reported afterwards
In that case, source team can identify what the issue is happening for specific records, hence they can fix the problem and regenerate the records
Producer Interrupted
The producer allows to send data to the Kafka broker. If any issue happens, the data will be loosed which is in producer memory. If your producer skip the record and won’t re-send same data to Kafka Broker, there would be a data inconsistency
In order to make sure what part of data is published to Kafka, you can put a flag for successfully published data
When you restart the Kafka producer, you can see what data couldn’t be published
Invalid (bad) data after processing data
The data is coming in a correct format but due to dependencies, some of data is not created in a right way. For example, you need to enrich geo location based on IP and service is not responding for some of IPs.In that case, you can publish this event to “retry queue” to be reprocessed afterwards
Kafka broker is not available
Producer is trying to publish data to Kafka but the connection to broker is not established because broker is not available. In that case, producer should retry to publish data within some time intervals.
Another important point is that your monitoring system should send an alert in order to make Kafka broker available. Hence, DevOps team can restart the Kafka broker or find the real issue to keep it available
Kafka Topic couldn’t be found
Kafka records are stored in the topics. If the topic is not found, producer gets “topic not found” error. The use case is very similar with the previous item. Hence you can take a similar action in terms of retrying and sending an alert.
Conclusion
As you see, you can face different issues in the production environment which reduce the data accuracy. You need to tackle this kind of issues, for sure, you need to put more effort in terms of implementation but all of this additional implementations and efforts accelerate your SRE cost in the production environment.