Spring Boot HandBook

    Kafka Architecture: A Comprehensive Overview

    Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, and scalable data pipelines. Its robust architecture is the foundation of its ability to handle millions of messages per second with low latency. In this blog, we’ll dive deep into the key components and concepts of Kafka’s architecture.

    Understanding Kafka Architecture#

    From the below image, you can see that there are multiple producers and consumers, typically running as microservices: producers produce data, and consumers consume it when it is available. The Kafka cluster is the middleman that takes data from the producers, and consumers take that data from Kafka. It is important to understand the Kafka architecture to see how data produced by a producer reaches a consumer. So first, let's understand the core components involved in the Kafka architecture.

    Kafka Architecture

    Key components of Kafka Architecture#

    Kafka's architecture is built around the following core components:

    1. Topics#

    • Kafka organizes messages into topics, which are categories or feeds where data is stored.
    • Producers send data to topics, and consumers read data from them.
    • Topics are divided into partitions, allowing Kafka to scale horizontally.

    2. Partitions#

    • Each topic is split into partitions, which are distributed across Kafka brokers.
    • Partitions enable parallelism, as multiple consumers can process data from different partitions simultaneously.
    • Messages in a partition are ordered and identified by an offset, a unique sequence number.

    3. Producers#

    • Producers are clients or applications that publish data to Kafka topics.
    • They determine the partition for each message using a partitioning strategy (e.g., round-robin or based on a key).
    • Producers can choose between different acknowledgment levels for message delivery (e.g., no acknowledgment, leader acknowledgment, or acknowledgment from all replicas).

    4. Consumers#

    • Consumers are applications or services that subscribe to topics to process data.
    • They are part of a consumer group, where each group collectively reads data from a topic.
    • Kafka ensures that each partition is assigned to only one consumer in a group, enabling parallel consumption.

    5. Brokers#

    • Kafka brokers are the servers that store data and manage client requests.
    • Each broker handles a portion of the topic partitions and ensures data replication and reliability.
    • Brokers communicate with each other to maintain the cluster's state and balance workloads.

    6. ZooKeeper / Kafka Raft (KRaft)#

    • Historically, Kafka used ZooKeeper for cluster coordination, leader election, and metadata management.
    • Kafka is transitioning to Kafka Raft (KRaft), a built-in consensus protocol that removes the dependency on ZooKeeper while improving scalability and fault tolerance.

    7. Replication#

    • To ensure fault tolerance, Kafka replicates data across brokers.
    • Each partition has one leader and multiple followers. Producers and consumers interact with the leader, while followers replicate the data for redundancy.
    • If a leader fails, one of the followers is promoted to leader.

    Kafka Cluster

    Understanding the flow#

    From the below image you can see that Kafka topics are subdivided into partitions, which are the basic unit of parallelism in Kafka. Each partition is an ordered, immutable sequence of records.

    Kafka Broker

    Partitions allow for scalability, as data can be distributed across multiple brokers, enabling horizontal scaling and concurrent consumption by multiple consumers.

    Let’s understand this with an example:

    Imagine a topic named “user-created-event” with 3 partitions (P0, P1, P2) and 2 brokers:

    1. Partition Distribution: Broker 1 hosts P0 and P1; Broker 2 hosts P2.
    2. Adding a Broker: After adding one more broker, Broker 1 hosts P0, Broker 2 hosts P1, and Broker 3 hosts P2. This spreads the workload across all brokers, enabling the cluster to handle more data.
    3. Concurrent Consumption: C1 consumes P0, C2 consumes P1, and C3 consumes P2. The consumers process data concurrently, improving throughput.

    By increasing partitions and brokers, Kafka can handle larger data volumes without a single broker becoming a bottleneck. More partitions also allow more consumers in a group to work simultaneously, improving processing speed and reducing lag.
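
    To make the example concrete, here is a minimal sketch of creating such a topic with Kafka's Java AdminClient. The partition count, replication factor, and broker address are illustrative assumptions, not fixed requirements:

    ```java
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.List;
    import java.util.Properties;

    public class CreateTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions (P0, P1, P2); replication factor 2 requires at least 2 brokers
                NewTopic topic = new NewTopic("user-created-event", 3, (short) 2);
                admin.createTopics(List.of(topic)).all().get(); // block until the cluster creates it
            }
        }
    }
    ```

    Kafka then distributes the three partitions across the available brokers, as in the distribution described above.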

    Understanding the production of a message/event to a topic by the producer#

    When the producer produces data for a particular topic, the data is stored within a partition, and a unique, incremental offset starting from zero is assigned to it within that partition. Since each topic has many partitions, at which partition will the data be stored? Basically, Kafka assigns data to a partition in a round-robin fashion or based on a key.
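
    A minimal producer sketch showing both strategies follows; the topic name, key, and broker address are assumptions for illustration:

    ```java
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ProducerPartitioningSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // No key: the default partitioner spreads records across partitions
                producer.send(new ProducerRecord<>("user-created-event", "event without a key"));

                // With a key: records with the same key always land in the same partition,
                // so all events for "user-42" keep their relative order
                producer.send(new ProducerRecord<>("user-created-event", "user-42", "user created"));
            }
        }
    }
    ```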

    The producer doesn’t just send data to a partition on a broker; it also takes acknowledgments from the broker confirming that the broker has received the produced data.

    The acks configuration determines the number of acknowledgments the producer requires from the broker before considering a request complete. It has three possible settings: 0, 1, and all (-1).

    1. 0: The producer does not wait for any acknowledgment from the broker. This is the fastest option, but it offers no guarantee that the message has been received or written to the broker.
    2. 1: The producer will receive an acknowledgment as soon as the leader broker has received the message.
    3. All (-1): The producer will wait for acknowledgments from all in-sync replicas (ISRs) before considering the message sent.

    If a send request fails (for example, due to a temporary network issue or a broker being down), the producer will automatically retry sending the message up to the number specified by the retries configuration. This is useful for ensuring that transient issues do not result in message loss.

    When combined with acks=all, the producer can retry sending messages that have not been acknowledged by all replicas. This ensures that even if there are issues with one of the brokers, the message can still be sent to the other available brokers.
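
    Putting acks and retries together, here is a hedged sketch of a producer configured for stronger delivery guarantees; the retry count and other values are illustrative:

    ```java
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class ReliableProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
            props.put(ProducerConfig.RETRIES_CONFIG, 3);  // retry transient failures automatically

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("user-created-event", "user-42", "user created"),
                        (metadata, exception) -> {
                            if (exception != null) {
                                System.err.println("Send failed after retries: " + exception);
                            } else {
                                System.out.printf("Acked: partition %d, offset %d%n",
                                        metadata.partition(), metadata.offset());
                            }
                        });
            } // close() flushes any pending sends
        }
    }
    ```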

    Understanding Partition Offset#

    Data within a partition is assigned a unique, incremental offset to track the order of messages.

    An offset is a unique identifier assigned to each record within a partition. Offsets are sequential integers that mark the position of a message in a partition. Each record within a partition has a unique offset, starting from 0.

    Consumers can specify where they want to start reading by providing an offset (e.g., the most recent offset, or offset 0 to read from the beginning).

    The producer always writes new data at the end of the partition, and the offset keeps incrementing; it is never reset.
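
    As a small sketch, a consumer can attach to a specific partition and start reading from offset 0; the topic, partition number, and broker address below are assumptions:

    ```java
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class SeekToOffsetSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition p0 = new TopicPartition("user-created-event", 0);
                consumer.assign(List.of(p0)); // manual assignment, no consumer group needed
                consumer.seek(p0, 0);         // offset 0 = the beginning; seekToEnd(...) jumps to the latest

                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }
    ```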

    Kafka Topic Partition

    Understanding the consumption of messages/events by consumers#

    Unlike other pub/sub implementations, Kafka doesn’t push messages to consumers. Instead, consumers have to pull messages off Kafka topic partitions. A consumer connects to a partition on a broker and reads the messages in the order in which they were written by the producer.

    By remembering the offset of the last consumed message for each partition, a consumer can join a partition at the point in time they choose and resume from there. That is particularly useful for a consumer to resume reading after recovering from a crash.

    Partition Offset

    But this may create a problem where multiple consumer instances of the same type read the same records of a Kafka topic. To avoid this, Kafka has a concept called Consumer Groups.

    Kafka Consumer Groups

    The consumer group concept ensures that a message is only ever read by a single consumer in a group. When a consumer group consumes the partitions of a topic, Kafka makes sure that each partition is consumed by exactly one consumer in the group.

    The maximum parallelism of a group is equal to the number of partitions of that topic. The number of consumers doesn’t govern the degree of parallelism of a topic; the number of partitions does.
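
    As a sketch, running three instances of the consumer below (same assumed group.id) against the 3-partition "user-created-event" topic gives each instance one partition; a fourth instance would sit idle:

    ```java
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class GroupConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // assumed address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "user-created-event-processors"); // assumed group name
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("user-created-event")); // Kafka assigns partitions within the group
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value()));
                }
            }
        }
    }
    ```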

    Rebalancing of Partitions among consumers#

    Rebalancing is the re-assignment of partition ownership among consumers within a given consumer group, ensuring that every consumer in the group is assigned one or more partitions. Rebalancing occurs in the following scenarios:

    • A new consumer joins the consumer group.
    • An existing consumer goes down.
    • New partitions are added.
    • An existing consumer is deemed dead by the Group Coordinator.

    The first consumer that joins a consumer group is referred to as the Group Leader of that consumer group.
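
    If a consumer needs to react to rebalancing (for example, to commit offsets before a partition is taken away), the consumer API accepts a ConsumerRebalanceListener at subscription time. A minimal sketch, assuming a consumer configured as in the earlier group example:

    ```java
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.util.Collection;
    import java.util.List;

    public class RebalanceListenerSketch {
        // 'consumer' is assumed to be configured as in the group-consumer sketch above
        static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
            consumer.subscribe(List.of("user-created-event"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // called before ownership moves away; commit offsets here if committing manually
                    System.out.println("Revoked: " + partitions);
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // called after the rebalance, with this consumer's new partitions
                    System.out.println("Assigned: " + partitions);
                }
            });
        }
    }
    ```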

    Understanding Partition Replication#

    Replication means making a copy of a partition available on another broker. Replication enables Kafka to be fault tolerant. When a partition of a topic is available on multiple brokers, one of them is elected as the leader, and the rest of the replicas are followers.

    Partition Replication

    Kafka Broker Management#

    Different brokers store partition replicas, with one of the brokers being the designated leader for each partition. A controlling mechanism that keeps track of the state of topics, partitions, and brokers is needed for such an architecture to work.

    Kafka now uses the Raft consensus protocol to manage the metadata and the state of the Kafka brokers.

    In KRaft mode, brokers coordinate directly with each other to elect a leader and replicate state across the cluster, eliminating the need for a separate system like ZooKeeper.

    Kafka Log Retention#

    Kafka's log retention controls how long messages are kept in a topic. Messages are stored on disk, and Kafka can retain them based on time (retention.ms) or size (retention.bytes), whichever comes first. Once the retention limit is reached, old messages are deleted in the background. Kafka can also be configured to retain messages indefinitely (retention.ms=-1), but this could lead to high storage costs if not managed carefully.
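
    As an illustrative sketch, retention can also be changed per topic at runtime with the AdminClient; the topic name and the 7-day value are assumptions:

    ```java
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class RetentionConfigSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "user-created-event");
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "604800000"), // 7 days in milliseconds
                        AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
            }
        }
    }
    ```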

    Conclusion#

    Kafka's architecture is a masterclass in distributed systems design. Its ability to handle massive volumes of data with low latency and high fault tolerance makes it the backbone of many real-time and batch processing systems. By understanding Kafka’s architecture, you can harness its full potential to build scalable and resilient data pipelines.

    Last updated on Jan 14, 2025