Setting Up a Self-Managed Kafka Cluster with KRaft
Modern Kafka Deployment without ZooKeeper
Kafka KRaft (Kafka Raft) mode removes the dependency on ZooKeeper, simplifying the architecture by handling metadata management within Kafka itself. This guide covers a 3-node cluster setup.
Why Kafka is Moving Away from ZooKeeper
For over a decade, Apache Kafka relied on Apache ZooKeeper for critical cluster coordination tasks. However, this dependency introduced several operational challenges that the KRaft architecture addresses:
Operational Complexity
- Required managing a separate distributed system alongside Kafka
- Additional deployment, monitoring, and maintenance overhead
- ZooKeeper-specific expertise needed for troubleshooting
Scalability Limitations
- ZooKeeper struggles with millions of partitions
- Metadata operations become bottlenecks at scale
- Controller failover could take minutes in large clusters
Security & ACL Management
- Dual security configuration: both Kafka and ZooKeeper
- Inconsistent access control models
- Increased attack surface area
Performance Overhead
- Network round-trips between Kafka and ZooKeeper
- Metadata propagation delays across two systems
- Limited throughput for metadata updates
KRaft Architecture & Controller Quorum
KRaft (Kafka Raft) is Kafka's native consensus protocol implementation. It eliminates the external ZooKeeper dependency and brings metadata management into Kafka itself.
Core Architecture Components
Controller Quorum
A dedicated set of nodes that form a Raft consensus group to manage cluster metadata. The quorum uses a leader-based replication model:
- Active Controller (Leader): Handles all metadata writes (topic creation, partition assignment, configuration changes)
- Standby Controllers (Followers): Replicate the metadata log and can take over instantly if the leader fails
- Quorum Size: Typically 3 or 5 nodes (majority required for consensus)
- Metadata Log: Stored as a special internal topic, __cluster_metadata
Process Roles
Controller Only
Dedicated metadata management. Ideal for large clusters requiring high metadata throughput.
Broker Only
Handles data plane operations (producing, consuming, replication). Most nodes in production.
Combined (Broker + Controller)
Dual role mode. Suitable for smaller clusters (3-5 nodes). This guide uses combined mode.
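In server.properties, the role of a node is selected with the process.roles setting; the combined form shown last is what the setup steps later in this guide use:

# Controller-only node (dedicated metadata management)
process.roles=controller
# Broker-only node (data plane only)
process.roles=broker
# Combined mode (used in this guide)
process.roles=broker,controller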
How Controller Election Works
- Controllers participate in Raft leader election
- One controller becomes the active leader (requires majority votes)
- All metadata mutations go through the leader
- Leader replicates changes to follower controllers via Raft protocol
- Changes are committed once acknowledged by majority (quorum)
- If leader fails, remaining controllers elect a new leader within seconds
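On a running cluster you can check which controller currently holds leadership with the quorum tool that ships with Kafka 3.3+. A quick sketch, using the bootstrap address from the setup steps later in this guide:

# Show the active controller (LeaderId), leader epoch, and current voters
/opt/kafka/bin/kafka-metadata-quorum.sh --bootstrap-server 192.168.0.111:9092 describe --status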
Operational Benefits of KRaft
Faster Controller Failover
Failover time reduced from minutes to seconds. In ZooKeeper-based clusters, controller failover required full metadata reload. KRaft controllers maintain hot standbys.
Improved Scalability
Support for millions of partitions. Metadata operations are now event-driven and stored in a compacted log, enabling linear scaling.
Simplified Operations
Single system to deploy, monitor, and maintain. No separate ZooKeeper ensemble. Unified security model and configuration.
Better Metadata Propagation
Brokers replicate metadata directly from the __cluster_metadata log instead of waiting for the controller to push updates to each broker, resulting in near-instant, consistent propagation.
Deterministic Metadata State
Metadata is stored in an immutable, append-only log with snapshots. Makes auditing, debugging, and disaster recovery significantly easier.
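Because the metadata is an ordinary log on disk, it can be inspected directly with the log dump tool and its cluster-metadata decoder. A sketch, assuming the log.dirs path configured later in this guide (the segment file name will differ on your nodes):

# Decode and print records from the metadata log segment
/opt/kafka/bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/data/__cluster_metadata-0/00000000000000000000.log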
Limitations & Considerations
- Migration from ZooKeeper Requires Planning: In-place ZooKeeper-to-KRaft migration only became generally available in Kafka 3.6; on earlier versions you must stand up a new KRaft cluster and migrate topics and data.
- Minimum Kafka Version: KRaft became production-ready in Kafka 3.3+. Kafka 3.5+ is recommended for stability.
- Controller Resource Requirements: In combined mode, ensure nodes have sufficient CPU/memory for both controller and broker workloads.
- Quorum Size Planning: Use odd numbers (3, 5, 7). A 3-node quorum tolerates 1 failure; a 5-node quorum tolerates 2 failures.
- Tooling Compatibility: Ensure third-party tools and monitoring systems support KRaft mode (many older tools assume ZooKeeper).
Production Best Practices
Architecture Planning
- Small Clusters (3-5 nodes): Use combined mode (broker + controller) for simplicity
- Large Clusters (10+ nodes): Separate controller nodes from broker nodes for isolation and performance
- Dedicated Controllers: 3 controller-only nodes + N broker-only nodes for production-grade clusters
Security Hardening
- Enable TLS encryption for controller and broker listeners
- Use SASL authentication (SCRAM-SHA-512 recommended)
- Implement network segmentation: Keep controller quorum on a private network
- Enable authorization with ACLs for production workloads
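A minimal sketch of what the broker-listener part of this hardening could look like in server.properties. All paths, passwords, and principals are placeholders, and securing the CONTROLLER listener plus bootstrapping SCRAM credentials involve additional steps not shown here:

# Illustrative fragment only -- not a complete security configuration
listeners=SASL_SSL://192.168.0.111:9092,CONTROLLER://192.168.0.111:9093
advertised.listeners=SASL_SSL://192.168.0.111:9092
inter.broker.listener.name=SASL_SSL
sasl.enabled.mechanisms=SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
ssl.keystore.location=/etc/kafka/ssl/k1.keystore.jks    # placeholder path
ssl.keystore.password=changeit                          # placeholder secret
ssl.truststore.location=/etc/kafka/ssl/truststore.jks   # placeholder path
ssl.truststore.password=changeit                        # placeholder secret
# KRaft clusters use StandardAuthorizer instead of the ZooKeeper-backed AclAuthorizer
authorizer.class.name=org.apache.kafka.metadata.authorizer.StandardAuthorizer
super.users=User:admin                                  # placeholder principal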
Monitoring & Observability
- Monitor kafka.controller:type=KafkaController,name=ActiveControllerCount (should be 1)
- Track kafka.server:type=raft-metrics for quorum health and lag (see the CLI example after this list)
- Alert on controller election frequency (frequent elections indicate instability)
- Monitor __cluster_metadata topic size and compaction
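Raft follower lag can also be checked from the command line, which complements the JMX metrics above for ad-hoc troubleshooting:

# Show each voter/observer with its log end offset and lag behind the leader
/opt/kafka/bin/kafka-metadata-quorum.sh --bootstrap-server 192.168.0.111:9092 describe --replication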
Storage & Performance Tuning
- Use SSD storage for metadata log directories (log.dirs)
- Configure metadata.log.max.record.batch.size.bytes based on cluster size
- Set metadata.log.segment.ms and metadata.log.segment.bytes for snapshot management
- Tune controller.quorum.fetch.timeout.ms and controller.quorum.request.timeout.ms for network conditions (an illustrative sketch follows this list)
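As a rough illustration, these knobs live in the same server.properties used throughout this guide. The values below are placeholders, not recommendations; start from the defaults and change them only after measuring your own metadata workload:

# Illustrative values only -- tune for your environment
log.dirs=/var/lib/kafka/data                        # place on SSD
metadata.log.segment.bytes=1073741824               # example: 1 GiB segments
metadata.log.segment.ms=604800000                   # example: roll at least weekly
metadata.log.max.record.batch.size.bytes=8388608    # example: 8 MiB batches
controller.quorum.fetch.timeout.ms=2000             # example: raise on high-latency links
controller.quorum.request.timeout.ms=2000           # example: raise on high-latency links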
Disaster Recovery
- Regularly back up metadata snapshots from the __cluster_metadata-0 directory (see the sketch after this list)
- Test controller failover scenarios in staging environments
- Document the Cluster ID securely (required for disaster recovery)
- Plan for quorum recovery (losing a majority requires manual intervention)
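A minimal backup sketch, assuming the log.dirs path used in this guide and a hypothetical /backup target. Take the copy while the node is stopped (or from a filesystem snapshot) so the files are consistent, and keep the Cluster ID with the archive:

# Record the cluster ID (also stored in meta.properties)
grep cluster.id /var/lib/kafka/data/meta.properties
# Archive the metadata log directory for this node
sudo tar -czf /backup/k1-cluster-metadata-$(date +%F).tgz \
  -C /var/lib/kafka/data __cluster_metadata-0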
Quick Decision Guide
Use KRaft When:
- Starting a new Kafka deployment
- Need to scale beyond 100k partitions
- Want simplified operations
- Require fast failover (<10 seconds)
Stick with ZooKeeper When:
- Running Kafka < 3.3
- Using tools incompatible with KRaft
- Have existing ZooKeeper-based clusters and no tested migration plan yet
Step-by-Step Setup Guide
Cluster IP Assignments
- 192.168.0.111 - k1
- 192.168.0.112 - k2
- 192.168.0.113 - k3
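If these hosts are not already resolvable by name, a simple sketch is to add them to /etc/hosts on every node (skip this if you manage DNS centrally):

# Append to /etc/hosts on k1, k2, and k3
192.168.0.111 k1
192.168.0.112 k2
192.168.0.113 k3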
Step 1: Download and Extract Kafka
Download the latest Kafka binaries from the official website and extract them to /opt.
# Download Kafka (Scala 2.13 version)
wget https://downloads.apache.org/kafka/3.9.0/kafka_2.13-3.9.0.tgz

# Extract to /opt
sudo tar -xvzf kafka_2.13-3.9.0.tgz -C /opt
sudo ln -s /opt/kafka_2.13-3.9.0 /opt/kafka

# Create dedicated user
sudo adduser kafka
sudo chown -R kafka:kafka /opt/kafka_2.13-3.9.0
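Kafka needs a Java runtime; Java 11 or 17 are safe choices for Kafka 3.9. A quick check, assuming a Debian/Ubuntu host (adjust the package manager and package name for your distribution):

# Verify a JRE/JDK is present
java -version
# Install one if missing (Debian/Ubuntu example)
sudo apt-get update && sudo apt-get install -y openjdk-17-jre-headless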
Step 2: Configure KRaft (server.properties)
Each node needs a unique node.id and the controller.quorum.voters string must match across all nodes.
On k1 (192.168.0.111):
# Edit /opt/kafka/config/kraft/server.properties
process.roles=broker,controller
node.id=1

# Listeners
listeners=PLAINTEXT://192.168.0.111:9092,CONTROLLER://192.168.0.111:9093
inter.broker.listener.name=PLAINTEXT
advertised.listeners=PLAINTEXT://192.168.0.111:9092
controller.listener.names=CONTROLLER

# Controller Quorum Voters
controller.quorum.voters=1@192.168.0.111:9093,2@192.168.0.112:9093,3@192.168.0.113:9093

# Log Directory
log.dirs=/var/lib/kafka/data
Note: Repeat for k2 and k3, updating node.id to 2 and 3, and adjusting the listener IPs accordingly.
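For reference, a minimal sketch of the values that change on k2; k3 follows the same pattern with node.id=3 and 192.168.0.113. Also make sure the log.dirs directory exists on every node and is owned by the kafka user created in Step 1:

# On k2 (192.168.0.112) -- only the node-specific values differ
node.id=2
listeners=PLAINTEXT://192.168.0.112:9092,CONTROLLER://192.168.0.112:9093
advertised.listeners=PLAINTEXT://192.168.0.112:9092

# On every node: create the data directory referenced by log.dirs
sudo mkdir -p /var/lib/kafka/data
sudo chown -R kafka:kafka /var/lib/kafka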
Step 3: Cluster ID Generation & Formatting
All nodes in a Kafka cluster must share the same Cluster ID. Generate it on your first node and distribute it to others.
# 1. Generate a Cluster UUID (On Node k1 ONLY)
KAFKA_CLUSTER_ID="$(/opt/kafka/bin/kafka-storage.sh random-uuid)"
echo $KAFKA_CLUSTER_ID
# Example Output: y9icjMjBRrS5gjfgSaea_Q

# 2. Export the ID on ALL nodes (k1, k2, k3)
# Replace with the actual ID generated in step 1
export KAFKA_CLUSTER_ID="y9icjMjBRrS5gjfgSaea_Q"

# Ensure it's set correctly
echo $KAFKA_CLUSTER_ID

# 3. Format Log Directories (On ALL nodes)
/opt/kafka/bin/kafka-storage.sh format \
  -t $KAFKA_CLUSTER_ID \
  -c /opt/kafka/config/kraft/server.properties
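To confirm the format step worked, each node should now have a meta.properties file in its data directory whose cluster.id matches the shared ID (path assumes the log.dirs value from Step 2):

# Check the formatted storage metadata
cat /var/lib/kafka/data/meta.properties
# Expect cluster.id to equal $KAFKA_CLUSTER_ID and node.id to match server.properties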
Step 4: Automating with Systemd
Create a service file to manage the Kafka process.
# /etc/systemd/system/kafka.service
[Unit]
Description=Apache Kafka Server
After=network.target

[Service]
Type=simple
User=kafka
Group=kafka
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/kraft/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
Starting the Cluster
sudo systemctl daemon-reload
sudo systemctl enable kafka
sudo systemctl start kafka
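After starting, confirm the service is healthy on each node and follow the logs for errors using standard systemd tooling:

sudo systemctl status kafka --no-pager
sudo journalctl -u kafka -f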
Step 5: Verification
Verify the cluster is healthy by creating a topic and producing/consuming messages.
# Create a Topic
/opt/kafka/bin/kafka-topics.sh --create --topic test-events --bootstrap-server 192.168.0.111:9092 --partitions 3 --replication-factor 3

# List Topics
/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server 192.168.0.111:9092

# Produce Messages
/opt/kafka/bin/kafka-console-producer.sh --topic test-events --bootstrap-server 192.168.0.111:9092

# Consume Messages (From another terminal)
/opt/kafka/bin/kafka-console-consumer.sh --topic test-events --from-beginning --bootstrap-server 192.168.0.111:9092
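To confirm that replication really spans all three brokers, describe the topic and check that every partition lists three replicas and a full ISR; you can also re-run the quorum status command from earlier against any node:

# Describe the topic's partitions, replicas, and ISR
/opt/kafka/bin/kafka-topics.sh --describe --topic test-events --bootstrap-server 192.168.0.111:9092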
Critical Configurations
- advertised.listeners: This must be the IP reachable by clients.
- num.network.threads & num.io.threads: Adjust based on CPU cores.
- group.initial.rebalance.delay.ms: Set to 3000 in production to prevent frequent rebalancing.
- offsets.topic.replication.factor: Set to 3 for high availability.
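A hedged sketch of how these settings might look in server.properties; the thread counts are placeholders to scale with your hardware:

# Illustrative values only
advertised.listeners=PLAINTEXT://192.168.0.111:9092   # must be reachable by clients
num.network.threads=8                                 # placeholder: scale with CPU cores
num.io.threads=16                                     # placeholder: scale with cores and disks
group.initial.rebalance.delay.ms=3000
offsets.topic.replication.factor=3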

