Introduction
Apache Kafka is an open source distributed event and stream processing platform written in Java and built for processing real-time data feeds. It is inherently scalable, with high throughput and availability. Developed by the Apache Software Foundation, Kafka has been widely adopted for its reliability, ease of use, and fault tolerance. It is used by the world's largest organizations to manage large volumes of data in a distributed and efficient manner.
In this tutorial, you will download and set up Apache Kafka. You will learn how to create and delete topics, as well as send and receive events using the provided scripts. You will also learn about similar projects with the same goal and how Kafka compares.
Prerequisites
- A machine with at least 4 GB of RAM and 2 CPUs, such as an Ubuntu server.
- Java 8 or higher installed on your Droplet or local machine.
Step 1 – Download and Configure Apache Kafka
In this section, you will download and extract Apache Kafka on your machine. For added security, you will set it up under your own user account. Then, you will configure and run it using KRaft.
First, create a separate user that Kafka will run under. Create a user named kafka by running the following command:
sudo adduser kafka
You will be asked for your account password. Enter a strong password and skip filling in the additional information by pressing ENTER for each field.
Finally, switch to the kafka user:
su kafka
Next, you will download the Kafka release package from the official downloads page. At the time of writing, the latest version was 3.7.0, built for Scala 2.13. If you are using macOS or Linux, you can download Kafka with curl.
Use this command to download Kafka and place it in /tmp:
curl -o /tmp/kafka.tgz https://downloads.apache.org/kafka/3.7.0/kafka_2.13-3.7.0.tgz
You will store Kafka under ~/kafka in your home directory. Create that directory by running:
mkdir ~/kafka
Then extract it to ~/kafka by running:
tar -xzf /tmp/kafka.tgz -C ~/kafka --strip-components=1
Since the archive contains a root folder named after the Kafka version, --strip-components=1 skips that folder and extracts everything inside it.
At the time of writing, Kafka 3 was the last major release to support two systems for metadata management: Apache ZooKeeper and Kafka KRaft (short for Kafka Raft). ZooKeeper is an open source project that provides a standard way to coordinate distributed data for applications, also developed by the Apache Software Foundation.
However, starting with Kafka 3.3, support for KRaft was introduced. KRaft is a purpose-built system for coordinating only Kafka instances, simplifying the installation process and allowing for much greater scalability. With KRaft, Kafka itself takes full responsibility for the data instead of keeping administrative metadata externally.
While still available, ZooKeeper support is expected to be removed from Kafka 4 and beyond. In this tutorial, you will set up Kafka using KRaft.
You will need to create a unique identifier for your new Kafka cluster, which currently consists of only one node. First, go to the directory where Kafka now lives:
cd ~/kafka
Kafka with KRaft stores its configuration in config/kraft/server.properties, while the ZooKeeper-based configuration file is config/server.properties.
Before running Kafka for the first time, you need to override some of the default settings. Open the file for editing by running:
nano config/kraft/server.properties
Find the following lines:
...
############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs
...
The log.dirs setting specifies where Kafka keeps its log files. By default, it stores them in /tmp/kafka-logs because that location is guaranteed to be writable, albeit temporarily. Replace the value with the following path:
...
############################# Log Basics #############################

# A comma separated list of directories under which to store log files
log.dirs=/home/kafka/kafka-logs
...
Since you created a separate user for Kafka, you place the log directory under that user's home directory. If it doesn't exist, Kafka will create it. When you are done, save and close the file.
Now that you have configured Kafka, run the following command to generate a random cluster ID:
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
Then format the storage space for the log files by running the following command, passing in the ID:
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
The output will be:
Output
Formatting /home/kafka/kafka-logs with metadata.version 3.7-IV4.
Finally, you can start the Kafka server for the first time:
bin/kafka-server-start.sh config/kraft/server.properties
The end of the output will be similar to this:
Output
...
[2024-02-26 10:38:26,889] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.DataPlaneAcceptor)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Waiting for all of the authorizer futures to be completed (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Finished waiting for all of the authorizer futures to be completed (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Waiting for all of the SocketServer Acceptors to be started (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Finished waiting for all of the SocketServer Acceptors to be started (kafka.server.BrokerServer)
[2024-02-26 10:38:26,890] INFO [BrokerServer id=1] Transition from STARTING to STARTED (kafka.server.BrokerServer)
[2024-02-26 10:38:26,891] INFO Kafka version: 3.7.0 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-26 10:38:26,891] INFO Kafka commitId: 5e3c2b738d253ff5 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-26 10:38:26,891] INFO Kafka startTimeMs: 1708943906890 (org.apache.kafka.common.utils.AppInfoParser)
[2024-02-26 10:38:26,892] INFO [KafkaRaftServer nodeId=1] Kafka Server started (kafka.server.KafkaRaftServer)
The output shows that Kafka has successfully initialized using KRaft and is accepting connections on 0.0.0.0:9092.
The process exits when you press CTRL + C. Since it is not practical to keep a terminal session open just to run Kafka, in the next step you will create a service to run Kafka in the background.
Step 2 – Create a systemd service for Kafka
In this section, you will create a systemd service to run Kafka in the background at all times. Systemd services can be started, stopped, and restarted in a consistent manner.
You will store the service configuration in a file called kafka.service in the /etc/systemd/system directory, where systemd stores its services. Create it using your text editor:
sudo nano /etc/systemd/system/kafka.service
Add the following lines:
[Unit]
Description=kafka-server
[Service]
Type=simple
User=kafka
ExecStart=/bin/sh -c '/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/kraft/server.properties > /home/kafka/kafka/kafka.log 2>&1'
ExecStop=/home/kafka/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
Here you first specify the description of the service. Then, in the [Service] section, you define the type of the service (simple means that systemd should simply execute the given command), provide the commands for starting and stopping it, and specify that it should run as the kafka user. Setting Restart=on-abnormal means the service will be restarted automatically if Kafka exits abnormally.
The [Install] section instructs systemd to start this service as part of the multi-user target, once it becomes possible to log in to the server. When finished, save and close the file.
Start the Kafka service by running the following command:
sudo systemctl start kafka
Check that it started correctly by viewing its status:
sudo systemctl status kafka
You will see output similar to the following:
Output
● kafka.service - kafka-server
Loaded: loaded (/etc/systemd/system/kafka.service; disabled; preset: enabled)
Active: active (running) since Mon 2024-02-26 11:17:30 UTC; 2min 40s ago
Main PID: 1061 (sh)
Tasks: 94 (limit: 4646)
Memory: 409.2M
CPU: 10.491s
CGroup: /system.slice/kafka.service
├─1061 /bin/sh -c "/home/kafka/kafka/bin/kafka-server-start.sh /home/kafka/kafka/config/kraft/server.properties > /home/kafka/kafka/kafka.log 2>&1"
└─1062 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true "-Xlog:gc*:file=/home/kafka/kafka/bin/../logs/kaf>
Feb 26 11:17:30 kafka-test1 systemd[1]: Started kafka.service - kafka-server.
To start Kafka automatically after a server restart, enable its service by running the following command:
sudo systemctl enable kafka
At this point, you have created and enabled a systemd service for Kafka, so it starts on every server boot. Next, you will learn how to create and delete topics in Kafka, as well as how to produce and consume text messages using the available scripts.
Step 3 – Producing and Consuming Messages
Now that you have set up a Kafka server, you will be introduced to topics and how to manage them using the provided scripts. You will also learn how to send and receive messages from a topic.
As explained in the Event Stream article, publishing and receiving messages in Kafka revolves around topics. A topic can be thought of as the category to which a message belongs.
The provided kafka-topics.sh script allows you to manage topics in Kafka from the CLI. To create a topic named first-topic, run the following command:
bin/kafka-topics.sh --create --topic first-topic --bootstrap-server localhost:9092
All provided Kafka scripts require you to specify the server address with --bootstrap-server.
The output will be:
Output
Created topic first-topic.
To list all existing topics, replace --create with --list:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
You will see the topic you created:
Output
first-topic
You can get detailed information and statistics about the topic by passing in --describe:
bin/kafka-topics.sh --describe --topic first-topic --bootstrap-server localhost:9092
The output will look like this:
Output
Topic: first-topic TopicId: VtjiMIUtRUulwzxJL5qVjg PartitionCount: 1 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: first-topic Partition: 0 Leader: 1 Replicas: 1 Isr: 1The first line specifies the topic name, ID, and recurrence factor, which is 1 because the topic only exists on the current machine. The second line is intentionally indented and shows information about the first (and only) partition of the topic. Kafka allows you to partition the topic, meaning that different parts of a topic can be distributed across different servers, increasing scalability. Here, there is only one partition.
Now that you have created a topic, you will produce messages for it using the kafka-console-producer.sh script. Run the following command to start the producer:
bin/kafka-console-producer.sh --topic first-topic --bootstrap-server localhost:9092
You will see a blank prompt:
>
The producer is waiting for your message. Enter test and press ENTER. The prompt will now look like this:
>test
>
The producer is now waiting for the next message, meaning the previous message was successfully delivered to Kafka. You can enter any number of messages for testing. To exit the producer, press CTRL+C.
To retrieve messages from a topic, you need a consumer. Kafka provides a simple consumer in the form of kafka-console-consumer.sh. Run it by executing:
bin/kafka-console-consumer.sh --topic first-topic --bootstrap-server localhost:9092
However, there will be no output. This is because the consumer streams data from the topic, and nothing is being produced at the moment. To consume messages that were produced before the consumer started, you need to read the topic from the beginning by running:
bin/kafka-console-consumer.sh --topic first-topic --from-beginning --bootstrap-server localhost:9092
The consumer replays all events in the topic and fetches the messages:
Output
test
...
As with the producer, press CTRL+C to exit.
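The difference between the default consumer and --from-beginning comes down to where the starting offset points in the topic's log. Here is a minimal conceptual sketch in Python (not the Kafka API, just the idea): a partition is an append-only list, and a consumer is nothing more than an offset into it.

```python
# Conceptual sketch (not Kafka code): a topic partition is an
# append-only log, and a consumer is just an offset into that log.
log = ["test"]  # a message produced before the consumer started

def consume(log, from_beginning):
    # The default consumer starts at the end of the log ("latest"),
    # so it only sees messages produced after it starts.
    # --from-beginning corresponds to starting at offset 0.
    start_offset = 0 if from_beginning else len(log)
    return log[start_offset:]

print(consume(log, from_beginning=False))  # [] -- nothing new yet
print(consume(log, from_beginning=True))   # ['test'] -- replays history
```

This is why the first consumer invocation above printed nothing: its starting offset was already past every existing message.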
To verify that the consumer is actually streaming data, you will run it in a separate terminal session. Open a secondary SSH session and run the consumer with the default configuration:
bin/kafka-console-consumer.sh --topic first-topic --bootstrap-server localhost:9092
In the initial session, run the producer:
bin/kafka-console-producer.sh --topic first-topic --bootstrap-server localhost:9092
Then, enter your desired messages:
>second test
>third test
>
You will immediately see them being received by the consumer:
Output
second test
third test
After you are done testing, terminate both the producer and the consumer.
To delete first-topic, pass in --delete to kafka-topics.sh:
bin/kafka-topics.sh --delete --topic first-topic --bootstrap-server localhost:9092
There will be no output. You can list the topics to verify that it was indeed deleted:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
The output will be:
Output
__consumer_offsets
__consumer_offsets is an internal Kafka topic that stores how far each consumer has read into a topic.
At this point, you have created a Kafka topic and produced messages to it. Then, you consumed those messages using the provided script, both by replaying history and by receiving them in real time. Next, you will learn how Kafka compares to other event brokers and similar software.
Comparison with similar architectures
Apache Kafka is considered the de facto solution for event streaming use cases. However, Apache Pulsar and RabbitMQ are also widely used and stand out as versatile options, albeit with differences in their approach.
The main difference between a message queue and an event stream is that the primary task of the former is to deliver messages to consumers in the fastest possible manner, regardless of their order. Such systems typically store messages in memory until they are acknowledged by consumers. Message filtering and routing are important aspects, as consumers can show interest in specific categories of data. RabbitMQ is a strong example of a traditional messaging system where multiple consumers can subscribe to a topic and receive multiple copies of a message.
On the other hand, event streaming focuses on persistence. Events are stored durably and retained so they can be replayed and processed by many consumers. Routing them to specific consumers is less important, as the idea is that every consumer can process the full stream of events.
Apache Pulsar is an open source messaging system developed by the Apache Software Foundation that supports event streaming. Unlike Kafka, which was built for event streaming from the ground up, Pulsar started out as a traditional message queuing solution and later gained event streaming capabilities. Pulsar is therefore useful when a combination of both approaches is needed, without having to deploy separate applications.
Conclusion
You now have Apache Kafka running securely in the background on your server, configured as a systemd service. You have also learned how to manage topics from the command line, as well as how to produce and consume messages. However, the main appeal of Kafka is the wide variety of clients for integrating it into your applications.