What is Kafka? Apache Kafka is an open source project providing powerful distributed processing of continuous data streams – and is currently trusted in production by thousands of enterprises globally including the likes of Netflix, Twitter, Spotify, Uber and more.
Its architecture and implementation make it highly reliable and highly available, enabling stream processing applications to utilise geographically distributed data streams. While Kafka is not difficult to use, it’s tricky to optimise.
Here are our first 5 tips and tricks that will help you perfect your Kafka system to get ahead!
- Logs:
Kafka has a lot of log configuration. The defaults are generally sane, but most users will have at least a few things they need to tweak for their particular use case. You need to think about retention policy, cleanups, compaction, and compression.
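As a rough illustration of the settings involved, the sketch below collects a few topic-level log options and renders them as `key=value` lines. The config names (`retention.ms`, `cleanup.policy`, `compression.type`, `segment.bytes`) are standard Kafka topic settings; the values and the helper name `render_topic_config` are assumptions made for this example, not recommendations.

```python
# Hypothetical values for illustration only; tune them for your own workload.
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000

topic_log_config = {
    "retention.ms": SEVEN_DAYS_MS,         # how long messages are kept
    "cleanup.policy": "delete",            # "delete", "compact", or "compact,delete"
    "compression.type": "lz4",             # codec applied to stored log data
    "segment.bytes": 1024 * 1024 * 1024,   # 1 GiB log segments
}

VALID_CLEANUP_POLICIES = {"delete", "compact"}


def render_topic_config(config):
    """Check the cleanup policy, then return key=value lines ordered by key."""
    raw_policy = str(config.get("cleanup.policy", "delete"))
    policies = {part.strip() for part in raw_policy.split(",")}
    unknown = policies - VALID_CLEANUP_POLICIES
    if unknown:
        raise ValueError(f"unknown cleanup.policy value(s): {sorted(unknown)}")
    return [f"{key}={value}" for key, value in sorted(config.items())]


if __name__ == "__main__":
    print("\n".join(render_topic_config(topic_log_config)))
```

Keeping these settings in one reviewed place makes it easier to see which defaults you have deliberately overridden.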
- Hardware Requirements:
When tech teams start playing with Kafka, there is a tendency to ‘ballpark’ the hardware: just spin up a big server and hope it works. Kafka does not necessarily need a ton of resources. It is designed for horizontal scaling, so you can get away with using relatively cheap commodity hardware.
CPU: Doesn’t need to be very powerful unless you’re using SSL and compressing logs. The more cores, the better for parallelisation. If you do need compression, we recommend the LZ4 codec for best performance in most cases.
Memory: Kafka works best when it has at least 6 GB of memory for heap space. The rest will go to the OS page cache, which is key for client throughput. Kafka can run with less RAM, but don’t expect it to handle much load. For heavy production use cases, go for at least 32 GB.
Disk: Because of Kafka’s sequential disk I/O paradigm, SSDs will not offer much benefit. Do not use NAS. Multiple drives in a RAID setup can work well.
Network and File system: Use XFS and keep your cluster in a single datacentre if possible. The higher the network bandwidth, the better.
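To make the memory guidance concrete, here is a minimal sketch of the arithmetic: whatever RAM is left after the heap is what the OS can use for page cache. The function name `page_cache_gb` and the 1 GB reserve for the OS and other processes are assumptions made for this example.

```python
def page_cache_gb(total_ram_gb, heap_gb=6.0, os_reserve_gb=1.0):
    """Estimate RAM left for the OS page cache after the Kafka heap.

    os_reserve_gb is an assumed allowance for the OS and other processes.
    """
    if total_ram_gb <= 0 or heap_gb <= 0 or os_reserve_gb < 0:
        raise ValueError("memory sizes must be positive")
    return max(total_ram_gb - heap_gb - os_reserve_gb, 0.0)


if __name__ == "__main__":
    for ram in (8, 32, 64):
        print(f"{ram} GB RAM leaves about {page_cache_gb(ram):.0f} GB for page cache")
```

On a small server the heap takes most of the memory, which is one way to see why low-RAM brokers struggle under load.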
- ZooKeeper:
We could do an entire article just on ZooKeeper. It is a versatile piece of software that works great for both service discovery and a range of distributed configuration use cases.
- Avoid co-locating ZooKeeper in any major production environment
This is a shortcut many companies take thanks to the spread of Docker. This is fine for development environments or even smaller production deployments, assuming you take the right precautions. The risk with larger systems is that you lose more of your infrastructure if a single server goes down. It’s also suboptimal for your security setup because Kafka and ZooKeeper are likely to have very different sets of clients, and you will not be able to isolate them as well.
- Do not use more than five ZooKeeper nodes without a really great reason
For a dev environment, one node is fine. For your staging environment you should use the same number of nodes as production. In general, three ZooKeeper nodes will suffice for a typical Kafka cluster. If you have a very large Kafka deployment, it may be worth going to five ZooKeeper nodes to improve latency, but be aware this will put more strain on the nodes.
- Tune for minimal latency
Use servers with really good network bandwidth. Use appropriate disks and keep logs on a separate disk. Isolate the ZooKeeper process and ensure that swap is disabled. Be sure to track latency in your instrumentation dashboards.
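The node counts recommended above follow from how ZooKeeper stays available: the ensemble needs a majority of its nodes to be up. The sketch below works through that arithmetic; the function names are hypothetical and chosen for this example.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a majority of the ensemble."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one node")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for nodes in (1, 3, 4, 5):
        print(f"{nodes} node(s): quorum {quorum_size(nodes)}, "
              f"tolerates {tolerated_failures(nodes)} failure(s)")
```

Note that four nodes tolerate no more failures than three, which is why ensembles are normally sized with an odd number of nodes.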
- Replication and Redundancy
There are a few dimensions to consider when thinking about redundancy with Kafka. The first and most obvious is the replication factor. Kafka’s default replication factor for automatically created topics is 1, but for most production uses 3 is best. It will allow you to lose a broker and not freak out. If, improbably, a second broker also independently fails, your system is still running. Alongside replication factor you also have to think about datacenter racks and zones.
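As a minimal sketch of the reasoning behind a replication factor of 3, the function below counts how many broker losses a single partition can absorb. It assumes every replica is in sync and lives on a different broker. `min.insync.replicas` is a real Kafka setting that this article does not otherwise cover; the function name and the returned keys are hypothetical.

```python
def broker_losses_tolerated(replication_factor, min_insync_replicas=1):
    """Count broker losses one partition can absorb.

    Assumes all replicas are in sync and placed on distinct brokers.
    """
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("min.insync.replicas must be between 1 and the replication factor")
    return {
        # At least one copy of the data still exists.
        "data_survives": replication_factor - 1,
        # Producers using acks=all can still write.
        "acks_all_writes_continue": replication_factor - min_insync_replicas,
    }


if __name__ == "__main__":
    print(broker_losses_tolerated(3))
    print(broker_losses_tolerated(3, min_insync_replicas=2))
```

Raising `min.insync.replicas` trades some write availability for stronger durability, so it is worth choosing together with the replication factor.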
- Topic Config
Your Kafka cluster’s performance will depend greatly on how you configure your topics. In general you want to treat topic configuration as immutable, since making changes to things like partition count or replication factor can cause a lot of pain. If you find that you need to make a major change to a topic, often the best solution is to just create a new one. Always test new topics in a staging environment first.
As mentioned above, start at 3 for replication factor. If you need to handle large messages, see if you can either break them up into ordered pieces (easy to do with partition keys) or just send pointers to the actual data (links to S3, for example). If you absolutely have to handle larger messages, be sure to enable compression on the producer’s side. The default log segment size of 1 GB should be fine (if you are sending messages larger than 1 GB, reconsider your use case). Partition count, possibly the most important setting, is addressed in the next section.
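The idea of breaking a large message into ordered pieces can be sketched as follows. Every piece carries the same key, so with Kafka’s default key-based partitioning the pieces should land on one partition and keep their order. This is a self-contained illustration that does not talk to Kafka; the record layout and the names `split_message` and `reassemble` are assumptions made for this example.

```python
import uuid


def split_message(payload, chunk_size, message_id=None):
    """Split a bytes payload into ordered chunks that share one key."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    message_id = message_id or uuid.uuid4().hex
    # Ceiling division; an empty payload still yields one (empty) chunk.
    total = max(1, -(-len(payload) // chunk_size))
    return [
        {
            "key": message_id,   # same key for every chunk of this message
            "index": i,          # position used to restore the order
            "total": total,      # lets the consumer detect missing chunks
            "data": payload[i * chunk_size:(i + 1) * chunk_size],
        }
        for i in range(total)
    ]


def reassemble(chunks):
    """Rebuild the original payload, failing if any chunk is missing."""
    ordered = sorted(chunks, key=lambda chunk: chunk["index"])
    if not ordered or len(ordered) != ordered[0]["total"]:
        raise ValueError("missing chunks")
    return b"".join(chunk["data"] for chunk in ordered)


if __name__ == "__main__":
    pieces = split_message(b"x" * 2500, chunk_size=1000)
    print(len(pieces), len(reassemble(pieces)))
```

In a real producer, each chunk’s `key` would be sent as the record key and the remaining fields would travel in the value or headers.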
Instaclustr’s competitive edge:
With the addition of Managed Kafka to the suite of solutions available through Instaclustr’s Open Source-as-a-Service platform, organizations using Instaclustr-managed Kafka are selecting an experienced provider distinguished by more than 20 million node hours under management and available technical teams that bring deep Kafka-specific expertise.
The managed Kafka offering follows the robust provisioning and management patterns used to deliver other leading open source technologies provided through the Instaclustr platform – including Apache Cassandra, Apache Spark, and Elassandra. Instaclustr Managed Apache Kafka is backed by advanced data technologies designed to deliver easy scalability, high performance, and uninterrupted availability. Additionally, Instaclustr provides customers with a SOC2 certified Kafka managed service, further ensuring secure data management and safeguarding client privacy.
About Instaclustr
Instaclustr is the Open Source-as-a-Service company, delivering reliability at scale. We operate an automated, proven, and trusted managed environment, providing database, analytics, search, and messaging. We enable companies to focus internal development and operational resources on building cutting-edge customer-facing applications.
For more information, visit www.instaclustr.com
Author: Judy Sahay
Judy Sahay is managing director of Crowd Media Group, a group of three companies focused on Digital Media, Tech and Big Data. She is passionate about sharing knowledge in IT News & Trends, IoT, Big Data, the changing media landscape, along with tips and tricks on digital media strategies, social media and influencer marketing engagements.