Message queues are usually a core part of any distributed architecture, and the options are endless: Kafka, RabbitMQ, NATS, Redis Streams, SQS, ZeroMQ... and then there's the “just use Postgres” camp for simpler use cases.
I’m trying to make sense of the tradeoffs between:
- async fire-and-forget pub/sub vs. sync RPC-like point to point communication
- simple FIFO vs. priority queues and delay queues
- intelligent brokers (e.g. RabbitMQ, NATS with filters) vs. minimal brokers (e.g. Kafka’s client-driven model)
There's also a fair amount of ideology/emotional attachment - some folks root for underdogs written in their favorite programming language, others reflexively dismiss anything that's not "enterprise-grade". And of course, vendors are always in the mix trying to steer the conversation toward their own solution.
If you’ve built a production system in the last few years:
1. What queue did you choose?
2. What didn't work out?
3. Where did you regret adding complexity?
4. And if you stuck with a DB-based queue — did it scale?
I’d love to hear war stories, regrets, and opinions.
Mostly because it has been very reliable for years in production at a previous company, and doesn’t require babysitting. Its recent versions also has new features that make it is a descent alternative to Kafka if you don’t need to scale to the moon.
And the logo is a rabbit.
For smaller projects of "job queues," I tend to use Amazon SQS or RabbitMQ.
But just for clarity, Kafka is not really a message queue -- it's a persistent structured log that can be used as a message queue. More specifically, you can replay messages by resetting the offset. In a queue, the idea is once you pop an item off the queue, it's no longer in the queue and therefore is gone once it's consumed, but with Kafka, you're leaving the message where it is and moving an offset instead. This means, for example, that you can have many many clients read from the same topic without issue.
SQS and other MQs don't have that persistence -- once you consume the message and ack, the message disappears and you can't "replay it" via the queue system. You have to re-submit the message to process it. This means you can really only have one client per topic, because once the message is consumed, it's no longer available to anyone else.
There are pros and cons to either mechanism, and there's significant overlap in the usage of the two systems, but they are designed to serve different purposes.
The analogy I tend to use is that Kafka is like reading a book. You read a page, you turn the page. But if you get confused, you can flip back and reread a previous page. An MQ like RabbitMQ or Sidekiq is more like the line at the grocery store: once the customer pays, they walk out and they're gone. You can't go back and re-process their cart.
Again, pros and cons to both approaches.
"What didn't work out?" -- I've learned in my career that, in general, I really like replayability, so Kafka is typically my first choice, unless I know that re-creating the messages are trivial, in which case I am more inclined to lean toward RabbitMQ or SQS. I've been bitten several times by MQs where I can't easily recreate the queue, and I lose critical messages.
"Where did you regret adding complexity?" -- Again, smaller systems that are just "job queues" (versus service-to-service async communication) don't need a whole lot of complexity. So I've learned that if it's a small system, go with an MQ first (any of them are fine), and go with Kafka only if you start scaling beyond a single simple system.
"And if you stuck with a DB-based queue -- did it scale?" -- I've done this in the past. It scales until it doesn't. Given my experience with MQs and Kafka, I feel it's a trivial amount of work to set up an MQ/Kafka, and I don't get anything extra by using a DB-based queue. I personally would avoid these, unless you have a compelling reason to use it (eg, your DB isn't huge, and you can save money).