
Why I moved off Kafka for a single-node workload

2026-05-03 · ~6 min

I inherited a Kafka cluster the way most people do: someone, at some point, had a problem that genuinely needed a log, solved it, and then everything downstream got built assuming the log was there. By the time it reached me the actual workload was one producer, one consumer group, and maybe forty messages a second on a busy day. It ran on a single broker because there was only ever one box.

A single-broker Kafka is a contradiction. The whole value proposition (partition replication, leader election, surviving a node loss) requires more than one node. With one broker you get all the operational surface area of a distributed log and none of its fault-tolerance guarantees. You have bought the maintenance contract for a fault-tolerant system and deployed it in a configuration where the first disk or host failure can lose data.

The tax

What that single broker actually cost me, month over month:

None of these are Kafka's fault. Kafka is excellent at being Kafka. The fault was running it for a job that did not need a distributed commit log, in a topology where it couldn't even provide one.

What the workload actually was

I sat down and wrote out what the consumers needed, stripped of the vocabulary the existing system had imposed:

Durably accept a unit of work. Hand it to exactly one worker. If the worker dies, give it to another one. Retry a few times, then give up loudly. Survive a restart.

That is a job queue. It is not a stream. We weren't replaying history, we weren't fanning out to multiple independent consumers, we weren't doing windowed aggregation. We were doing tasks, one at a time, and calling them messages because the tool we had used that word.
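The contract above can be read as a small state machine. As a rough sketch only, here is an in-memory Python model of it. The names (`Job`, `JobQueue`, `max_attempts`, and the status strings) are hypothetical and not taken from the system described here, and durability and surviving a restart are deliberately left out, since those come from the storage layer rather than from this logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    id: int
    payload: str
    status: str = "queued"  # queued -> running -> done | failed
    attempts: int = 0
    locked_by: Optional[str] = None


class JobQueue:
    """In-memory model of the contract; illustrative only, not durable."""

    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts  # assumed retry budget
        self.jobs = {}                    # job id -> Job
        self._next_id = 1

    def enqueue(self, payload: str) -> Job:
        # "Accept a unit of work" (durability is out of scope here).
        job = Job(id=self._next_id, payload=payload)
        self.jobs[job.id] = job
        self._next_id += 1
        return job

    def claim(self, worker: str) -> Optional[Job]:
        # Hand the oldest queued job to exactly one worker.
        for job in sorted(self.jobs.values(), key=lambda j: j.id):
            if job.status == "queued":
                job.status = "running"
                job.locked_by = worker
                job.attempts += 1
                return job
        return None

    def complete(self, job_id: int) -> None:
        job = self.jobs[job_id]
        job.status = "done"
        job.locked_by = None

    def release(self, job_id: int) -> None:
        # Called when a worker dies or the job raises:
        # requeue it, or give up loudly once the budget is spent.
        job = self.jobs[job_id]
        job.locked_by = None
        if job.attempts >= self.max_attempts:
            job.status = "failed"
            raise RuntimeError(
                f"job {job.id} failed after {job.attempts} attempts"
            )
        job.status = "queued"
```

Nothing in this model needs offsets, partitions, or replay, which is the point: the whole contract fits in one small class.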

The replacement

I already run Postgres for this service's actual data. So the queue became a table and a claim query built on SELECT ... FOR UPDATE SKIP LOCKED — the mechanics are their own post, but the shape is:

UPDATE jobs SET status = 'running', locked_by = $1
WHERE id = (
  SELECT id FROM jobs WHERE status = 'queued'
  ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 1
)
RETURNING id, payload;

Multiple workers, no collisions, no coordinator. Durability and crash-restart come from Postgres, which I was already backing up and already monitoring. The migration itself was a few hundred lines and a cutover window measured in minutes, because the producer just started INSERT-ing rows instead of producing to a topic.
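For illustration, a worker built around that claim query might look roughly like the Python sketch below. It assumes a DB-API style connection whose driver uses `%s` placeholders (as psycopg does), so `$1` becomes `%s`. The `'done'` and `'failed'` statuses, the `handle` callback, and the function names are assumptions made for this example, not the code from the migration, and retry counting is omitted.

```python
CLAIM_SQL = """
UPDATE jobs SET status = 'running', locked_by = %s
WHERE id = (
  SELECT id FROM jobs WHERE status = 'queued'
  ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 1
)
RETURNING id, payload
"""

# Assumed follow-up statement; the post does not show how jobs are finished.
FINISH_SQL = "UPDATE jobs SET status = %s, locked_by = NULL WHERE id = %s"


def claim_one(conn, worker_id):
    """Claim one queued job and commit; return (id, payload) or None."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (worker_id,))
        row = cur.fetchone()  # None when no queued row was available
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


def set_status(conn, job_id, status):
    """Record the outcome of a job and release its lock marker."""
    cur = conn.cursor()
    try:
        cur.execute(FINISH_SQL, (status, job_id))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()


def drain(conn, worker_id, handle):
    """Process jobs until none are queued; return how many succeeded.

    `handle(payload)` is a hypothetical callback that does the real work.
    """
    succeeded = 0
    while True:
        row = claim_one(conn, worker_id)
        if row is None:
            return succeeded
        job_id, payload = row
        try:
            handle(payload)
        except Exception:
            # No retry budget in this sketch: mark the job and move on.
            set_status(conn, job_id, "failed")
        else:
            set_status(conn, job_id, "done")
            succeeded += 1
```

A long-running worker would call `drain` in a loop with a short sleep between passes. Committing the claim before doing the work keeps the row lock short, at the cost of needing some separate way to notice jobs stuck in `running` after a worker crash.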

The result

Deleting the broker removed two long-running processes, one config file's worth of retention footguns, and an entire category of "is it the client or the broker" debugging. The p99 latency went down, mostly because a local Postgres round-trip is faster than the producer ack path I'd had. The throughput ceiling dropped from "theoretically enormous" to "a few thousand a second," which is still roughly two orders of magnitude above the forty or so messages a second this workload sees on a busy day.

I want to be careful here: this is not "Kafka bad." If you have multiple independent consumers replaying a durable log across a cluster you actually run as a cluster, Kafka is the right tool and Postgres is not. The lesson is narrower and more boring. Match the tool to the topology you actually run, not the one the tool assumes. A distributed system on one node is just an expensive single node.

I default to the boring option now and make the fancy one earn its place. So far it almost never does.
