How payment microservices stay consistent on an event bus
Every payments platform eventually adopts an event bus, and every platform that adopts one without fully internalizing the word “eventually” pays for the education in incidents visible to customers. A Kafka cluster in the middle of a microservice topology does not give you distributed systems. It gives you a distributed systems problem, at your expense, on your oncall schedule.
We have operated a payments platform processing several billion dollars a year across card acquiring, ACH, cross border wires, and internal transfers, fronted by around forty services communicating over a single event bus. The technology stack is unremarkable: Kafka, Postgres, a few Go services, a few Node services, the usual. What is worth writing down is the handful of design decisions that, in retrospect, determined whether we were debugging a misunderstanding of the business or a misunderstanding of the bus.
Eventual consistency is a contract, not an excuse
The sentence “the system is eventually consistent” tends to arrive in the same conversation as “so the duplicate refund is expected behavior.” It is not. Eventual consistency is a promise that the system converges on a correct state within a bounded, observable window. If you cannot state the window in numbers, for example “99.9% of authorization events are reflected in the risk service within 800 ms,” you do not have an eventually consistent system. You have a system with unknown convergence, which is a different and worse thing.
The first artifact a payments team should produce when moving to an event bus is a freshness SLO per consumer, per event type. Measure end to end lag as the difference between the producer’s event timestamp and the consumer’s commit timestamp, at the 50th, 95th, and 99.9th percentiles. Alarm on the distributions, not the averages. Review the numbers weekly. The first week you stop measuring them is the week they stop being true.
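As a concrete sketch of that measurement, assuming a segmentio/kafka-go consumer and a Prometheus histogram (metric name, label, and buckets are ours, not a prescription):

```go
package payments

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/segmentio/kafka-go"
)

// Producer event timestamp to consumer commit, labeled by topic as a
// stand-in for event type.
var freshness = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "consumer_freshness_seconds",
	Help:    "End to end lag from producer timestamp to consumer commit.",
	Buckets: prometheus.ExponentialBuckets(0.01, 2, 16), // 10 ms to roughly 5 min
}, []string{"topic"})

func init() { prometheus.MustRegister(freshness) }

func consume(ctx context.Context, r *kafka.Reader, handle func(kafka.Message) error) error {
	for {
		msg, err := r.FetchMessage(ctx)
		if err != nil {
			return err
		}
		if err := handle(msg); err != nil {
			return err // retry and dead letter paths elided
		}
		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
		// msg.Time carries the producer-side event timestamp.
		freshness.WithLabelValues(msg.Topic).Observe(time.Since(msg.Time).Seconds())
	}
}
```

The 50th, 95th, and 99.9th percentile series fall out of histogram_quantile over this metric, and the alert belongs on the tail, where convergence breaks first.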
The outbox pattern is not optional
The most common cause of “we charged the card but the order is missing” is a dual write. The service commits the payment row to Postgres, then publishes a payment.authorized event to Kafka. Between the two, anything can happen: the broker rejects the publish, the pod is SIGKILLed, the network partitions. The database and the bus now disagree about what the world looks like. In a payments context, that disagreement is money.
The fix is old and boring. The service writes the event to an outbox table in the same Postgres transaction as the business write. A separate relay (Debezium reading the Postgres WAL, or a polling worker) ships rows from the outbox to Kafka and marks them delivered. The business transaction and the “will be published” record commit together or roll back together. There is no window in which the database knows something the bus does not.
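A minimal sketch of the write side, assuming Postgres through database/sql; table and column names are illustrative:

```go
package payments

import (
	"context"
	"database/sql"
	"encoding/json"

	"github.com/google/uuid"
)

type Payment struct {
	ID          string `json:"id"`
	AmountCents int64  `json:"amount_cents"`
}

// The business row and the outbox row commit together or roll back together.
func authorize(ctx context.Context, db *sql.DB, p Payment) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO payments (id, amount_cents, state) VALUES ($1, $2, 'authorized')`,
		p.ID, p.AmountCents); err != nil {
		return err
	}
	payload, err := json.Marshal(p)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (id, topic, key, payload)
		 VALUES ($1, 'payment.authorized', $2, $3)`,
		uuid.NewString(), p.ID, payload); err != nil {
		return err
	}
	return tx.Commit()
}
```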
The one thing people skip is making the relay itself idempotent. Publish with a deterministic message_id derived from the outbox row’s primary key, and let Kafka’s idempotent producer semantics or the consumer’s dedup table absorb the retry. Without that, a relay restart mid batch double publishes every event in flight.
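For the polling flavor of the relay, the shape is roughly this; kafka-go shown, and a Debezium pipeline replaces the loop while the deterministic message_id survives either way:

```go
// Each row publishes under its own primary key as message_id, so a restart
// that re-reads a batch resends the same ids and the downstream dedup
// absorbs them.
func relayBatch(ctx context.Context, db *sql.DB, w *kafka.Writer) error {
	rows, err := db.QueryContext(ctx,
		`SELECT id, topic, key, payload FROM outbox
		 WHERE delivered_at IS NULL ORDER BY id LIMIT 100`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id, topic string
		var key, payload []byte
		if err := rows.Scan(&id, &topic, &key, &payload); err != nil {
			return err
		}
		err := w.WriteMessages(ctx, kafka.Message{
			Topic:   topic,
			Key:     key,
			Value:   payload,
			Headers: []kafka.Header{{Key: "message_id", Value: []byte(id)}},
		})
		if err != nil {
			return err // row stays undelivered; the next pass retries it
		}
		if _, err := db.ExecContext(ctx,
			`UPDATE outbox SET delivered_at = now() WHERE id = $1`, id); err != nil {
			return err
		}
	}
	return rows.Err()
}
```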
At least once is the default. Plan for it explicitly.
Every consumer in a payments system will receive every event at least once, and will occasionally receive it more than once. This is not a Kafka limitation. It is the only delivery guarantee compatible with “the consumer crashed after processing but before committing the offset.” Exactly once exists for Kafka to Kafka pipelines under specific configuration. It does not exist once your side effect is a card network API call.
The engineering consequence is that every handler in a payments platform is an idempotent function of the event, or it is a bug. Concretely:
- Each event carries a stable business key that is not the Kafka offset. Use the authorization ID, the transfer request ID, or the webhook event ID from the upstream provider.
- Every state changing handler persists that key before performing the side effect, inside a transaction that also records the result. A second delivery finds the key, reads the prior result, and returns it unchanged; a sketch of this shape follows the list.
- The dedup table is owned by the consumer, not the bus. You cannot trust the producer to retry safely; you have to assume it will not.
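A sketch of that claim-then-record shape, assuming Postgres; the capture function is injected because the provider client is not ours to name, and it should carry the same business key as its own idempotency key:

```go
type CaptureEvent struct {
	AuthorizationID string
	AmountCents     int64
}

func handleCapture(ctx context.Context, db *sql.DB, ev CaptureEvent,
	capture func(context.Context, CaptureEvent) ([]byte, error)) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Claim the business key. A second delivery conflicts here.
	res, err := tx.ExecContext(ctx,
		`INSERT INTO processed_events (business_key) VALUES ($1)
		 ON CONFLICT (business_key) DO NOTHING`, ev.AuthorizationID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		// Already handled: read the prior result and return it unchanged.
		var prior []byte
		return tx.QueryRowContext(ctx,
			`SELECT result FROM processed_events WHERE business_key = $1`,
			ev.AuthorizationID).Scan(&prior)
	}

	out, err := capture(ctx, ev) // the side effect
	if err != nil {
		return err // tx rolls back, the key is not persisted, a retry is safe
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE processed_events SET result = $2 WHERE business_key = $1`,
		ev.AuthorizationID, out); err != nil {
		return err
	}
	return tx.Commit()
}
```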
The test for this is ugly but mandatory. In staging, replay a random hour of the event stream into a warmed consumer and assert the resulting state is identical to the state before the replay. We run this weekly. It finds regressions that no unit test will.
Ordering is a per entity property, not a global one
Kafka preserves order within a partition. It does not, and cannot, preserve order across partitions. In a payments platform, the question is which entity’s order matters. The answer is almost always the account, the customer, or the payment, never the platform as a whole.
Partition by the entity whose state transitions you are protecting. For authorization and capture events, the partition key is the authorization ID: a capture cannot be processed before its authorization, and the bus will now enforce that by construction. For customer scoped state, partition by customer ID. A global ordering key (all payment events on one partition) is either a throughput bottleneck or a lie. It scales until it does not, and then you repartition under pressure.
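With kafka-go, for instance, this is a hashing balancer and a disciplined choice of key; the broker address and topic are illustrative:

```go
// Every event for one authorization hashes to one partition, so a capture
// cannot overtake its authorization on the wire.
func newEventWriter() *kafka.Writer {
	return &kafka.Writer{
		Addr:     kafka.TCP("broker-1:9092"),
		Topic:    "payment.events",
		Balancer: &kafka.Hash{}, // key -> partition, stable across producers
	}
}

func publish(ctx context.Context, w *kafka.Writer, authorizationID string, payload []byte) error {
	return w.WriteMessages(ctx, kafka.Message{
		Key:   []byte(authorizationID), // the entity whose transitions we protect
		Value: payload,
	})
}
```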
When two entities must be correlated and one lands before the other, for example a capture event arriving before the authorization enrichment event, the consumer’s job is to buffer, not to assume. Persist the out of order event against its business key, wait for its partner with a bounded timeout, and treat the timeout as a first class error state with its own alert. Sweeping this under “retries” is how silent inconsistencies compound.
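A sketch of the buffering side, with illustrative table names and applyCapture standing in for the normal path; the sweeper that pages on expired deadlines is the part teams forget to build:

```go
// A capture arriving before its authorization is parked, not processed.
func onCapture(ctx context.Context, db *sql.DB, ev CaptureEvent) error {
	var exists bool
	if err := db.QueryRowContext(ctx,
		`SELECT EXISTS (SELECT 1 FROM authorizations WHERE id = $1)`,
		ev.AuthorizationID).Scan(&exists); err != nil {
		return err
	}
	if !exists {
		raw, err := json.Marshal(ev)
		if err != nil {
			return err
		}
		_, err = db.ExecContext(ctx,
			`INSERT INTO parked_events (business_key, payload, deadline)
			 VALUES ($1, $2, now() + interval '5 minutes')
			 ON CONFLICT (business_key) DO NOTHING`,
			ev.AuthorizationID, raw)
		return err // replayed when the authorization lands, or swept
	}
	return applyCapture(ctx, db, ev)
}

// The sweeper runs on a timer and treats an expired wait as an incident:
//   SELECT business_key FROM parked_events WHERE deadline < now()
// feeds an alert, not a silent retry.
```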
Sagas, compensation, and the refund that beat the capture
Distributed transactions in payments are sagas. A sequence of local transactions, each of which has an explicit compensating action. A cross border transfer that debits a sender account, calls an FX provider, and credits a recipient account is three local steps with three compensations: release the debit, cancel the FX quote, void the credit. Write the compensations first. If you cannot write the compensation for a step, you cannot safely take the step.
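One way to make “write the compensations first” structural is to refuse to represent a step without one. A sketch, with TransferState and the six step functions assumed to exist:

```go
// A step does not exist without its compensation.
type Step struct {
	Name       string
	Run        func(ctx context.Context, s *TransferState) error
	Compensate func(ctx context.Context, s *TransferState) error
}

var crossBorderTransfer = []Step{
	{"debit_sender", debitSender, releaseDebit},
	{"fx_quote", bookFXQuote, cancelFXQuote},
	{"credit_recipient", creditRecipient, voidCredit},
}

func runSaga(ctx context.Context, steps []Step, s *TransferState) error {
	for i, step := range steps {
		if err := step.Run(ctx, s); err != nil {
			// Unwind the completed steps in reverse order.
			for j := i - 1; j >= 0; j-- {
				if cerr := steps[j].Compensate(ctx, s); cerr != nil {
					return cerr // a failed compensation is an operator page
				}
			}
			return err
		}
	}
	return nil
}
```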
The counterintuitive failure mode is the one where the compensation outruns the forward path. A user clicks “cancel” while the capture is still in flight; the refund event hits the provider before the capture event does. We have seen this break a platform that assumed capture would always precede refund. The fix is in the saga, not in the bus. Every compensating step is guarded by a precondition check against the current saga state, persisted in the saga coordinator’s own store. If the capture has not landed, the refund either waits on it or is explicitly marked as “cancel before capture,” a different operation with different accounting.
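The guard itself is small. A sketch against a hypothetical saga store; SagaStore, the phase constants, provider, and ErrRetryLater are all names we are assuming:

```go
// The compensation consults the coordinator's own persisted state before acting.
func compensateCapture(ctx context.Context, store SagaStore, paymentID string) error {
	st, err := store.Load(ctx, paymentID)
	if err != nil {
		return err
	}
	switch st.Phase {
	case PhaseCaptured:
		return provider.Refund(ctx, paymentID) // the ordinary compensation
	case PhaseCaptureInFlight:
		return ErrRetryLater // wait for the forward path to land
	default:
		// A different operation with different accounting: record a cancel
		// before capture and void the authorization instead of refunding.
		return store.MarkCancelBeforeCapture(ctx, paymentID)
	}
}
```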
Customer facing UX is where eventual consistency gets expensive
“Read your writes” is the bug report that never uses those words. The customer hits “pay,” the POST succeeds, the redirect lands them on the orders page, and the order is not there yet, because the orders read model is a projection that has not caught up to the event stream. They click pay again. You have now charged them twice unless the idempotency work upstream is flawless.
Two techniques carry the weight. First, the write path returns the projected state directly: a Location header, a prerendered receipt, a payment intent object. The client does not need to read it again to confirm. Second, the UI that does read again is aware of the consistency window and shows an honest loading state for the duration, not a blank grid. “Your payment is processing” is a better answer than “No payments found” for the three seconds it takes the projection to catch up.
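The first technique is a few lines in the handler. A sketch, with createPaymentIntent standing in for the write path described above:

```go
// The POST response carries the projected state, so the client never
// depends on the read model having caught up.
func handlePay(w http.ResponseWriter, r *http.Request) {
	intent, err := createPaymentIntent(r) // business write plus outbox row
	if err != nil {
		http.Error(w, "payment failed", http.StatusBadGateway)
		return
	}
	w.Header().Set("Location", "/payments/"+intent.ID)
	w.WriteHeader(http.StatusCreated)
	_ = json.NewEncoder(w).Encode(intent) // the prerendered receipt, in effect
}
```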
Protect the write path with a server side idempotency key that is generated before the user clicks, not after. The key travels with the form. A resubmit returns the original result. A double submit due to an impatient refresh is a noop. Every payments provider we integrate with requires this on their API surface for the same reason. Make it the default on yours.
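A sketch of the server side, with KeyStore as an assumed interface over a Postgres table; a production version wraps the ResponseWriter instead of leaning on httptest:

```go
func idempotent(store KeyStore, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Idempotency-Key")
		if key == "" {
			http.Error(w, "missing Idempotency-Key", http.StatusBadRequest)
			return
		}
		// A resubmit returns the original result unchanged.
		if prior, ok := store.Get(r.Context(), key); ok {
			w.WriteHeader(prior.Status)
			_, _ = w.Write(prior.Body)
			return
		}
		rec := httptest.NewRecorder() // capture the response so it can be stored
		next(rec, r)
		store.Put(r.Context(), key, rec.Code, rec.Body.Bytes())
		w.WriteHeader(rec.Code)
		_, _ = w.Write(rec.Body.Bytes())
	}
}
```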
Observability is the schema of the platform, not a dashboard
Tracing across an event bus is harder than tracing across HTTP because the causality is not in the transport. Propagate a causation chain explicitly. Every event carries its own ID, the ID of the event that caused it, and the ID of the originating user action. Store these in every consumer. Now the question “what happened to payment pay_01HX…” is answerable by a single indexed query across services, without joining to logs.
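Concretely, the envelope is three IDs and a payload (field names are ours; uuid is github.com/google/uuid), and any consumer that emits a follow-on event threads the chain forward:

```go
type Envelope struct {
	EventID       string          `json:"event_id"`       // this event
	CausationID   string          `json:"causation_id"`   // the event that caused it
	CorrelationID string          `json:"correlation_id"` // the originating user action
	Type          string          `json:"type"`
	Payload       json.RawMessage `json:"payload"`
}

func derive(parent Envelope, typ string, payload json.RawMessage) Envelope {
	return Envelope{
		EventID:       uuid.NewString(),
		CausationID:   parent.EventID,
		CorrelationID: parent.CorrelationID, // constant along the whole chain
		Type:          typ,
		Payload:       payload,
	}
}
```

With correlation_id indexed in every consumer’s store, the question about pay_01HX… becomes one WHERE clause per service.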
The two metrics that matter most, and that almost no one graphs on day one, are consumer lag by partition and dead letter rate by event type. Lag tells you convergence is broken before the customer does. Dead letter rate tells you the handler is lying about idempotency, because a true idempotent handler has no reason to refuse a retry. Alert on both. Read the dead letter queue every morning for the first quarter; you will find business logic bugs that no QA pass would have caught.
The test that matters
Here is the exercise I have run with every payments team I have joined. Pick a recent payment at random. From the event log alone, with no application database and no provider dashboard, reconstruct the full causal history: the user intent that initiated it, every service that touched it, every retry, every compensation, every projection update, ending at the final settled state. Can you do it in under five minutes? Can an engineer who joined last month do it?
If yes, your eventual consistency is a contract you can hold customers to. If no, it is a story you are telling yourself, and the customer will eventually be the one to notice the difference. Payments is not the domain where you want that feedback loop.