In short

  • Idempotency is the fastest way to prevent duplicate side effects in webhook systems.
  • You don't need a perfect architecture to start. You need clear keys, storage rules, and replay discipline.
  • A small, consistent checklist protects billing, inventory, and user-facing workflows from day one.

If your team ships quickly, duplicates will happen. At-least-once delivery systems guarantee it. The goal isn't to prevent every duplicate forever. The goal is to make duplicates harmless when they arrive.

Why this matters earlier than most teams think

A lot of teams postpone idempotency until they hit scale. In practice, duplicate events show up during routine incidents: a timeout triggers a retry, a queue replays events after a restart, or a provider resends a batch after a failed acknowledgment.

Here's a concrete scenario. A billing webhook fires for a $200 charge. The endpoint processes it but responds slowly. The provider times out and retries. Now your system has processed the same charge twice. That's a support ticket, a refund, and a trust hit. A five-line idempotency check prevents the entire chain.

The earlier you add idempotency to high-risk flows, the fewer of these incidents you'll deal with.

[Figure: Idempotency flow — how idempotency keys prevent duplicate processing. New events write an atomic key and are processed; duplicates skip safely.]

1. Pick one stable idempotency key

Start with a single key strategy and keep it consistent across your services. This gives your team a shared contract that everyone can reason about.

In order of preference:

  1. Provider event ID (if the provider guarantees stability across retries)
  2. Delivery ID from your ingress layer (if you control the entry point)
  3. Deterministic hash of business-critical fields (amount + customer ID + timestamp, for example)

The important thing is consistency. If your billing service uses the provider event ID and your notifications service uses a hash, debugging cross-service duplicates becomes painful.
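
As a sketch of that preference order, here is one way to derive a key: take the provider event ID when present, otherwise hash the business-critical fields deterministically. The field names (`id`, `amount`, `customer_id`, `timestamp`) are illustrative, not from any specific provider:

```python
import hashlib
import json

def idempotency_key(event: dict) -> str:
    """Prefer the provider's event ID; fall back to a deterministic hash."""
    # Option 1: provider event ID, assuming it is stable across retries
    if event.get("id"):
        return f"evt:{event['id']}"
    # Option 3: deterministic hash of business-critical fields
    payload = json.dumps(
        {k: event[k] for k in ("amount", "customer_id", "timestamp")},
        sort_keys=True,  # stable ordering so a retried event hashes identically
    )
    return "sha:" + hashlib.sha256(payload.encode()).hexdigest()
```

Because the hash input is canonicalized (sorted keys), the same logical event always produces the same key, which is the property the whole dedupe layer depends on.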

For HTTP semantics and how status codes interact with retries, RFC 9110 is the reference.

2. Store dedupe decisions with a clear TTL

A dedupe key without a retention policy creates confusion over time. How long should a key block duplicates? That depends on the business risk of a late duplicate.

Event type | Suggested TTL | Why
Billing and payments | 30 days | Protects financial side effects and gives support time to investigate
Subscription lifecycle | 7 to 14 days | Covers most retry and replay windows
Non-critical notifications | 24 to 72 hours | Lower business risk, lower storage cost

If you're unsure, start conservative and shorten later once you have data on your actual retry and replay patterns.
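
A minimal in-memory sketch of TTL-scoped dedupe storage, using the suggested windows above (production would typically use Redis with `SET key value NX EX ttl` or a database row with an expiry column; the class and method names here are illustrative):

```python
import time

# Suggested TTLs from the table above, in seconds (tune to your retry windows)
TTL_BY_EVENT_TYPE = {
    "billing": 30 * 86400,
    "subscription": 14 * 86400,
    "notification": 72 * 3600,
}

class DedupeStore:
    """In-memory sketch; not safe across processes, unlike Redis or a DB."""

    def __init__(self):
        self._keys = {}  # key -> expiry timestamp

    def claim(self, key, event_type, now=None):
        """Return True if this key is new (caller should process the event)."""
        now = time.time() if now is None else now
        expiry = self._keys.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the TTL window: skip safely
        self._keys[key] = now + TTL_BY_EVENT_TYPE[event_type]
        return True
```

Note that after the TTL expires the key no longer blocks duplicates, which is exactly the trade-off the table is making per event type.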

3. Track processing state, not just "seen or not"

A simple "yes, we've seen this event" flag works until something goes wrong during processing. If the event was received but failed halfway through, you need to know that; otherwise, replays won't work correctly.

Track these states:

  • received (event arrived, not yet processed)
  • processing (handler is working on it)
  • completed (successfully processed)
  • failed-retryable (failed, safe to retry)
  • failed-terminal (failed, needs manual intervention)

This lets your operations team understand exactly what happened without digging through raw logs. During an incident, that clarity saves real time.
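
The states above can be encoded with an explicit transition map, so an event can only move along legal paths (for example, a replay may only re-enter processing from a retryable failure). This is a sketch; the names match the list above but the transition rules are one reasonable choice, not a standard:

```python
from enum import Enum

class EventState(Enum):
    RECEIVED = "received"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED_RETRYABLE = "failed-retryable"
    FAILED_TERMINAL = "failed-terminal"

# Legal transitions; terminal states have no outgoing edges
TRANSITIONS = {
    EventState.RECEIVED: {EventState.PROCESSING},
    EventState.PROCESSING: {
        EventState.COMPLETED,
        EventState.FAILED_RETRYABLE,
        EventState.FAILED_TERMINAL,
    },
    EventState.FAILED_RETRYABLE: {EventState.PROCESSING},  # replay path
    EventState.COMPLETED: set(),
    EventState.FAILED_TERMINAL: set(),  # needs manual intervention
}

def advance(current, target):
    """Move to target state, rejecting illegal transitions loudly."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions at write time is what keeps the stored state trustworthy during an incident.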

4. Handle race conditions explicitly

Two workers can pick up the same event at almost the same time. If your key write isn't atomic, both workers might check "has this been processed?", both get "no," and both proceed. Now you have a duplicate.

Use one of these atomic guard patterns:

  • Unique key insert with conflict detection (Postgres INSERT ... ON CONFLICT DO NOTHING)
  • Transactional check-then-write under a row lock (MySQL SELECT ... FOR UPDATE)
  • Compare-and-set with conditional write (DynamoDB conditional expressions)

The exact primitive depends on your datastore, but the principle is the same: exactly one worker wins the write. Everyone else exits cleanly without processing.
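
The first pattern can be sketched with Python's standard-library sqlite3, whose INSERT OR IGNORE plays the same role as Postgres's INSERT ... ON CONFLICT DO NOTHING: the primary-key constraint makes the insert atomic, so exactly one caller per event ID sees a successful write:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dedupe (event_id TEXT PRIMARY KEY)")

def try_claim(conn, event_id):
    """Return True for exactly one caller per event_id; all others get False.

    sqlite's INSERT OR IGNORE stands in for Postgres's
    INSERT ... ON CONFLICT DO NOTHING here.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO dedupe (event_id) VALUES (?)", (event_id,)
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted -> we won; 0 -> duplicate, exit cleanly
```

The caller that gets False should acknowledge the event and stop: the winner is (or was) handling it.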

5. Make replay a first-class workflow

Idempotency and replay go hand in hand. When an incident is over and the root cause is fixed, you need to reprocess the events that failed. That's replay.

A safe replay flow should capture:

  • Who triggered the replay and why
  • Which events are included (single event, time range, filter criteria)
  • A dry-run option for batch replays
  • A verification step after replay completes

Without structure around replays, someone will eventually replay 50,000 events into production without checking what they do. Dry-run mode prevents that.
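
One way to make that structure concrete is to require the audit metadata up front and default to dry-run, so a batch replay previews its scope before touching production. The type and field names here are hypothetical, not from any particular tool:

```python
from dataclasses import dataclass, field
import datetime

@dataclass
class ReplayRequest:
    """Audit metadata a replay captures before anything is reprocessed."""
    triggered_by: str        # who triggered the replay
    reason: str              # why (incident ticket, root cause)
    event_ids: list          # which events (could also be a time range or filter)
    dry_run: bool = True     # default to dry-run so batch replays are previewed
    started_at: datetime.datetime = field(
        default_factory=lambda: datetime.datetime.now(datetime.timezone.utc)
    )

def run_replay(req, handler):
    """Dry-run reports what would be replayed; a live run invokes the handler."""
    if req.dry_run:
        return {"mode": "dry-run", "would_replay": len(req.event_ids)}
    results = {eid: handler(eid) for eid in req.event_ids}
    return {"mode": "live", "replayed": len(results)}
```

The verification step from the list above would then compare the live run's results against the dry-run count before the replay is marked done.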

For the retry side of this, see the retry best practices guide.

6. Add visibility your team can act on

A checklist is only complete when you can see where the system is drifting. Track at least these four signals:

  • Dedupe hit rate: how often are duplicates actually arriving?
  • Duplicate side-effect incidents: how many times did a duplicate cause real damage?
  • Replay success rate: when you replay events, how often do they succeed?
  • Key-store latency: is your dedupe lookup slowing down as the table grows?

If dedupe hit rate jumps from 2% to 15%, something changed upstream. Maybe a provider is retrying more aggressively, or one of your services started resending events. Either way, you want to know before it becomes a customer issue.
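
The hit-rate signal and a simple jump alert can be sketched in a few lines (the 2% baseline and 5x alert factor are illustrative thresholds, not recommendations):

```python
def dedupe_hit_rate(duplicates_seen, events_received):
    """Share of incoming events that were duplicates of an already-seen key."""
    if events_received == 0:
        return 0.0
    return duplicates_seen / events_received

def should_alert(rate, baseline=0.02, factor=5.0):
    """Flag when the hit rate jumps well past its baseline."""
    return rate >= baseline * factor
```

With a 2% baseline and a 5x factor, the 15% jump described above would trip the alert before a customer notices.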

For a full monitoring setup, see the webhook monitoring checklist.

7. Start with one endpoint and expand

You don't need a platform rewrite to get value from idempotency. Start with the endpoint that would hurt the most if duplicates slipped through.

A practical rollout:

  1. Pick your highest-risk endpoint (billing, entitlement, or inventory)
  2. Add key storage and conflict handling
  3. Add replay metadata (who, why, scope)
  4. Add dashboards and alerts for dedupe metrics
  5. Expand to the next endpoint

This keeps momentum high. Each endpoint you protect is a concrete win your team can point to.

Ready to ship with confidence?

If you want the safest first week, implement idempotency keying, TTL storage, and atomic conflict handling. Then add replay tooling and visibility. That sequence gives strong protection with minimal friction.
