This guide is for developers and technical leads building webhook receivers for production systems — Stripe events, Xero updates, Slack interactions, anything where another service posts data to your endpoint and expects it to be processed. By the end you will know how to verify signatures, handle duplicates safely, deal with the inevitable retries, log enough to debug a production issue at 3am, and design for the failure modes that webhooks reliably produce.
Who This Guide Is For
Developers and technical architects building or maintaining custom integrations where a third-party service pushes events to your application. This applies whether you are processing five webhooks a day from a single integration or fifty thousand a day from a busy e-commerce platform. The reliability patterns are the same; the failure cost is what scales.
Before You Start
You should know which service is sending the webhooks, what events it sends, and what the documented retry behaviour is. Different platforms behave very differently: Stripe retries aggressively for up to three days, Slack retries three times within a few minutes, GitHub does not retry at all if you return a 5xx. The handler design depends on these behaviours.
You should also have a sense of what the webhook does in your system. A webhook that triggers an invoice creation is a different reliability target to one that just logs an analytics event. Match the rigour to the consequence.
Always Verify the Signature
The first responsibility of any webhook handler is to confirm the request actually came from the sender. Webhooks are public HTTPS endpoints; anyone on the internet can post to them. Without signature verification, anyone can post fake events to your system and trigger whatever logic the handler runs.
Most webhook providers sign requests with a shared secret. Stripe uses the Stripe-Signature header with HMAC-SHA256, Slack uses X-Slack-Signature, and GitHub uses X-Hub-Signature-256. The pattern is the same: take the raw request body, the timestamp where the scheme includes one, and the shared secret; compute an HMAC; compare it to the value in the header. If it does not match, reject the request.
Some critical points that get missed:
- Verify the signature on the raw body. Once you parse the JSON, you have changed the byte sequence (whitespace, key order) and the signature no longer matches. Verify first, parse second.
- Reject requests with old timestamps. Many signature schemes include a signed timestamp. A request whose timestamp is more than about five minutes old is either a replay or too stale to trust, so reject it.
- Use constant-time comparison. Comparing the computed signature to the header with a regular string equality is vulnerable to timing attacks. Use the language’s constant-time compare function (hash_equals in PHP, crypto.timingSafeEqual in Node).
If verification fails, return a 401 and log the attempt. Do not process the payload. The few failed legitimate requests this might cost are far cheaper than processing a fake event.
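The checks above can be sketched in Python with the standard library alone. This is a minimal sketch of a generic scheme, not any provider's exact format: the signed message layout (timestamp, a dot, then the raw body), the hex encoding, and the names verify_webhook and TOLERANCE_SECONDS are assumptions for illustration. Check the sender's documentation for the real layout.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # five-minute window for stale or replayed requests


def verify_webhook(raw_body: bytes, timestamp: str, signature_hex: str,
                   secret: bytes, now: Optional[float] = None) -> bool:
    """Return True only if the timestamp is fresh and the signature matches.

    Assumed (hypothetical) scheme: hex HMAC-SHA256 over b"<timestamp>.<raw body>".
    """
    now = time.time() if now is None else now
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False  # malformed timestamp header
    if abs(now - sent_at) > TOLERANCE_SECONDS:
        return False  # outside the accepted window

    # Sign the raw bytes exactly as received; do not parse the JSON first.
    signed_payload = timestamp.encode("utf-8") + b"." + raw_body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()

    # Constant-time comparison; never use == on signatures.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_hex.encode("utf-8"))
```

A caller would read the raw request bytes and the two headers from the framework, call verify_webhook, and return 401 when it yields False, before any JSON parsing happens.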
Return Fast, Process Asynchronously
The naive pattern is to do all the work in the webhook handler — process the event, update the database, call other services, send notifications, then return 200. This is wrong for almost every production webhook.
The problem is that webhook senders have timeouts. Stripe gives you about 10 seconds. If your handler takes longer, Stripe gives up, considers the delivery failed, and retries. You end up doing the same work twice, or your work fails partway and leaves the system in an inconsistent state.
The correct pattern is to receive the webhook, verify the signature, write the event to a queue or storage, return 200 immediately, then process the event asynchronously. The handler does the minimum work: validate, persist, acknowledge. A worker process picks up the event from the queue and does the actual processing.
A concrete example. A Stripe webhook for invoice.payment_succeeded triggers three things in our system: an invoice status update, a customer notification email, and a record in the audit log. The handler itself does none of those. It only writes the event to a webhook_events table and returns 200 within 100 milliseconds. A worker reads the table and does the three downstream actions. If any one of them fails, only that step retries; the other two are not blocked.
The added benefit is that the queue gives you replay capability. If the worker has a bug, fix it and re-process the events. If a downstream service was down, the queue keeps the events until it is back.
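The receive, persist, acknowledge split might look like the following. It is a sketch under assumptions: SQLite stands in for whatever queue or table you actually use, the webhook_events columns are invented for illustration, and receive and process_pending are hypothetical names rather than any framework's API.

```python
import json
import sqlite3


def init_queue_db(conn: sqlite3.Connection) -> None:
    # Illustrative schema; a real table would also carry timestamps, event IDs, etc.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS webhook_events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               raw_body BLOB NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )


def receive(conn: sqlite3.Connection, raw_body: bytes, signature_ok: bool) -> int:
    """Handler path: validate, persist, acknowledge. Returns an HTTP status code."""
    if not signature_ok:
        return 401
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO webhook_events (raw_body) VALUES (?)", (raw_body,))
    return 200


def process_pending(conn: sqlite3.Connection, handle) -> int:
    """Worker path: run handle(event) on each pending row; returns how many succeeded."""
    rows = conn.execute(
        "SELECT id, raw_body FROM webhook_events WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    succeeded = 0
    for row_id, raw_body in rows:
        try:
            handle(json.loads(raw_body))
            status = "done"
            succeeded += 1
        except Exception:
            status = "failed"  # the row stays in the table for retry or replay
        with conn:
            conn.execute(
                "UPDATE webhook_events SET status = ? WHERE id = ?", (status, row_id)
            )
    return succeeded
```

Because the raw body stays in the table, replaying after a worker bug fix amounts to resetting the status of the affected rows and running the worker again.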
Make the Handler Idempotent
Every webhook will eventually be delivered more than once. The sender retries on timeout. The network duplicates. The infrastructure replays during failover. If your handler is not idempotent — if processing the same event twice produces different results from processing it once — you will have data corruption.
The standard pattern: every webhook event has an ID provided by the sender (Stripe’s event.id, Slack’s event_id, etc.). Before processing, check whether you have already seen this ID. If yes, return 200 and skip processing. If no, record the ID and proceed.
The recording must be transactional with the processing. If you check the ID, process the event, and only then record the ID, a duplicate delivery that arrives after the check but before the record is written will be processed twice. Use a database constraint (a unique index on the event ID) and let the database enforce idempotency: insert the ID in the same transaction as the processing, and when a second insert fails, treat that failure as “already processed”.
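One way to let the database enforce this is sketched below with SQLite. The processed_events and invoices tables and the process_once helper are assumptions made up for the example; the point is only that the ID insert and the side effects share one transaction.

```python
import sqlite3


def init_idempotency_db(conn: sqlite3.Connection) -> None:
    # PRIMARY KEY gives the unique index that rejects a repeated event ID.
    conn.execute("CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE IF NOT EXISTS invoices (id TEXT PRIMARY KEY, status TEXT)")


def process_once(conn: sqlite3.Connection, event_id: str, apply) -> bool:
    """Run apply(conn) at most once per event_id. Returns False for a duplicate.

    If apply raises, the transaction rolls back, including the recorded ID,
    so a later redelivery can try again.
    """
    with conn:  # one transaction: the ID record and the side effects commit together
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # unique constraint hit: already processed
        apply(conn)
    return True
```

This only covers side effects that live in the same database. Work that leaves the database, such as sending an email, needs its own guard, for example an idempotency key on the outbound call.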
For events without a usable ID (rare in modern APIs), generate a deterministic ID from the payload contents: a hash of the relevant fields. The downside is that a hash only matches identical field values. A redelivery whose hashed fields differ slips through as new, and two genuinely separate events with identical fields collapse into one. Read the docs carefully before using this fallback.
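A deterministic fallback ID could be derived roughly as follows. The helper name fallback_event_id and the field names in the usage note are hypothetical; which fields count as relevant depends entirely on the sender's payload.

```python
import hashlib
import json


def fallback_event_id(payload: dict, fields: tuple) -> str:
    """Derive a stable ID by hashing only the chosen fields of the payload."""
    relevant = {name: payload.get(name) for name in fields}
    # Canonical JSON: sorted keys and fixed separators, so key order and
    # whitespace in the incoming payload do not change the hash.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

For instance, calling it with fields such as ("type", "order_id", "updated_at") yields the same ID for two deliveries that differ only in fields outside that tuple, which is the behaviour to check against the provider's documentation.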
Log Enough to Debug
When a webhook fails in production at 3am, the diagnosis depends entirely on what you logged. The patterns that pay back:
- Log the full raw payload. Disk is cheap; the ability to replay the exact request later is invaluable. If the payload includes sensitive data (PII, payment details), redact selectively rather than dropping the log entirely.
- Log the verification step explicitly. Signature passed, timestamp accepted, event ID checked. When verification fails in production, you want to know which step failed.
- Log the processing outcome. Success, retry queued, permanent failure. Tag with the event ID so you can correlate logs across systems.
- Tag with a request ID. Most frameworks generate one per request. Pass it through to the worker so the entire processing trail can be reconstructed from one identifier.
The discipline is that someone looking at the logs three weeks later should be able to answer “what happened with event X” without needing the original developer. That requires structured logs that survive search, not free-text strings that drift in format.
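As an illustration of structured, searchable log lines, here is a small sketch that emits one JSON object per step. The key names, the SENSITIVE_KEYS set, and the log_line helper are assumptions for the example, not a prescribed schema.

```python
import json
import logging

logger = logging.getLogger("webhooks")

# Assumed names of sensitive fields; adjust to the payloads you actually receive.
SENSITIVE_KEYS = {"email", "card_number"}


def redact(value):
    """Recursively mask sensitive keys while keeping the payload's structure."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else redact(inner)
            for key, inner in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def log_line(step: str, outcome: str, event_id: str, request_id: str,
             payload=None) -> str:
    """Emit one JSON log line tagged with the event ID and request ID."""
    record = {
        "step": step,
        "outcome": outcome,
        "event_id": event_id,
        "request_id": request_id,
    }
    if payload is not None:
        record["payload"] = redact(payload)
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

With every line carrying the same event_id and request_id keys, answering "what happened with event X" becomes a single search on one field.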
Handle Failures the Right Way
Failures in the asynchronous processing are inevitable. Network blips, third-party API timeouts, transient database issues. The handler design needs to distinguish between failures that should retry and failures that should not.
The categories:
- Transient: third-party service is temporarily unavailable, network glitched, the database had a brief connection issue. Retry with exponential backoff — 30 seconds, two minutes, ten minutes, an hour.
- Permanent: the payload references a record that does not exist, the requested operation is invalid, the input fails validation. Retrying will not help. Move to a dead letter queue and alert a human.
- Unknown: an unexpected exception. Retry a few times, then escalate to permanent and alert.
The dead letter queue is the safety net. Permanent failures and exhausted retries land there, and a human reviews them. The queue should be queryable and small — if it is filling up with hundreds of events, the system has a real problem that needs attention, not just a hopper of inevitable failures.
Set a retry cap. Three to five retries with exponential backoff covers transient failures; beyond that, the failure is almost certainly not transient. Continuing to retry indefinitely fills the queue with hopeless events and obscures the real ones.
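The three categories and the retry cap can be expressed as a small decision function. This is a sketch: the exception classes, the next_action name, and the cap of three retries for unknown errors are assumptions, and the delays reuse the 30 seconds, two minutes, ten minutes, one hour schedule described above.

```python
BACKOFF_SECONDS = [30, 120, 600, 3600]  # delay before retry 1, 2, 3, 4
MAX_UNKNOWN_RETRIES = 3                 # assumed cap for unexpected exceptions


class TransientError(Exception):
    """Temporary condition: timeout, dropped connection, service unavailable."""


class PermanentError(Exception):
    """Retrying cannot help: missing record, invalid operation, failed validation."""


def next_action(error: Exception, retries_done: int):
    """Return ("retry", delay_seconds) or ("dead_letter", None)."""
    if isinstance(error, PermanentError):
        return ("dead_letter", None)
    if isinstance(error, TransientError):
        cap = len(BACKOFF_SECONDS)
    else:
        cap = MAX_UNKNOWN_RETRIES  # unknown: retry a few times, then escalate
    if retries_done >= cap:
        return ("dead_letter", None)  # retries exhausted
    return ("retry", BACKOFF_SECONDS[retries_done])
```

The worker would call next_action in its exception handler, either rescheduling the event after the returned delay or moving it to the dead letter queue and raising an alert.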
Monitor the Webhook Stream
The webhook integration is a production system, and like any production system, it needs monitoring. The dimensions:
- Volume: the number of webhooks received per hour. A sudden drop usually means the sender has stopped delivering — investigate, because you might be silently missing events.
- Latency: the time from receipt to processing complete. Growing latency means the worker is falling behind.
- Failure rate: the percentage of events that end up in the dead letter queue. A spike means something has changed: the payload format, the downstream service, or your logic.
- Signature failures: a spike in signature mismatches can mean an attack, a key rotation issue, or a misconfiguration.
Alert on the things that matter. A drop in volume to zero for fifteen minutes is an alert. A failure rate above 2% is an alert. Signature failures above background levels are an alert. Tuning the thresholds takes a few weeks of operating data; do not skip it, because alert fatigue defeats the system.
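Those thresholds could be evaluated over a fifteen-minute window along these lines. The function name, the alert labels, and the idea of passing in pre-aggregated counts are assumptions for illustration; a real setup would more likely express the same rules in a monitoring tool.

```python
FAILURE_RATE_THRESHOLD = 0.02  # the 2% figure from the text


def evaluate_alerts(*, received: int, dead_lettered: int,
                    signature_failures: int,
                    baseline_signature_failures: int = 0) -> list:
    """Return the names of alerts that fire for one fifteen-minute window."""
    fired = []
    if received == 0:
        # No deliveries for the whole window: the sender may have stopped.
        fired.append("volume_zero")
    elif dead_lettered / received > FAILURE_RATE_THRESHOLD:
        fired.append("failure_rate")
    if signature_failures > baseline_signature_failures:
        fired.append("signature_failures")
    return fired
```

The baseline argument is where the few weeks of operating data come in: it starts as a guess and gets tuned until the alert fires only on genuine anomalies.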
Common Mistakes
- Doing all the work in the handler. Times out, retries fire, work duplicates. Receive, persist, return 200, process asynchronously.
- Not verifying signatures. Anyone on the internet can post events. Always verify, always before parsing the body.
- Comparing signatures with regular string equality. Vulnerable to timing attacks. Use constant-time compare.
- Non-idempotent processing. Duplicate deliveries are guaranteed. Idempotency is non-negotiable.
- Logging the happy path only. When something goes wrong, the logs are useless. Log verification, processing, and outcomes for every event.
- No dead letter queue. Permanently failing events get retried forever or silently dropped. Either is bad. Move them to a dead letter queue and alert.
- No monitoring on volume. The most insidious failure is a webhook that stops being delivered. Without volume monitoring, you find out weeks later when something else breaks downstream.
- Trusting an IP allowlist instead of the signature. IP allowlists work until the sender changes their infrastructure. Signature verification is what you can rely on.
What Good Looks Like
A well-built webhook handler verifies the signature on the raw body in constant time, returns 200 within 100 milliseconds for almost every request, processes the event asynchronously via a queue, is idempotent on the sender’s event ID, logs every step in a structured format, retries transient failures with exponential backoff, sends permanent failures to a dead letter queue, and is monitored on volume, latency, and failure rate. Six months in, the handler has processed hundreds of thousands of events with no data corruption, no silent failures, and a debugging path that anyone on the team can follow.
Next Steps
If the integration is one of many connecting business systems, How to Plan a Multi-System Integration covers the broader architecture. If your handler is exposing an API for outbound traffic as well, How to Secure an API and How to Rate-Limit an API are the next reads. For building integration infrastructure as part of a larger engagement, see API Integrations.