Stripe Webhook Ingress — Implementation Plan
Status: done
Implements stripe-webhook-ingress.md. Closes the spec/code divergence noted in Known Gap #1.
Goal
Bring the live code into compliance with the spec by introducing a fast-claim → async-process → sweeper-backstop architecture. Driven by the May 1 2026 production incident (383 P2028 errors during monthly billing burst).
Status
- [x] Item 3 —
WebhookEventService.findStuckEvents(shipped 2026-05-01) - [x] Item 1 — Refactor live ingress into claim + async-process
- [x] Item 2 — Export
processClaimedEventhelper - [x] Item 4 — New sweeper cron
- [x] Item 5 — Register the cron
- [x] Item 6 — Tests for items 1, 2, 4
- [x] Item 7 — Spec follow-up (update Known Gap #1, drop (target) from File map)
Driving citations
- Spec Invariant 1 (at-least-once) — requires retry mechanism beyond Stripe's redelivery
- Spec Invariant 3 (fast acknowledgment) — current code violates by running
handleEvent(incl. Stripe API round-trips) inside the request transaction - Spec State and transitions — requires Stale Claim recovery (no current mechanism)
- Spec Contracts → Sweeper cron — defines the new cron contract
Work breakdown
1. Refactor live ingress into claim + async-process
- Modify
apps/api/src/app/api/stripe/webhooks/route.ts - Split current
withTransaction(...)block at line 74 into two phases:- Phase A — claim transaction (small, fast): runs
WebhookEventService.findExistingEvent+createEventorclaimEventForProcessing. Returns{ webhookEventId, alreadyProcessed }. - Phase B — async process via
next/after: runsStripeWebhookService.handleEvent+markAsProcessedormarkAsFailed, then dispatches side effects.
- Phase A — claim transaction (small, fast): runs
- Tighten claim-tx options:
timeout: 5000, maxWait: 10000(claim is single-row work, doesn't need 30s) - Tighten process-tx options:
timeout: 25000, maxWait: 10000(5s slack vs.maxDuration: 30; matches globalcore.tsdefault Aaron set inc55e88c63) - Add P2028 to retry-eligible conditions at current
route.ts:88-92— match'p2028'and'unable to start a transaction'substrings. Spec Invariant 1 — transient claim failure should self-heal in-process before falling through to sweeper. - Preserve
__testPrismasynchronous bypass atroute.ts:65-67so the existing 290+ tests inapps/api/tests/api/stripe/*.api.test.tscontinue passing without modification. - Spec coverage: Invariants 1, 3, 4, 5; State transitions: (no row) → Claimed; Claimed → Processed/Failed; HTTP contract
2. Export processClaimedEvent helper from the route
- In
apps/api/src/app/api/stripe/webhooks/route.ts, add an exported helper: export async function processClaimedEvent(webhookEventId: string, event: Stripe.Event): Promise<void>- Internally: opens
withTransaction({ audit: { source: 'webhook:stripe' } }), runsStripeWebhookService.handleEvent(event, tx)+WebhookEventService.markAsProcessed(webhookEventId, tx), then dispatches side effects (the existingafter()body at currentroute.ts:269-359becomes a call to this helper). - On throw:
WebhookEventService.markAsFailed(webhookEventId, error.message, tx)+captureError. - Sweeper cron imports from
@/src/app/api/stripe/webhooks/routedirectly. Single source of truth without manufacturing a new service file for one function. - Spec coverage: Invariants 1, 5, 9; Cross-system dependencies — handlers preserve their
txparameter shape
3. Add findStuckEvents to WebhookEventService — DONE
- Shipped:
apps/api/src/services/shared/WebhookEventService.ts:findStuckEvents - Signature:
findStuckEvents(source, { staleAfterMs, lookbackMs, limit }, tx) → Promise<StuckWebhookEvent[]> StuckWebhookEvent = { id, eventId, eventType, payload }- Tests:
apps/api/tests/services/shared/WebhookEventService.unit.test.ts(9 tests) - Implementation note: raw SQL with
FOR UPDATE SKIP LOCKED:
SELECT id, event_id AS "eventId", event_type AS "eventType", payload
FROM webhook_events
WHERE source = $1
AND processed = false
AND created_at > NOW() - ($lookback_ms || ' milliseconds')::interval
AND (processing_started_at IS NULL
OR processing_started_at < NOW() - ($stale_ms || ' milliseconds')::interval)
ORDER BY created_at ASC
LIMIT $limit
FOR UPDATE SKIP LOCKED
-
Spec correction: initial draft of this plan included
AND created_at < NOW() - $stale, which would have blocked sweeper retries of freshly-failed events for 5 min. Spec Sweeper cron Behavior text (processed=false AND createdAt > NOW() - 24h AND (processingStartedAt IS NULL OR processingStartedAt < NOW() - 5min)) is canonical; the implementation matches the spec, not the original plan SQL. -
Spec coverage: Invariants 6, 7; State: Stale Claim, Failed; Sweeper cron Behavior
4. New sweeper cron
- New file
apps/api/src/app/api/cron/sweep-stuck-stripe-webhooks/route.ts - Mirror structure of
apps/api/src/app/api/cron/process-failed-transfers/route.tsexactly:withErrorHandlingwrapper- QStash signature verification with
CRON_SECRETBearer fallback runWithAuditContext({ source: 'cron:sweep-stuck-stripe-webhooks' }, ...)
- Per-iteration:
- Call
WebhookEventService.findStuckEvents('stripe', {...}, tx)inside a read transaction - For each row: attempt
claimEventForProcessing(id, 5)— proceeds only if claim count = 1 - Reconstruct
Stripe.Eventfrom the storedpayloadJSON - Call
processClaimedEvent(id, event)imported fromroute.ts - Tally
{ processed, succeeded, failed, skipped }(skipped = claim count was 0, another worker got it)
- Call
- Batch limit: 50 per run. Process sequentially within a run — parallelism would amplify the same connection-pool pressure we're fixing. Sequential-with-cap stays well under the 5-min cron interval at observed handler latencies.
- Returns JSON body with the tally.
- Spec coverage: Contracts → Sweeper cron; Invariant 7 (single in-flight processor)
5. Register the cron
- Modify
apps/api/vercel.json - Append to
cronsarray:{ "path": "/api/cron/sweep-stuck-stripe-webhooks", "schedule": "*/5 * * * *" } - Leave
functions["app/api/stripe/webhooks/route.ts"].maxDurationat 30 — claim is sub-second;after()budget of 30s is more than handlers need - Spec coverage: Sweeper cron schedule
6. Tests
- New unit test
apps/api/tests/unit/services/payment/process-claimed-event.unit.test.ts - Success path → row goes to Processed, side effects dispatched
handleEventthrows → row goes to Failed witherrorMessage- Side-effect throw doesn't re-mark the row (per spec Known Gap #3)
- New API test
apps/api/tests/api/cron/sweep-stuck-stripe-webhooks.api.test.ts - 401 without auth
- QStash signature happy path
- Bearer-secret happy path
- Stuck event (past stale threshold) gets re-processed
- Fresh row (created <5min ago, no claim yet) is left alone
- Row past 24h lookback is skipped (spec Invariant 6)
LIMIThonored (coversfindStuckEventsboundary behavior end-to-end)- Existing webhook tests at
apps/api/tests/api/stripe/*.api.test.tscontinue using__testPrismasynchronous bypass; no changes needed. - Manual test: trigger Stripe CLI burst (
stripe trigger invoice.paid× N) against deployed staging; verify reconciliation script (apps/api/scripts/stripe.ts reconcile-webhooks --staging) shows 100% delivered.
7. Spec follow-up
- After merge: update
docs/features/stripe-webhook-ingress/spec.mdKnown Gap #1 to remove the "current implementation does not match this spec" note. Spec and code now agree. - Update File map to drop the (target) annotation on the sweeper cron, and update the "Background processor (target)" entry to point at the exported helper in
route.tsrather than a separate Processor service file.
Order of merge
All of items 1–6 ship as one PR (per Aaron — bundling avoids partial state). Item 7's spec follow-up appends to the impl PR or commits immediately after merge.
Rollback
- Pre-commit hook caught the rogue agent's earlier attempt at this; no risk of silent regression.
- If the impl PR ships and behaves badly:
- The sweeper cron is additive — disable by removing the entry from
vercel.jsoncrons. - The route refactor can be reverted to
route.tsfrom before the PR. Tests will still pass on the old code. webhook_eventsschema is unchanged, so no migration to undo.
Out of scope
connection_limitURL parameter changes (current sizing intentional perc55e88c63)attachDatabasePoolmigration to PrismaPg adapter (separate effort; known Supavisor leak issue needs investigation first)- Per-event-type handler refactoring (move Stripe API calls outside
handleEvent's transaction). Real win but bigger scope; deferred to a follow-up. - Generalizing the ingress spec to cover Acuity (
source='acuity') — same machinery, different per-event semantics. Deferred until an Acuity spec exists for comparison.
Risk register
after()runs longer thanmaxDuration. Bounded by Vercel; if a handler legitimately needs more than 30s, the row stays atprocessed=falseand the sweeper retakes it. No data loss but extra latency. Mitigate by bumpingmaxDurationinvercel.jsonif observed.- Sweeper races itself. Two concurrent sweeper invocations could both target the same row.
FOR UPDATE SKIP LOCKED+ atomicclaimEventForProcessingprevent double-processing. - Side-effect throws after
markAsProcessed. Per spec Known Gap #3, side-effect failures don't trigger event re-dispatch. Captured to Sentry. This is intentional but worth alerting on if frequency rises. - Test mode skipping the new path. The synchronous
__testPrismabypass means the new claim/after split is not exercised by integration tests. Mitigated by the new unit + API tests in item 6, but worth a manual staging burst before relying on May 1 next month.