Skip to content

Stripe Webhook Ingress — Hardening Plan

Status: done

Implements stripe-webhook-ingress.md Invariants 11–13 plus two non-spec follow-ups (handler-internal architecture refactor, legacy test bypass migration). Closes the hardening divergence noted in Known Gap #1.

Goal

Add safety, performance, and operational hygiene improvements identified during PR #668 review. None changes the external contract; all reduce noise / latency / risk.

Status

  • [x] Item 1 — Parallel dispatchSideEffects (Invariant 13)
  • [x] Item 2 — MAX_RETRY_ATTEMPTS counter on webhook_events rows (Invariant 11)
  • [x] Item 3 — Lease-aware markAsProcessed (Invariant 12)
  • [x] Item 5 — Move Stripe API calls outside handler transactions (no spec change; perf only)
  • [x] Item 6 — Migrate legacy webhook tests off the __testPrisma bypass (no spec change; coverage only)

(Numbering 1, 2, 3, 5, 6 preserved from the post-merge follow-up review for traceability — there is no Item 4 in this plan.)

Driving citations

Work breakdown

1. Parallel dispatchSideEffects

  • Modify apps/api/src/app/api/stripe/webhooks/route.ts:dispatchSideEffects(result) (current sequential await chain over creationSideEffects, activationSideEffects, paymentFailedNotification, paymentReceiptNotification, accountSetupEmail).
  • Convert to Promise.allSettled([...]) over the independent dispatch promises. Each branch keeps its existing .catch(captureError(...)) pattern so a single side-effect failure doesn't poison the others.
  • One concrete dependency to verify: the accountSetupEmail block awaits a supabaseAdmin.auth.admin.generateLink call before sending the email. That internal sequencing stays as-is; only the cross-category awaits become parallel.
  • Test coverage: add a test in apps/api/tests/integration/api/stripe/webhooks/async-flow.integration.test.ts (or a new co-located test) that mocks the side-effect services to record call ordering. Assert all are invoked, ordering is non-sequential (no strict before/after between independent categories), and a throw in one doesn't suppress others.
  • Spec coverage: Invariant 13.

2. MAX_RETRY_ATTEMPTS counter

  • Schema change: add retry_count INT DEFAULT 0 to webhook_events (Prisma migration). Per db-migration skill — use migrate diff, not migrate dev.
  • apps/api/src/services/shared/WebhookEventService.ts:
  • markAsFailed: increment retry_count by 1 in the same UPDATE.
  • findStuckEvents: extend the query with AND retry_count < ${MAX_RETRY_ATTEMPTS} so capped rows drop off the sweeper's view.
  • countExpiredEvents: extend to also count retry_count >= MAX_RETRY_ATTEMPTS rows so the cliff alert covers them too.
  • apps/api/src/app/api/cron/sweep-stuck-stripe-webhooks/route.ts:
  • Add MAX_RETRY_ATTEMPTS constant. Recommend 5 (≈ 25 min of retries before giving up). Include in the audit-context tag for traceability.
  • Sentry message wording on cliff alert should distinguish "past 24h" from "exceeded retry cap" (extra: reason: 'cliff' | 'retry-cap').
  • Test coverage: extend apps/api/tests/services/shared/WebhookEventService.unit.test.ts (delta-based assertions per the existing pattern) and apps/api/tests/api/cron/sweep-stuck-stripe-webhooks.api.test.ts.
  • Spec coverage: Invariant 11; Sweeper cron contract.

3. Lease-aware markAsProcessed

  • Modify apps/api/src/services/shared/WebhookEventService.ts:markAsProcessed:
  • Add a required expectedProcessingStartedAt: Date parameter.
  • Use updateMany with WHERE id = $id AND processed = false AND processing_started_at = $expectedProcessingStartedAt. Returns affected count.
  • If count = 0, the lease was reclaimed by another worker — markAsProcessed MUST throw a typed error (e.g., LeaseLostError). Caller decides whether to log + skip or retry.
  • Modify apps/api/src/app/api/stripe/webhooks/route.ts:processClaimedEvent:
  • Capture processingStartedAt from the claim outcome.
  • Pass it to markAsProcessed. Catch LeaseLostError → log to Sentry as warning (not error), do not crash the after() callback.
  • Apply the same pattern to the sweeper's call path so both producers (live + sweeper) are lease-aware.
  • Test coverage: new unit test cases that simulate a stolen lease and assert markAsProcessed is the no-op.
  • Spec coverage: Invariant 12.

5. Move Stripe API calls outside handler transactions

This item has the largest surface area and is scoped iteratively. Pick one handler per loop iteration, in priority order. Highest-priority handlers are those firing on May 1 billing day.

Priority order (do top-down, one per iteration):

  • [x] 5a. invoice.paidhandleSubscriptionCreate (SubscriptionCreateHandler.ts:48, subscriptions.retrieve)
  • [x] 5b. customer.subscription.updatedhandleSubscriptionUpdated + syncLockoutPriceForSub (StripeWebhookService.ts:737, 1348, 1366 — three Stripe round-trips inside one tx)
  • [x] 5c. payment_intent.succeededhandlePaymentIntentSucceeded (PaymentIntentHandler.ts:360, charges.retrieve)
  • [x] 5d. invoice.payment_succeededhandleInvoicePaymentSucceeded (StripeWebhookService.ts:317, invoices.retrieve)
  • [x] 5e. charge.succeeded / charge.updatedhandleChargeUpdated (StripeWebhookService.ts:1453, 1459, 1464, 1470 — multiple Stripe calls)
  • [ ] 5f. Remaining handlers — audit and refactor as a final sweep, or leave if low-traffic and Stripe-call-free.

For each handler:

  • Extract the Stripe-data fetches into a "prepare phase" that runs outside withTransaction. Returns a typed bundle of pre-fetched Stripe objects.
  • The transaction now opens with that bundle in scope and only does DB reads + writes. Connection-hold time drops to milliseconds for typical cases.
  • Some handlers conditionally call Stripe based on DB state — those need a "fetch DB state first (read tx, fast), then fetch Stripe (no tx), then write tx" three-phase split. Document that pattern in the first refactored handler and reuse.
  • Risk: handler logic changes are easy to subtly break. Each item ships with its existing behavioral test green, plus at least one new test asserting Stripe client is invoked outside the tx (mock the Prisma $transaction and assert call ordering).

6. Migrate legacy webhook tests off __testPrisma

  • Identify tests under apps/api/tests/api/stripe/*.api.test.ts and apps/api/tests/integration/stripe/ that use createTestRequest (which injects __testPrisma) or the older bypass paths.
  • For each: switch to the integration-test pattern from apps/api/tests/integration/api/stripe/webhooks/async-flow.integration.test.ts — construct request via new Request(), mock only StripeWebhookService.handleEvent per-test, exercise the production code path, assert via DB reads.
  • Migrate opportunistically, not in one pass. Each migration is one or two tests at a time. Loop iterations may convert 3–10 tests per invocation.
  • Stop migrating when test count via the bypass drops below ~50 (long-tail tests with awkward shapes can stay until they're naturally edited).

Order of merge

All five items ship as one PR (matching the prior PR's discipline — single coherent unit). Each loop iteration handles one checkbox; the PR grows incrementally as iterations complete.

Item 5 is iterative within itself (5a–5f). Each sub-item is one loop iteration. The plan considers Item 5 done when 5a–5e are complete; 5f is best-effort.

Rollback

  • Schema migration (Item 2) is additive (retry_count INT DEFAULT 0) — no rollback risk.
  • Items 1, 3, 5 are pure code changes, revertable per file.
  • Item 6 changes test files only — never affects production behavior.

Out of scope

  • The attachDatabasePool migration (still blocked on Supavisor leak issue: Supabase Discussion #40671).
  • Side-effect failure recovery during third-party outages (Twilio/Resend) — Known Gap #3, separate design effort, needs its own spec/plan.
  • Generalizing the ingress spec to cover Acuity (Known Gap #5) — wait for an Acuity-specific reason.
  • Per-event-type domain specs (Known Gap #2) — Item 5 here is a refactor for perf, not a per-handler spec.

Risk register

  • Item 2 schema migration applied out-of-order across long-running branches — per memory note on migration safety. Use migrate diff, deploy promptly.
  • Item 3 LeaseLostError in production = silent skipping of work — must Sentry-warn so we notice if it fires often (would indicate a pattern bug).
  • Item 5 handler refactor regressing handler behavior — biggest risk. Each sub-item must keep its existing tests green; new test asserts call ordering.
  • Item 6 mass-migration creates merge conflicts — keep migrations small (1–3 tests per iteration).