Skip to content

ADR-007: Replace Stripe FDW with API-Driven Sync Pipeline

Status

Accepted

Context

The bookkeeping ledger views depend on joining Stripe data across balance_transactions, charges, payment_intents, invoices, refunds, and transfers. These are accessed via a Supabase Stripe Foreign Data Wrapper (FDW) that queries the Stripe API live on every SQL query.

Two problems surfaced:

  1. The FDW is incompatible with Stripe's basil API version. Our Stripe account uses 2025-06-30.basil, which removed charge.invoice, invoice.payment_intent, invoice.charge, and payment_intent.invoice — the canonical fields for linking charges to invoices. The FDW has not been updated for basil (last Stripe-specific code change was March 2025). These columns silently return NULL, forcing us to join through payment_details.order_reference — a field Stripe explicitly truncates to 25 characters for L2/L3 card data and documents as non-canonical.

  2. The FDW is architecturally wrong for analytics. Every query hits the live Stripe API. Materialized view refreshes are slow and subject to rate limits. No local indexes. Dashboard queries depend on external API availability.

The Supabase Stripe Sync Engine (webhook-driven) was evaluated but is missing balance_transactions and transfers — the two most critical objects for bookkeeping.

Decision

Replace the Stripe FDW with a periodic API-driven sync pipeline that writes to local Postgres tables in the stripe schema.

Architecture — three layers:

  • Layer 0 (Sync): Scripts fetch from Stripe API endpoints, store raw responses in local tables. Each table has typed columns for indexed/filterable fields and an attrs jsonb column containing the complete, unaltered API response. No transformation. The sync controls which API version it requests — pre-basil for objects where basil removed fields we need, or basil where the new structure is preferred.
  • Layer 1 (Materialized view): Same as today — joins the stripe.* tables into analytics.bookkeeping_stripe_ledger. The SQL changes minimally because the table shape is the same as the FDW, just with data that actually populates the canonical join fields (charge.invoice, etc.).
  • Layer 2+ (Views): Unchanged.

Sync strategy:

  • Initial backfill: paginate through full history using Stripe list endpoints
  • Incremental: fetch records with created[gte] since last sync timestamp
  • Cadence: daily cron (bookkeeping doesn't need real-time)
  • Objects synced: balance_transactions, charges, payment_intents, invoices, refunds, transfers, payouts

Alternative considered: Pin the FDW to a pre-basil API version (2024-06-20). The FDW supports an api_version server option. This would restore charge.invoice etc. Rejected because it's technical debt with an expiration date — Stripe will eventually drop old API versions, and we'd still have the live-API-on-every-query problem.

Alternative considered: Supabase Stripe Sync Engine. Webhook-driven, near real-time, Postgres-native. Missing balance_transactions and transfers. A hybrid (sync engine + custom scripts for gaps) adds two sync mechanisms with different freshness characteristics. Rejected in favor of one consistent approach.

Alternative considered: Stripe Data Pipeline. First-party, most complete coverage. Does not support Postgres as a destination (Snowflake/Redshift only). Rejected — would require S3 intermediate storage and additional ETL.

Alternative considered: Airbyte. Open-source ETL with Stripe source and Postgres destination. Pinned to API version 2022-11-15 — no basil support, no invoice_payments. Heavy infrastructure for what we need. Rejected.

Consequences

Benefits:

  • Canonical join fields (charge.invoice, invoice.payment_intent) work because we control the API version per request
  • Local tables with proper indexes — dashboards don't depend on external API availability
  • Full raw Stripe response stored in attrs — never locked out of fields when the API evolves
  • Matview refresh queries local data — fast, no rate limits
  • Simple infrastructure — scripts + cron, no new services to deploy
  • Clean layer separation: Layer 0 is pure ingestion, no opinion

Tradeoffs:

  • We build and maintain sync code (pagination, incremental tracking, error handling)
  • Data freshness is daily instead of live — acceptable for bookkeeping, would not be acceptable for real-time application logic
  • Initial backfill may take time for accounts with large history
  • Storage cost for local copies of Stripe data (negligible at our scale)

Risks:

  • Stripe API rate limits during backfill — mitigated by respecting rate limit headers and backing off
  • Schema drift if Stripe changes response shape — mitigated by storing full attrs jsonb (typed columns are a convenience layer, not the source of truth)
  • If we ever need real-time Stripe data for application logic (not just analytics), this pipeline doesn't solve that — we'd still need webhooks or the sync engine for that use case