Everyone ships demos. Localhost is kind — it has no users, no retries, no race conditions, no 3am failures. Production is where your assumptions get stress-tested by reality.
I built and deployed a full hospital management system: FastAPI backend, React web frontend, Flutter mobile app, Supabase PostgreSQL, Razorpay payments, Redis queues, and WhatsApp notifications via Selenium. Here's the post-mortem on what actually broke and how I fixed it.
1. Razorpay webhook idempotency — naive payment handling double-charges users
Razorpay retries any webhook it doesn't get a 2xx for. So does every serious payment processor. If your handler isn't idempotent, a network blip that eats your response means the same event arrives twice and gets processed twice. The user was charged once. Your DB thinks they paid twice. Chaos.
The fix isn't clever code — it's a database constraint.
I added a deduplication_key column on the payment_webhooks table with a unique index. Every webhook payload carries a Razorpay event ID, and that ID becomes the key. On arrival: insert the key, then process; if the insert fails because the key already exists, do nothing. No conditionals, no if already_processed check. The database enforces it.
CREATE UNIQUE INDEX idx_payment_webhooks_dedup
ON payment_webhooks(deduplication_key);
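A handler built on that constraint needs no lookup before processing. A minimal sketch, assuming asyncpg (UniqueViolationError is asyncpg's duplicate-key exception, and apply_payment is a stand-in for the real business logic, not my actual code):

import json
from asyncpg.exceptions import UniqueViolationError

async def handle_webhook(event: dict):
    try:
        # The unique index on deduplication_key rejects repeat events right here
        await db.execute(
            "INSERT INTO payment_webhooks (deduplication_key, payload) VALUES ($1, $2)",
            event["id"], json.dumps(event),
        )
    except UniqueViolationError:
        return {"status": "ok"}  # already seen: ack the retry, touch nothing
    await apply_payment(event)  # hypothetical business logic
    return {"status": "ok"}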
Payment logic should be a pure function of an event ID. If you've seen it before, return 200 and do nothing. That's it. Any other approach is a liability waiting to surface at the worst possible time.
2. WhatsApp automation via Selenium — why the official API isn't always the answer
The Meta Business API requires a verified business, a WABA account, approved message templates, and per-message fees. For a hospital running 40 bookings a day in India, the economics don't work. So I automated WhatsApp Web via Selenium.
Here's what breaks in production: QR code expiry, DOM changes after WhatsApp Web updates, headless Chrome memory leaks, and silent rate limiting from Meta's side if you push volume.
I fixed it with three things:
- Session persistence — Selenium reuses a Chrome user profile that stays logged in. No re-scanning QR codes on every restart.
- Dead-man's switch — a watchdog process that kills and restarts the driver if a successful send hasn't been confirmed in 10 minutes.
- Audit trail — every outbound message is written to a whatsapp_logs table with status (sent, failed, retrying). Hospitals are accountable institutions. You need the record.
The Selenium approach is fragile by nature. You accept that fragility and build resilience around it — you don't pretend it's stable infrastructure.
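Most of that resilience is boring code. The session-persistence piece, for instance, is just pointing Chrome at a dedicated profile directory. A sketch, assuming Selenium 4 and headless Chrome; the profile path is illustrative:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver():
    opts = Options()
    opts.add_argument("--headless=new")
    # Reusing one Chrome profile keeps the WhatsApp Web login alive,
    # so the worker never re-scans the QR code after a restart
    opts.add_argument("--user-data-dir=/var/lib/hms/chrome-profile")  # illustrative path
    driver = webdriver.Chrome(options=opts)
    driver.get("https://web.whatsapp.com")
    return driver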
3. Redis async task queue — why synchronous notification dispatch blocks everything
First version of the booking flow: confirmation saved → Selenium sends WhatsApp → API returns response. Selenium takes 3–8 seconds to open a conversation thread and send a message. Your API is blocking that entire time. Under any load, requests pile up, timeouts compound, and a booking endpoint that should return in 200ms becomes a 9-second operation.
Decoupling is the only fix.
The booking endpoint now does one thing: write a job to a Redis queue, then return 200. A separate background worker consumes the queue, handles Selenium, retries on failure, updates whatsapp_logs. The API doesn't care about notification state.
# Endpoint — fast, non-blocking
redis_client.lpush("notification_queue", json.dumps(job_payload))
return {"status": "confirmed"}

# Worker — slow, isolated, retryable
while True:
    item = redis_client.brpop("notification_queue", timeout=5)
    if item is None:
        continue  # queue was empty for 5s; poll again
    _, payload = item  # brpop returns a (queue_name, value) tuple
    dispatch_whatsapp(json.loads(payload))
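The retry path lives entirely in the worker, where it can afford to be slow. A sketch of its shape (log_status is a stand-in for the whatsapp_logs write, and the attempt count is illustrative):

def dispatch_with_retry(job: dict, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            dispatch_whatsapp(job)
            log_status(job, "sent")
            return
        except Exception:
            # Record the attempt; the driver may have died mid-send
            log_status(job, "retrying" if attempt < max_attempts else "failed")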
Response time went from 5–9 seconds to under 300ms. The booking logic was always fast. I was just carrying unnecessary weight in the hot path.
4. Multi-role JWT + Supabase RLS — the mistake most tutorials make
The common tutorial pattern: encode the user's role in the JWT payload, check it in your route handler, gate the endpoint. Looks fine. It's not.
A stale token carries stale permissions. A doctor whose account was suspended can still read patient data until their token expires. In a healthcare system, that's not acceptable.
I encode only a user_id in the JWT. On every request, FastAPI fetches the current role from the database — live, not cached in the token. Supabase Row Level Security then enforces data access at the storage layer, independent of the application layer.
# Dependency — resolves role fresh on every request
async def get_current_user(token: str = Depends(oauth2_scheme)):
    user_id = decode_jwt(token)
    return await db.fetch_one("SELECT * FROM users WHERE id = $1", user_id)
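With the role resolved per request, gating an endpoint is a thin wrapper over that dependency. A sketch, assuming a role column on users and a standard FastAPI app object (require_role is hypothetical, not the actual code):

from fastapi import Depends, HTTPException

def require_role(*allowed: str):
    async def checker(user=Depends(get_current_user)):
        if user["role"] not in allowed:
            raise HTTPException(status_code=403, detail="Insufficient role")
        return user
    return checker

@app.get("/patients/{patient_id}")
async def read_patient(patient_id: int, user=Depends(require_role("doctor", "admin"))):
    ...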
Three independent enforcement layers: JWT for identity, DB lookup for current role, RLS for data access. Any one of them catches what the others miss. That's defense in depth — not just a phrase, an actual architectural decision.
5. Flutter + React + FastAPI — keeping two frontends honest against one database
The failure mode looks subtle at first. Mobile app caches an appointment as confirmed. Web dashboard still shows it as pending. A doctor sees one state. The patient sees another. Support calls happen. Trust erodes.
Two frontends reading the same data isn't inherently a distributed systems problem — until you let each client own its own state. Then it is.
Supabase is the single source of truth. Not an API response, not local state in either client. Both Flutter and React read from the same PostgreSQL database through the same FastAPI layer. Neither client writes directly to Supabase; the only direct connection either one holds is a read-only realtime subscription for live updates.
The rule I enforced: every mutation goes through the API. No client-side optimistic writes that don't have a confirmed server echo. No divergent data-fetching logic between platforms. One schema, one API contract, two consumers.
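In practice that means both clients call the same endpoint and render only what the server echoes back. A sketch, assuming an appointments table and the same db handle used earlier; the names are illustrative:

from pydantic import BaseModel

class StatusUpdate(BaseModel):
    status: str

@app.patch("/appointments/{appointment_id}/status")
async def update_status(appointment_id: int, body: StatusUpdate):
    # One write path for Flutter and React alike; the returned row
    # is the confirmed server echo both clients render from
    return await db.fetch_one(
        "UPDATE appointments SET status = $1 WHERE id = $2 RETURNING *",
        body.status, appointment_id,
    )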
When you have two frontends, consistency isn't a feature you add later. It's a constraint you build into your data layer from the start.
The honest lesson
Production doesn't care about your architecture diagrams.
It cares whether your payment handler is idempotent at 2am when Razorpay fires a retry. It cares whether your notification worker recovers without a manual restart. It cares whether a role change propagates before the next API call hits your route guard.
Demos prove that something can work. Production proves that it does — repeatedly, under pressure, without you watching. That's a completely different engineering problem, and most systems only find out the difference after they've already failed.