Everyone ships demos. Localhost is kind — it has no users, no retries, no race conditions, no 3am failures. Production is where your assumptions get stress-tested by reality.
I built and deployed a full hospital management system: FastAPI backend, React web frontend, Flutter mobile app, Supabase PostgreSQL, Razorpay payments, Redis queues, and WhatsApp notifications via Selenium. Here's the post-mortem on what actually broke and how I fixed it.
1. Razorpay webhook idempotency — naive payment handling double-charges users
Razorpay retries any webhook it doesn't get a 2xx for. So does every serious payment processor. If your handler isn't idempotent, a network blip that eats your response means the same event arrives twice and gets processed twice. The user was charged once. Your DB thinks they paid twice. Chaos.
The fix isn't clever code — it's a database constraint.
I added a deduplication_key column on the payment_webhooks table with a unique index. Every webhook payload carries a Razorpay event ID, and that ID becomes the key. On arrival: insert the key, then process; if the insert fails because the key already exists, do nothing. No conditionals, no if already_processed check. The database enforces it.
CREATE UNIQUE INDEX idx_payment_webhooks_dedup
ON payment_webhooks(deduplication_key);
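A handler built on that constraint needs no lookup before processing. A minimal sketch, assuming asyncpg (UniqueViolationError is asyncpg's duplicate-key exception, and apply_payment is a stand-in for the real business logic, not my actual code):

import json
from asyncpg.exceptions import UniqueViolationError

async def handle_webhook(event: dict):
    try:
        # The unique index on deduplication_key rejects repeat events right here
        await db.execute(
            "INSERT INTO payment_webhooks (deduplication_key, payload) VALUES ($1, $2)",
            event["id"], json.dumps(event),
        )
    except UniqueViolationError:
        return {"status": "ok"}  # already seen: ack the retry, touch nothing
    await apply_payment(event)  # hypothetical business logic
    return {"status": "ok"}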
Payment logic should be a pure function of an event ID. If you've seen it before, return 200 and do nothing. That's it. Any other approach is a liability waiting to surface at the worst possible time.
2. WhatsApp automation via Selenium — why the official API isn't always the answer
The Meta Business API requires a verified business, a WABA account, approved message templates, and per-message fees. For a hospital running 40 bookings a day in India, the economics don't work. So I automated WhatsApp Web via Selenium.
Here's what breaks in production: QR code expiry, DOM changes after WhatsApp Web updates, headless Chrome memory leaks, and silent rate limiting from Meta's side if you push volume.
I fixed it with three things:
- Session persistence — Selenium reuses a Chrome user profile that stays logged in. No re-scanning QR codes on every restart.
- Dead-man's switch — a watchdog process that kills and restarts the driver if a successful send hasn't been confirmed in 10 minutes.
- Audit trail — every outbound message is written to a whatsapp_logs table with status (sent, failed, retrying). Hospitals are accountable institutions. You need the record.
The Selenium approach is fragile by nature. You accept that fragility and build resilience around it — you don't pretend it's stable infrastructure.
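Most of that resilience is boring code. The session-persistence piece, for instance, is just pointing Chrome at a dedicated profile directory. A sketch, assuming Selenium 4 and headless Chrome; the profile path is illustrative:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver():
    opts = Options()
    opts.add_argument("--headless=new")
    # Reusing one Chrome profile keeps the WhatsApp Web login alive,
    # so the worker never re-scans the QR code after a restart
    opts.add_argument("--user-data-dir=/var/lib/hms/chrome-profile")  # illustrative path
    driver = webdriver.Chrome(options=opts)
    driver.get("https://web.whatsapp.com")
    return driver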
3. Redis async task queue — why synchronous notification dispatch blocks everything
First version of the booking flow: confirmation saved → Selenium sends WhatsApp → API returns response. Selenium takes 3–8 seconds to open a conversation thread and send a message. Your API is blocking that entire time. Under any load, requests pile up, timeouts compound, and a booking endpoint that should return in 200ms becomes a 9-second operation.
Decoupling is the only fix.
The booking endpoint now does one thing: write a job to a Redis queue, then return 200. A separate background worker consumes the queue, handles Selenium, retries on failure, updates whatsapp_logs. The API doesn't care about notification state.
# Endpoint — fast, non-blocking
redis_client.lpush("notification_queue", json.dumps(job_payload))
return {"status": "confirmed"}

# Worker — slow, isolated, retryable
while True:
    item = redis_client.brpop("notification_queue", timeout=5)
    if item is None:
        continue  # queue was empty for 5s; poll again
    _, payload = item  # brpop returns a (queue_name, value) tuple
    dispatch_whatsapp(json.loads(payload))
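The retry path lives entirely in the worker, where it can afford to be slow. A sketch of its shape (log_status is a stand-in for the whatsapp_logs write, and the attempt count is illustrative):

def dispatch_with_retry(job: dict, max_attempts: int = 3):
    for attempt in range(1, max_attempts + 1):
        try:
            dispatch_whatsapp(job)
            log_status(job, "sent")
            return
        except Exception:
            # Record the attempt; the driver may have died mid-send
            log_status(job, "retrying" if attempt < max_attempts else "failed")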
Response time went from 5–9 seconds to under 300ms. The booking logic was always fast. I was just carrying unnecessary weight in the hot path.
4. Multi-role JWT + Supabase RLS — the mistake most tutorials make
The common tutorial pattern: encode the user's role in the JWT payload, check it in your route handler, gate the endpoint. Looks fine. It's not.
A stale token carries stale permissions. A doctor whose account was suspended can still read patient data until their token expires. In a healthcare system, that's not acceptable.
I encode only a user_id in the JWT. On every request, FastAPI fetches the current role from the database — live, not cached in the token. Supabase Row Level Security then enforces data access at the storage layer, independent of the application layer.
# Dependency — resolves role fresh on every request
async def get_current_user(token: str = Depends(oauth2_scheme)):
    user_id = decode_jwt(token)
    return await db.fetch_one("SELECT * FROM users WHERE id = $1", user_id)
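With the role resolved per request, gating an endpoint is a thin wrapper over that dependency. A sketch, assuming a role column on users and a standard FastAPI app object (require_role is hypothetical, not the actual code):

from fastapi import Depends, HTTPException

def require_role(*allowed: str):
    async def checker(user=Depends(get_current_user)):
        if user["role"] not in allowed:
            raise HTTPException(status_code=403, detail="Insufficient role")
        return user
    return checker

@app.get("/patients/{patient_id}")
async def read_patient(patient_id: int, user=Depends(require_role("doctor", "admin"))):
    ...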
Three independent enforcement layers: JWT for identity, DB lookup for current role, RLS for data access. Any one of them catches what the others miss. That's defense in depth — not just a phrase, an actual architectural decision.
5. Flutter + React + FastAPI — keeping two frontends honest against one database
The failure mode looks subtle at first. Mobile app caches an appointment as confirmed. Web dashboard still shows it as pending. A doctor sees one state. The patient sees another. Support calls happen. Trust erodes.
Two frontends reading the same data isn't inherently a distributed systems problem — until you let each client own its own state. Then it is.
Supabase is the single source of truth. Not an API response, not local state in either client. Both Flutter and React read from the same PostgreSQL database through the same FastAPI layer. Neither client writes directly to Supabase; the only direct connection either one holds is a read-only realtime subscription for live updates.
The rule I enforced: every mutation goes through the API. No client-side optimistic writes that don't have a confirmed server echo. No divergent data-fetching logic between platforms. One schema, one API contract, two consumers.
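In practice that means both clients call the same endpoint and render only what the server echoes back. A sketch, assuming an appointments table and the same db handle used earlier; the names are illustrative:

from pydantic import BaseModel

class StatusUpdate(BaseModel):
    status: str

@app.patch("/appointments/{appointment_id}/status")
async def update_status(appointment_id: int, body: StatusUpdate):
    # One write path for Flutter and React alike; the returned row
    # is the confirmed server echo both clients render from
    return await db.fetch_one(
        "UPDATE appointments SET status = $1 WHERE id = $2 RETURNING *",
        body.status, appointment_id,
    )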
When you have two frontends, consistency isn't a feature you add later. It's a constraint you build into your data layer from the start.
The honest lesson
Production doesn't care about your architecture diagrams.
It cares whether your payment handler is idempotent at 2am when Razorpay fires a retry. It cares whether your notification worker recovers without a manual restart. It cares whether a role change propagates before the next API call hits your route guard.
Demos prove that something can work. Production proves that it does — repeatedly, under pressure, without you watching. That's a completely different engineering problem, and most systems only find out the difference after they've already failed.