From LLM Demo to Healthcare AI Agent: What Developers Need to Build Around the Model
Building an AI agent demo is easy.
Building a healthcare AI agent that can survive production is a different problem.
A simple prototype might only need:
- A chat UI
- An LLM API
- A prompt
- A response stream
- Maybe a basic database
That is enough to show the concept.
But if the system touches healthcare workflows, patient information, clinical documentation, scheduling, billing, intake, insurance, or EHR data, the architecture changes completely.
At that point, the model is no longer the product. The system around the model becomes the product.
This post breaks down the layers developers should think about before turning an LLM prototype into a healthcare AI agent.
Disclaimer: This is a technical architecture overview, not legal advice. Healthcare products that handle PHI should go through proper compliance, security, and legal review.
1. Start with the data flow, not the model
Most teams start with this question:
Which model should we use?
For healthcare AI, a better first question is:
What sensitive data enters the system, where does it go, and who can access it?
Before writing production code, map the full data flow:
User input
-> API gateway
-> authentication / authorization
-> PHI filtering or classification
-> retrieval layer
-> prompt construction
-> model call
-> response validation
-> audit logging
-> human review
-> downstream system or EHR integration
If protected health information enters the workflow, it may appear in more places than expected:
- Request payloads
- Prompts
- Model responses
- Logs
- Traces
- Embeddings
- Vector databases
- Monitoring tools
- Support tickets
- Analytics dashboards
- Notification systems
- EHR integration payloads
A secure database does not help much if PHI leaks into logs or third-party monitoring tools.
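One way to reduce that risk is to redact at the logging layer itself, so application code cannot forget. The sketch below uses Python's standard `logging.Filter`. The patterns and the `PhiRedactingFilter` name are illustrative assumptions, not a complete PHI detector; real detection needs much broader coverage (names, addresses, dates, free-text clinical details).

```python
import logging
import re

# Illustrative patterns only. These catch a few obvious identifiers and
# are NOT sufficient for real PHI detection.
PHI_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE), "[REDACTED_MRN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[REDACTED_EMAIL]"),
]


def redact(text: str) -> str:
    """Replace every match of the patterns above with a placeholder."""
    for pattern, replacement in PHI_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


class PhiRedactingFilter(logging.Filter):
    """Redacts the fully formatted message before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as logging arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record, now redacted


logger = logging.getLogger("agent")
logger.addFilter(PhiRedactingFilter())
```

A filter like this is a safety net, not a substitute for keeping PHI out of log statements in the first place. Traces, metrics labels, and third-party SDKs need their own checks.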
2. Treat PHI as a boundary problem
A useful way to think about healthcare AI architecture is to draw a PHI boundary.
Ask:
Where can PHI enter?
Where can PHI be stored?
Where can PHI be transformed?
Where can PHI leave the system?
Which vendors touch it?
Which users can view it?
Which logs may contain it?
Then design controls around those boundaries.
For example:
Patient message contains PHI
-> Classify input
-> Remove PHI from non-essential logs
-> Restrict access by role
-> Store encrypted
-> Send only allowed fields to model/vendor
-> Record audit event
This sounds like extra work, but it prevents expensive rework later. The worst time to discover your logs contain PHI is after the system is live.
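The "send only allowed fields" step can be sketched as an allow-list applied right before the model call. The workflow names, field names, and `build_model_payload` helper below are hypothetical; the point is that the default for anything unlisted is to send nothing.

```python
from typing import Any

# Hypothetical allow-list: the fields each workflow may send to the model vendor.
ALLOWED_MODEL_FIELDS: dict[str, set[str]] = {
    "patient_intake_summary": {"chief_complaint", "symptom_duration", "preferred_language"},
    "appointment_scheduling": {"preferred_language", "requested_time_window"},
}


def build_model_payload(workflow: str, record: dict[str, Any]) -> dict[str, Any]:
    """Return only the fields this workflow is allowed to send out.

    Unknown workflows get an empty allow-list, so nothing leaves the boundary.
    """
    allowed = ALLOWED_MODEL_FIELDS.get(workflow, set())
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "patient_name": "Jane Doe",
    "date_of_birth": "1980-01-01",
    "preferred_language": "es",
    "requested_time_window": "weekday mornings",
}
payload = build_model_payload("appointment_scheduling", record)
# payload holds only preferred_language and requested_time_window
```

An allow-list fails closed: a new field added to the record stays inside the boundary until someone deliberately permits it.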
3. Add authorization before retrieval, not after generation
A common mistake in RAG-based healthcare systems is retrieving first and filtering later.
That can create accidental exposure.
Bad pattern:
User asks question
-> Retrieve all relevant documents
-> Send retrieved context to model
-> Filter response
Better pattern:
User asks question
-> Identify user role and permissions
-> Retrieve only allowed documents
-> Build prompt from permitted context
-> Generate response
-> Validate output
-> Log source references
RAG in healthcare is not just about retrieval quality. It is about permissioned retrieval. A patient, physician, billing staff member, front-desk user, and admin should not automatically retrieve from the same knowledge base.
You may need separate indexes, metadata filters, tenant boundaries, document-level permissions, or access-control checks before retrieval.
Example retrieval filter:
{
  "tenant_id": "clinic_123",
  "user_role": "billing_staff",
  "allowed_document_types": ["billing_policy", "insurance_workflow"],
  "excluded_document_types": ["clinical_note", "diagnosis_summary"]
}
The exact implementation depends on your stack, but the principle is the same:
Do not give the model context the user should not have.
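A minimal sketch of the better pattern, assuming an in-memory document list and a hypothetical role-to-document-type mapping. The keyword scoring is a stand-in for whatever vector or hybrid search you actually use; what matters is that the permission filter runs before ranking, so disallowed documents never become candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    tenant_id: str
    doc_type: str
    text: str


# Hypothetical mapping; a real system would load this from its access-control service.
ROLE_ALLOWED_TYPES = {
    "billing_staff": {"billing_policy", "insurance_workflow"},
    "physician": {"clinical_note", "diagnosis_summary", "billing_policy"},
}


def permitted_documents(docs, tenant_id, user_role):
    """Apply tenant and role filters BEFORE any relevance scoring."""
    allowed = ROLE_ALLOWED_TYPES.get(user_role, set())  # unknown role: nothing allowed
    return [d for d in docs if d.tenant_id == tenant_id and d.doc_type in allowed]


def retrieve(query, docs, tenant_id, user_role, top_k=3):
    candidates = permitted_documents(docs, tenant_id, user_role)
    # Placeholder relevance: number of query terms found in the text.
    terms = query.lower().split()
    scored = [(sum(t in d.text.lower() for t in terms), d) for d in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [d for score, d in ranked if score > 0][:top_k]


docs = [
    Document("d1", "clinic_123", "billing_policy", "Copay is collected at check-in."),
    Document("d2", "clinic_123", "clinical_note", "Patient asked about copay and reported chest pain."),
    Document("d3", "clinic_999", "billing_policy", "Copay rules for another tenant."),
]
results = retrieve("copay rules", docs, "clinic_123", "billing_staff")
```

Here the billing user only ever sees `d1`: the clinical note and the other tenant's policy are removed before the query is scored at all.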
4. Audit logs are not optional in serious workflows
In a normal chatbot, logs are mostly for debugging.
In healthcare AI, logs are part of accountability.
You may need to answer questions like:
- Who used the AI agent?
- What did they ask?
- What data sources were retrieved?
- What did the model return?
- Was the output edited?
- Who approved it?
- Was anything sent to another system?
- What happened when the model failed?
A basic audit event might look like this:
{
  "event_type": "ai_agent_response_generated",
  "timestamp": "2026-07-02T14:25:00Z",
  "user_id": "user_789",
  "tenant_id": "clinic_123",
  "user_role": "care_coordinator",
  "workflow": "patient_intake_summary",
  "model": "llm-provider-model",
  "retrieved_sources": [
    "intake_form_456",
    "clinic_policy_112"
  ],
  "phi_in_prompt": true,
  "human_review_required": true,
  "status": "pending_review"
}
The goal is not to store unnecessary sensitive data. The goal is to create enough traceability to understand what happened later.
Audit logs should be designed intentionally. Do not just dump full prompts and responses into application logs without thinking through PHI exposure.
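One possible shape for this, as a sketch: build the audit event from identifiers and metadata, and record a digest of the prompt instead of its text. The `build_audit_event` helper, the `prompt_sha256` field, and the `"completed"` status are assumptions added for illustration; the other fields follow the example event above.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_event(*, user_id, tenant_id, user_role, workflow, model,
                      prompt, retrieved_sources, phi_in_prompt,
                      human_review_required):
    """Create an audit record that references the interaction without copying it."""
    return {
        "event_type": "ai_agent_response_generated",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tenant_id": tenant_id,
        "user_role": user_role,
        "workflow": workflow,
        "model": model,
        # Digest only: lets you match this event against a separately stored,
        # access-controlled copy of the prompt if one exists.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "retrieved_sources": list(retrieved_sources),  # IDs, not document text
        "phi_in_prompt": phi_in_prompt,
        "human_review_required": human_review_required,
        "status": "pending_review" if human_review_required else "completed",
    }


event = build_audit_event(
    user_id="user_789",
    tenant_id="clinic_123",
    user_role="care_coordinator",
    workflow="patient_intake_summary",
    model="llm-provider-model",
    prompt="Summarize the intake form for Jane Doe.",
    retrieved_sources=["intake_form_456", "clinic_policy_112"],
    phi_in_prompt=True,
    human_review_required=True,
)
serialized = json.dumps(event)
```

Whether a digest is appropriate depends on your retention and compliance requirements. Short or predictable prompts can be guessed from a hash, so treat this as a starting point for review, not a guarantee.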
5. Human review should be part of the workflow
Developers often think of human review as a product feature.
In healthcare AI, it is also a risk-control layer. For low-risk administrative tasks, the AI may be allowed to suggest or draft. For higher-risk workflows, it may need approval before anything is sent, stored, or acted on.
A simple workflow pattern:
AI generates draft
-> confidence / risk check
-> human review required?
   -> yes: send to review queue -> reviewer edits or approves
   -> no: allow next workflow step
-> final action logged
Examples where human review may be needed:
- Clinical summaries
- Medical documentation
- Prior authorization support
- Patient-facing medical guidance
- Billing or coding recommendations
- Anything written back to an EHR
Even when the AI output is useful, the system should make it clear when a human is still accountable.
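The routing decision in that workflow can be small. The sketch below assumes a hypothetical `risk_score` produced by an upstream check and a hypothetical set of workflow names that always require approval; both would come from your own risk assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REVIEW_QUEUE = "review_queue"
    NEXT_STEP = "next_step"


# Hypothetical workflow names that always need a human, regardless of score.
ALWAYS_REVIEW = {
    "clinical_summary",
    "medical_documentation",
    "prior_authorization",
    "patient_guidance",
    "billing_coding",
    "ehr_write_back",
}


@dataclass
class Draft:
    workflow: str
    text: str
    risk_score: float  # 0.0 (low) to 1.0 (high), from an assumed upstream check


def route_draft(draft: Draft, risk_threshold: float = 0.3) -> Route:
    """Send the draft to review if the workflow or the score demands it."""
    if draft.workflow in ALWAYS_REVIEW or draft.risk_score >= risk_threshold:
        return Route.REVIEW_QUEUE
    return Route.NEXT_STEP


reminder = Draft("appointment_reminder", "Your visit is tomorrow at 9am.", 0.05)
summary = Draft("clinical_summary", "Patient reports improvement.", 0.05)
```

With these inputs, the reminder proceeds to the next step and the clinical summary goes to the review queue even though its score is low. Keeping the workflow rule separate from the score means a miscalibrated risk check cannot silently bypass review for high-risk categories.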
6. EHR/FHIR integration changes the difficulty level
A standalone AI assistant is one project. An AI agent connected to EHR data is another.
Once you integrate with clinical or administrative systems, you need to think about:
- Authentication
- Patient matching
- Field-level permissions
- Read vs write access
- FHIR resource mapping
- Rate limits
- Failure handling
- Data sync timing
- Duplicate records
- Auditability
- Rollback or correction workflows
A basic architecture might look like:
AI agent
-> Backend service
-> Integration service
-> FHIR API / EHR connector
-> Audit log
-> Review queue
The integration service should not be an afterthought. It should enforce permissions, log events, validate payloads, and isolate external system complexity from the AI layer.
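As a rough sketch of those responsibilities, the function below validates a payload, checks a write permission, and logs the outcome before and after calling the connector. The permission table, `IntegrationError`, and the injected `send` callable are hypothetical; the only FHIR detail assumed is that a resource is a JSON object with a `resourceType` field.

```python
from typing import Any, Callable


class IntegrationError(Exception):
    """Raised when a request must not reach the EHR connector."""


# Hypothetical table: role -> FHIR resource type -> allowed operations.
PERMISSIONS: dict[str, dict[str, set[str]]] = {
    "front_desk": {"Appointment": {"read", "write"}, "Patient": {"read"}},
    "billing_staff": {"Coverage": {"read"}, "Claim": {"read", "write"}},
}


def authorize(user_role: str, resource_type: str, operation: str) -> None:
    allowed = PERMISSIONS.get(user_role, {}).get(resource_type, set())
    if operation not in allowed:
        raise IntegrationError(f"{user_role} may not {operation} {resource_type}")


def validate_resource(resource: Any) -> str:
    """Minimal shape check; real validation would use the full resource schema."""
    if not isinstance(resource, dict):
        raise IntegrationError("payload must be a JSON object")
    resource_type = resource.get("resourceType")
    if not isinstance(resource_type, str) or not resource_type:
        raise IntegrationError("payload is missing resourceType")
    return resource_type


def submit_write(user_role: str, resource: Any,
                 send: Callable[[dict], Any], audit_log: list) -> Any:
    """Validate, authorize, call the connector, and record the outcome."""
    resource_type = validate_resource(resource)
    authorize(user_role, resource_type, "write")
    entry = {"user_role": user_role, "resource_type": resource_type, "operation": "write"}
    try:
        result = send(resource)  # `send` wraps the real FHIR/EHR client
    except Exception:
        audit_log.append({**entry, "outcome": "failed"})
        raise
    audit_log.append({**entry, "outcome": "sent"})
    return result
```

Because the AI layer only ever calls `submit_write`, it never holds EHR credentials or decides on its own what it may change.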
7. Monitoring needs to cover more than uptime
Production AI monitoring is not just server monitoring.
For healthcare AI agents, you may need to monitor:
- Latency
- Token usage
- Failed model calls
- Retrieval quality
- Hallucination reports
- Unsafe outputs
- PHI leakage risk
- User overrides
- Reviewer rejection rate
- Source citation quality
- Drift in responses over time
For example, if reviewers frequently edit or reject AI-generated summaries, that is an important signal.
It may mean:
- Prompts need improvement
- Source documents are poor
- Retrieval is weak
- User expectations are unclear
- The workflow is too risky for automation
AI monitoring should connect technical metrics with workflow outcomes.
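For instance, reviewer outcomes can be turned into a simple workflow metric. This sketch assumes each review is recorded as one of three hypothetical labels ("approved", "edited", "rejected"), and the alert thresholds are placeholders to tune per workflow.

```python
from collections import Counter

OUTCOMES = ("approved", "edited", "rejected")


def review_outcome_rates(outcomes):
    """Return the share of each reviewer outcome in the given window."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in OUTCOMES}
    return {name: counts[name] / total for name in OUTCOMES}


def needs_attention(rates, max_rejected=0.2, max_edited=0.5):
    """Flag a workflow whose drafts are rejected or rewritten too often."""
    return rates["rejected"] > max_rejected or rates["edited"] > max_edited


window = ["approved", "approved", "edited", "rejected"]
rates = review_outcome_rates(window)
flagged = needs_attention(rates)
```

In this window, a quarter of drafts were rejected, which exceeds the placeholder threshold and flags the workflow. Tracking the rate per workflow and per prompt version makes it easier to tell whether a change helped.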
8. Do not estimate cost from the model layer alone
A common early estimate looks like this:
Frontend: small
Backend: small
LLM API: manageable
Prompting: manageable
Then production requirements appear:
RBAC
MFA
audit logs
PHI-safe logging
RAG permissioning
vendor review
BAA planning
EHR/FHIR integration
human review workflows
monitoring
security testing
compliance documentation
cloud infrastructure
incident response planning
That is where the real cost starts.
The model may be the visible part, but the control layers usually determine whether the product can be launched in a healthcare environment.
9. A practical developer checklist
Before building a healthcare AI agent, answer these questions:
Data
- What data enters the system?
- Does it include PHI?
- Where is it stored?
- Is it encrypted?
- How long is it retained?
Model
- What data is sent to the model?
- Are prompts logged?
- Are responses logged?
- Is the vendor allowed to process the data?
- Is a BAA required?
Retrieval
- What sources can the AI retrieve from?
- Is retrieval filtered by role?
- Are source citations preserved?
- Can users access data they should not?
Access control
- Who can use the agent?
- What roles exist?
- What can each role see or do?
- Is access reviewed periodically?
Auditability
- Can you reconstruct what happened?
- Are actions tied to users?
- Are AI-generated outputs traceable?
- Are human edits tracked?
Human review
- Which outputs require approval?
- Who approves them?
- What happens when output is rejected?
- Is the final decision logged?
Integration
- What systems does the agent connect to?
- Does it read only or write back?
- How are failures handled?
- Are integration events logged?
Monitoring
- What model behavior is monitored?
- How are unsafe outputs reported?
- How is drift detected?
- Who owns the ongoing review?
Final thought
A healthcare AI agent is not just an LLM with a medical prompt. It is a secure workflow system around a model.
The real engineering work is often in the parts users do not see:
- Access control
- Auditability
- Data boundaries
- Retrieval permissions
- Monitoring
- Human review
- Integration reliability
- Compliance-ready architecture
That is why the cost of healthcare AI development is usually not just the cost of model integration. It is the cost of building the system that makes the model usable in a regulated environment.
I also wrote a deeper cost breakdown covering HIPAA-compliant AI agents, RAG architecture, EHR/FHIR integration, infrastructure, compliance controls, hidden costs, and build-vs-buy planning.













