Large language models remain confident even when they are incorrect. Ask a general-purpose model about an internal leave policy, a recent set of financial figures, or a runbook for a service that did not exist at training time, and the response will sound authoritative while frequently being wrong. That gap is the primary reason many internal AI assistants never move beyond a prototype.
Retrieval-Augmented Generation, commonly referred to as RAG, addresses this directly. Rather than relying on the model to have memorised your data, the relevant portions of your own content are retrieved at query time and supplied to the model as context. The model then answers from the material provided and can cite the source of each fact. The outcome is an assistant that is grounded, current, and considerably easier to trust.
Having implemented several of these systems, I have found that the foundation model is rarely the bottleneck. The surrounding infrastructure is: the vector store to provision, the embeddings to manage, the chunking strategy to tune, and the connectors to maintain. Amazon Bedrock has steadily reduced this burden, and on 17 June 2026 the new Managed Knowledge Base became generally available, removing most of it. This is therefore a suitable moment to review the complete build as it would be deployed in practice.
What we are building
The objective is a chatbot that accepts a question over HTTPS, retrieves the most relevant passages from a private document set, asks a foundation model to answer using only those passages, and returns the response with citations. There are no servers to patch, no idle compute, and a cost profile that follows usage rather than uptime.
The entire system is serverless. API Gateway and Lambda handle the request, Bedrock performs retrieval and generation, and the source documents reside in S3.
Each component scales independently, and most scale to zero. If no questions are asked for a week, the only ongoing charge is the modest cost of the indexed data in storage. This economic profile is significant when deploying an internal tool that may serve fifty users rather than fifty thousand.
Why Bedrock, and why serverless
It is entirely possible to build this independently: provision OpenSearch or pgvector, run embedding jobs, implement the retrieval loop, and integrate a model. That approach works, but it leaves you operating a small distributed system whose sole purpose is to supply text to a language model. For most teams, this is undifferentiated effort.
Bedrock provides the foundation models, the managed retrieval layer, and a single set of APIs to drive both. Building on top of it with API Gateway and Lambda removes instance sizing, autoscaling configuration, and idle overnight capacity. For an internal assistant with bursty and unpredictable traffic, this is the appropriate architecture.
How RAG works
Before reviewing the code, it is worth being precise about what occurs on every request, because understanding the six steps allows cost, latency, and accuracy to be reasoned about separately.
Retrieval and generation are distinct problems. The majority of accuracy issues originate in steps two and three rather than in the model.
The question arrives, is converted into a vector embedding, and that vector is used to retrieve the closest matching passages from the index. Those passages are incorporated into the prompt, the foundation model generates a response from them, and the answer is returned with its citations. The key point is that the model sits near the end of the pipeline rather than at its centre. When a RAG system produces a poor answer, the cause is usually weak retrieval rather than the model itself. That is where tuning effort should be directed.
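To make the separation between retrieval and generation concrete, the sketch below walks through the retrieval and prompt-assembly steps locally. It is a toy illustration only: the bag-of-words embed function stands in for a real embedding model, the three passages are invented sample data, and none of the function names correspond to Bedrock APIs.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, top_k=2):
    """Steps two and three: embed the question, then rank passages by similarity."""
    query_vector = embed(question)
    ranked = sorted(
        passages, key=lambda p: cosine(query_vector, embed(p)), reverse=True
    )
    return ranked[:top_k]


def build_prompt(question, context):
    """Step four: place the retrieved passages ahead of the question."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(context, start=1))
    return (
        "Answer using only the passages below and cite them by number.\n"
        f"{numbered}\n\nQuestion: {question}"
    )


# Invented sample passages, used purely for illustration.
PASSAGES = [
    "Employees accrue 25 days of annual leave per year.",
    "The payments service runbook covers restart and failover procedures.",
    "Contractors are not eligible for paid annual leave.",
]

question = "How many days of annual leave do employees get?"
context = retrieve(question, PASSAGES)
prompt = build_prompt(question, context)
```

If the wrong passages appear in context, no prompt wording downstream will rescue the answer, which is why retrieval is the first place to look when results disappoint.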
Step 1: Store documents in S3
This step is deliberately simple. Create a bucket, upload your PDFs, Markdown files, HTML exports, or Word documents, and the task is complete.
aws s3 mb s3://my-company-knowledge --region us-east-1
aws s3 sync ./docs s3://my-company-knowledge/docs/
To ensure clean citations, keep filenames human readable. A passage that traces back to leave-policy-2026.pdf is considerably more useful in an answer than one referencing doc_final_v3_FINAL.pdf.
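The same upload can be scripted when it needs to run from a pipeline. The following is a minimal sketch rather than a definitive implementation: plan_uploads and upload_all are hypothetical helper names, the list of accepted suffixes is an assumption based on the file types mentioned above, and s3_client is expected to be a boto3 S3 client created by the caller.

```python
from pathlib import Path

# Assumed set of document types; adjust to whatever the knowledge base accepts.
ALLOWED_SUFFIXES = {".pdf", ".md", ".html", ".docx"}


def plan_uploads(root, prefix="docs/"):
    """Map each supported file under root to the S3 key it would receive."""
    root = Path(root)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in ALLOWED_SUFFIXES:
            # Keep the relative path so the key stays human readable.
            key = prefix + path.relative_to(root).as_posix()
            plan.append((path, key))
    return plan


def upload_all(s3_client, bucket, root, prefix="docs/"):
    """Upload every planned file and return the keys that were written."""
    keys = []
    for path, key in plan_uploads(root, prefix):
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

In use, this would be called as upload_all(boto3.client("s3"), "my-company-knowledge", "./docs"). Passing the client in keeps the helper easy to exercise with a stub before any real bucket is touched.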
Step 2: Create the Knowledge Base
This is the step that previously constituted a project in its own right. With Managed Knowledge Base, open the Bedrock console, navigate to Knowledge Bases, and select Create Managed KB. Choose a connector from the dropdown; at launch there are six native options: Amazon S3, SharePoint, Confluence, Google Drive, OneDrive, and a Web Crawler. The IAM role is created automatically and can be edited if your security requirements demand it.
Behind that interface, three capabilities now operate on your behalf that previously required manual configuration.
The left column represents work formerly required on every project. The right column represents what the managed service now absorbs.
Smart Parsing determines how to interpret each file type, so a table within a PDF and a page from the web crawler are processed differently and correctly, without manual parser configuration. The service also selects and manages a default embedding model, a re-ranker, and a generation model, enabling deployment in minutes rather than weeks. The Agentic Retriever handles complex queries: when a question requires several chained lookups, it plans those steps, executes them, evaluates the intermediate results, and concludes once sufficient context has been gathered. This last capability is genuinely new, as multi-hop retrieval was previously orchestration code that had to be written and maintained.
The flexibility that makes Bedrock valuable is retained. Any foundation model on Bedrock can power the generation step, and embedding or re-ranking models can be changed later to balance accuracy against cost. If you already call the existing Knowledge Base APIs such as Retrieve and IngestKnowledgeBaseDocuments, no code changes are required; you simply reference the new knowledge base ID.
Once the source has synced, copy the knowledge base ID. It is the only piece of state the application requires.
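Before wiring up generation, it can be useful to inspect what the index actually returns for a question. The sketch below uses the existing Retrieve API through an injected bedrock-agent-runtime client; preview_retrieval is a hypothetical helper name, and the response handling assumes S3-backed sources and the response shape of the existing Knowledge Base API.

```python
def preview_retrieval(client, kb_id, question, top_k=5):
    """Return (score, source URI, snippet) for each passage the index returns."""
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    rows = []
    for result in response.get("retrievalResults", []):
        # Only S3 locations are read here; other connectors report other fields.
        uri = result.get("location", {}).get("s3Location", {}).get("uri")
        snippet = result.get("content", {}).get("text", "")[:80]
        rows.append((result.get("score"), uri, snippet))
    return rows
```

A call such as preview_retrieval(boto3.client("bedrock-agent-runtime"), "YOUR_KNOWLEDGE_BASE_ID", "How much annual leave do I get?") shows whether the right documents surface before any model tokens are spent.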
Step 3: The Lambda function
This is where the architecture demonstrates its value. The complete retrieval and generation flow is a single API call, retrieve_and_generate, against the bedrock-agent-runtime client. Bedrock embeds the question, searches the index, constructs the augmented prompt, invokes the model, and returns the answer together with citations.
import json

import boto3

# Created once per execution environment and reused across invocations.
agent_runtime = boto3.client("bedrock-agent-runtime")

KB_ID = "YOUR_KNOWLEDGE_BASE_ID"
MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-5-sonnet-20241022-v2:0"
)


def handler(event, context):
    # With the proxy integration, the request payload arrives as a JSON string.
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        body = None
    if not isinstance(body, dict):
        return _response(400, {"error": "body must be a JSON object"})

    question = str(body.get("question") or "").strip()
    session_id = body.get("sessionId")  # optional, for multi-turn conversations
    if not question:
        return _response(400, {"error": "question is required"})

    kwargs = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": KB_ID,
                "modelArn": MODEL_ARN,
                "retrievalConfiguration": {
                    "vectorSearchConfiguration": {"numberOfResults": 5}
                },
            },
        },
    }
    # Carry the session forward so follow-up questions keep context.
    if session_id:
        kwargs["sessionId"] = session_id

    result = agent_runtime.retrieve_and_generate(**kwargs)
    answer = result["output"]["text"]
    citations = _extract_sources(result.get("citations", []))
    return _response(200, {
        "answer": answer,
        "sessionId": result.get("sessionId"),
        "sources": citations,
    })


def _extract_sources(citations):
    # Collect each cited S3 URI once, preserving the order of first appearance.
    sources = []
    for citation in citations:
        for ref in citation.get("retrievedReferences", []):
            location = ref.get("location", {})
            uri = location.get("s3Location", {}).get("uri")
            if uri and uri not in sources:
                sources.append(uri)
    return sources


def _response(status, payload):
    return {
        "statusCode": status,
        "headers": {
            "Content-Type": "application/json",
            # Wildcard CORS is convenient for testing; restrict it to the
            # frontend's origin before production use.
            "Access-Control-Allow-Origin": "*",
        },
        "body": json.dumps(payload),
    }
Several details in the handler merit attention. The numberOfResults value controls how many passages are included in the context and is the first parameter to adjust: raise it when answers miss relevant material, and lower it when they lose focus. The sessionId returned by Bedrock enables conversational behaviour: sending it back on the subsequent call allows the model to maintain context, so a user may ask a follow-up question such as "and what about contractors?" without repeating earlier information.
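The session handling can be illustrated in isolation. The sketch below is one possible shape, not part of the handler: Conversation is a hypothetical class, the client is an injected bedrock-agent-runtime client, and top_k is simply a local name for numberOfResults.

```python
class Conversation:
    """Keeps the Bedrock sessionId between turns so follow-ups retain context."""

    def __init__(self, client, kb_id, model_arn, top_k=5):
        if top_k < 1:
            raise ValueError("top_k must be at least 1")
        self.client = client
        self.kb_id = kb_id
        self.model_arn = model_arn
        self.top_k = top_k
        self.session_id = None  # set after the first successful turn

    def ask(self, question):
        kwargs = {
            "input": {"text": question},
            "retrieveAndGenerateConfiguration": {
                "type": "KNOWLEDGE_BASE",
                "knowledgeBaseConfiguration": {
                    "knowledgeBaseId": self.kb_id,
                    "modelArn": self.model_arn,
                    "retrievalConfiguration": {
                        "vectorSearchConfiguration": {
                            "numberOfResults": self.top_k
                        }
                    },
                },
            },
        }
        # Only send a sessionId that Bedrock itself issued on an earlier turn.
        if self.session_id:
            kwargs["sessionId"] = self.session_id
        result = self.client.retrieve_and_generate(**kwargs)
        self.session_id = result.get("sessionId", self.session_id)
        return result["output"]["text"]
```

The first call omits sessionId entirely and every later call reuses the value Bedrock returned, which mirrors what the browser is expected to do with the handler's response.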
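For this to operate, the Lambda execution role requires permission to invoke the model and query the knowledge base. The policy should remain tightly scoped. The bedrock:RetrieveAndGenerate action is not tied to a resource ARN, so the scoping comes from bedrock:Retrieve on the specific knowledge base (required alongside RetrieveAndGenerate for the existing Knowledge Base APIs, and assumed here to apply to Managed Knowledge Base as well) and from bedrock:InvokeModel on foundation models in the chosen region. ACCOUNT_ID and YOUR_KNOWLEDGE_BASE_ID are placeholders:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "bedrock:RetrieveAndGenerate",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:Retrieve",
      "Resource": "arn:aws:bedrock:us-east-1:ACCOUNT_ID:knowledge-base/YOUR_KNOWLEDGE_BASE_ID"
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*"
    }
  ]
}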
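One way to keep the policy scoped as environments multiply is to generate it from the identifiers already in use. The following is a sketch under stated assumptions: build_policy is a hypothetical helper, it narrows InvokeModel to a single model ID, and it assumes that bedrock:Retrieve on the knowledge base ARN is required alongside bedrock:RetrieveAndGenerate, as it is for the existing Knowledge Base APIs.

```python
import json


def build_policy(region, account_id, kb_id, model_id):
    """Build an execution-role policy limited to one knowledge base and one model."""
    kb_arn = f"arn:aws:bedrock:{region}:{account_id}:knowledge-base/{kb_id}"
    model_arn = f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # This action is granted on "*"; the statements below do the scoping.
            {
                "Effect": "Allow",
                "Action": "bedrock:RetrieveAndGenerate",
                "Resource": "*",
            },
            # Assumption: retrieval is authorised per knowledge base.
            {"Effect": "Allow", "Action": "bedrock:Retrieve", "Resource": kb_arn},
            {"Effect": "Allow", "Action": "bedrock:InvokeModel", "Resource": model_arn},
        ],
    }


policy_document = json.dumps(
    build_policy(
        "us-east-1",
        "ACCOUNT_ID",
        "YOUR_KNOWLEDGE_BASE_ID",
        "anthropic.claude-3-5-sonnet-20241022-v2:0",
    ),
    indent=2,
)
```

Narrowing InvokeModel to one model is stricter than a wildcard, at the price of a policy update whenever the generation model changes.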
Step 4: Expose it through API Gateway
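Connect the Lambda function to an HTTP API with a Lambda proxy integration. This provides a public HTTPS endpoint that the frontend can call. The quick-create form below attaches the function to a catch-all $default route on an automatically deployed stage, so a POST to /ask reaches the handler; a dedicated POST route can replace the catch-all once the API exists. API Gateway also needs permission to invoke the function, which is granted through the function's resource-based policy (aws lambda add-permission).
aws apigatewayv2 create-api \
  --name rag-chatbot \
  --protocol-type HTTP \
  --target arn:aws:lambda:us-east-1:ACCOUNT_ID:function:rag-handler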
From the browser, the endpoint is called with a standard fetch request. Send the question and store the returned sessionId so that the next message continues the conversation.
async function ask(question, sessionId) {
  const res = await fetch("https://YOUR_API.execute-api.us-east-1.amazonaws.com/ask", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question, sessionId }),
  });
  return res.json(); // { answer, sessionId, sources }
}
That constitutes the entire application: a bucket, a knowledge base, one Lambda function, and one route.
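For smoke testing from a terminal or a CI job, the same request can be sent without a browser. The sketch below uses only the Python standard library; build_ask_request and ask are hypothetical helper names, and the endpoint URL is the same placeholder used in the fetch example.

```python
import json
import urllib.request


def build_ask_request(endpoint, question, session_id=None):
    """Build the POST request the handler expects, without sending it."""
    payload = {"question": question}
    if session_id:
        payload["sessionId"] = session_id
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(endpoint, question, session_id=None, timeout=30):
    """Send the question and return the decoded JSON response body."""
    request = build_ask_request(endpoint, question, session_id)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A first call would look like ask("https://YOUR_API.execute-api.us-east-1.amazonaws.com/ask", "How much annual leave do I get?"), with the returned sessionId passed back in on the next call.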
Guardrails
Grounding the model in your data reduces hallucination, but it does not address every risk. If the system will be used by real users, attach a Bedrock Guardrail to filter prompt injection attempts, restrict topics that should not be discussed, and redact sensitive data such as account numbers before it reaches the model. This is a configuration step rather than code, and on an enterprise deployment it should be treated as mandatory rather than optional.
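Although the guardrail itself is defined in configuration, the request still has to reference it. In the existing RetrieveAndGenerate API this is done through a guardrailConfiguration entry under generationConfiguration; the sketch below assumes Managed Knowledge Base honours the same field. with_guardrail is a hypothetical helper, and the guardrail ID and version are placeholders supplied by the caller.

```python
import copy


def with_guardrail(kwargs, guardrail_id, guardrail_version):
    """Return a copy of the retrieve_and_generate kwargs with a guardrail attached."""
    updated = copy.deepcopy(kwargs)  # leave the caller's dictionary untouched
    kb_config = updated["retrieveAndGenerateConfiguration"][
        "knowledgeBaseConfiguration"
    ]
    generation = kb_config.setdefault("generationConfiguration", {})
    generation["guardrailConfiguration"] = {
        "guardrailId": guardrail_id,
        "guardrailVersion": guardrail_version,
    }
    return updated
```

In the handler this would be applied to kwargs immediately before the retrieve_and_generate call, with the guardrail identifier read from configuration rather than hard-coded.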
Cost considerations
The system is inexpensive to operate and close to free when idle, which is the central advantage of a serverless design. With Managed Knowledge Base, charges apply on two dimensions: the volume of indexed data retained, and the number of retrievals performed on demand. In addition, standard foundation model token costs apply to each generation, along with the small per-request charges for Lambda and API Gateway.
In practice, model token costs dominate the bill once meaningful traffic arrives, while indexed data represents a steady and predictable line item. The value to monitor is numberOfResults together with passage size, since additional retrieved context increases input tokens on every request. New AWS accounts also receive a Bedrock free tier, which is sufficient for building and testing the system before production. Consult the Bedrock pricing page for current figures in your region, as these values change frequently.
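The effect of numberOfResults on the bill is simple arithmetic, sketched below. Every number in the example call is a placeholder chosen for easy mental arithmetic, not a real Bedrock price or a measured token count, and estimate_generation_cost is a hypothetical helper that ignores retrieval, storage, Lambda, and API Gateway charges.

```python
def estimate_generation_cost(
    requests,
    number_of_results,
    tokens_per_passage,
    question_tokens,
    answer_tokens,
    input_price_per_1k,
    output_price_per_1k,
):
    """Estimate model token cost for a number of requests (generation only)."""
    # Every retrieved passage is sent to the model as input on every request.
    input_tokens = number_of_results * tokens_per_passage + question_tokens
    per_request = (
        input_tokens / 1000 * input_price_per_1k
        + answer_tokens / 1000 * output_price_per_1k
    )
    return requests * per_request


# Placeholder figures in arbitrary currency units, for illustration only.
baseline = estimate_generation_cost(1000, 5, 300, 100, 200, 1.0, 2.0)
doubled = estimate_generation_cost(1000, 10, 300, 100, 200, 1.0, 2.0)
```

With these placeholder inputs, doubling numberOfResults raises the estimate from roughly 2000 to roughly 3500 units, because the output side is unchanged while the input side nearly doubles.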
Common pitfalls
A number of issues commonly cause problems.
Retrieval quality is determined at ingestion rather than at query time. If answers are vague, the remedy is usually found upstream in how documents were parsed and chunked rather than in the prompt. Smart Parsing improves this considerably, but poor source material still produces poor results.
Citations are a feature and should be used. Displaying the source file in the interface converts an opaque response into one the user can verify, and that verifiability often determines whether a tool is adopted or quietly abandoned.
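A small formatting step is enough to surface the sources list in the interface. The sketch below assumes the sources are the S3 URIs returned by the handler; source_label and format_answer are hypothetical helper names.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def source_label(uri):
    """Reduce an S3 URI to its filename, falling back to the URI itself."""
    name = PurePosixPath(urlparse(uri).path).name
    return name or uri


def format_answer(answer, sources):
    """Append a numbered source list to the answer when citations exist."""
    if not sources:
        return answer
    lines = [
        f"[{i}] {source_label(uri)}" for i, uri in enumerate(sources, start=1)
    ]
    return answer + "\n\nSources:\n" + "\n".join(lines)
```

This is also where human-readable filenames pay off: a label such as leave-policy-2026.pdf tells the user where to look without opening the file.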
Region availability is not yet universal. At launch, Managed Knowledge Base is available in a defined set of regions including US East and West, Sydney, Tokyo, Dublin, Frankfurt, London, and GovCloud West. Confirm availability before assuming that your preferred region is supported, particularly for deployments outside North America and Europe.
Keep the model selection flexible. A notable strength of this architecture is that the generation model is simply an ARN. When an improved or more economical model becomes available, a single string is changed and the system is redeployed. Avoid embedding assumptions about a specific model elsewhere in the system.
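One way to keep that string out of the code is to read it from the function's environment. The sketch below is an assumption about how the handler could be configured, not something the handler shown earlier already does: load_settings and the variable names KB_ID, MODEL_ARN, and NUMBER_OF_RESULTS are hypothetical.

```python
import os

# Fallback matches the model used in the handler earlier in this article.
DEFAULT_MODEL_ARN = (
    "arn:aws:bedrock:us-east-1::foundation-model/"
    "anthropic.claude-3-5-sonnet-20241022-v2:0"
)


def load_settings(env=os.environ):
    """Read deployment settings from environment variables."""
    kb_id = env.get("KB_ID")
    if not kb_id:
        raise RuntimeError("KB_ID environment variable is required")
    return {
        "kb_id": kb_id,
        "model_arn": env.get("MODEL_ARN", DEFAULT_MODEL_ARN),
        "top_k": int(env.get("NUMBER_OF_RESULTS", "5")),
    }
```

Changing the model then becomes a configuration update on the Lambda function rather than a code change.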
Conclusion
A grounded, serverless chatbot operating over private documents was previously a multi-week undertaking that included a vector database requiring ongoing operation. In 2026 it consists of a bucket, a managed knowledge base, and a single Lambda call. The infrastructure has moved into the platform, which allows development effort to focus on what matters most: the quality of the data and the experience presented to users.
For a first implementation, begin with a narrow scope. Point the system at a single, well-organised document set, refine retrieval until it is accurate, add guardrails, and only then expand the connectors. A focused system that is correct earns trust far more quickly than a broad one that is merely plausible.















