Knowledge Bases for Amazon Bedrock is a managed retrieval-augmented generation (RAG) service. You point it at a data source (S3, a website, a SaaS connector), pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, citation tracking, and grounded generation. The result is a pair of API calls — retrieve for raw chunks, retrieve_and_generate for a fully grounded answer — that replaces a meaningful slice of custom RAG plumbing.
A Knowledge Base is a thin orchestrator over four pieces: a data source, an embedding model, a vector store, and (optionally) a generator FM.
An ingestion job walks the data source, parses, chunks, embeds, and writes to the vector store. After ingestion, queries hit the vector store and (optionally) the FM for generation.
A KB can pull from three kinds of data source:
- S3: the standard source; documents plus optional per-file metadata sidecars (e.g. doc.pdf.metadata.json).
- Web crawler: respects robots.txt; rate-limited to avoid hammering origin sites.
- Direct ingestion: call IngestKnowledgeBaseDocuments with documents constructed in your own code — useful when the source is a database query or an in-house API.

Attach metadata to a chunk to enable filtered retrieval (e.g. only "year=2026" docs). Drop a JSON file next to each source file:
{
  "metadataAttributes": {
    "year": { "value": { "type": "NUMBER", "numberValue": 2026 } },
    "department": { "value": { "type": "STRING", "stringValue": "HR" } },
    "tags": { "value": { "type": "STRING_LIST", "stringListValue": ["policy", "leave"] } }
  }
}
Filename convention: if the source is policies/2026-leave.pdf, the sidecar is policies/2026-leave.pdf.metadata.json.
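To generate the sidecar programmatically, write it in the same step as the document upload. A minimal sketch, assuming plain boto3 S3 access (the bucket, key, and upload_with_metadata helper are illustrative, not part of the KB API):

import json
import boto3

s3 = boto3.client("s3")

def upload_with_metadata(bucket, key, body, attrs):
    # Hypothetical helper: upload the document, then its .metadata.json sidecar,
    # so the next ingestion job indexes both together.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    s3.put_object(
        Bucket=bucket,
        Key=f"{key}.metadata.json",
        Body=json.dumps({"metadataAttributes": attrs}).encode("utf-8"),
        ContentType="application/json",
    )

upload_with_metadata(
    "company-hr-docs",
    "policies/2026-leave.pdf",
    open("2026-leave.pdf", "rb").read(),
    {"year": {"value": {"type": "NUMBER", "numberValue": 2026}},
     "department": {"value": {"type": "STRING", "stringValue": "HR"}}},
)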
For each store you must pre-create the collection/database and pass field-mapping hints (vector field, text field, metadata field) when creating the KB.
The embedding model options:
- Titan Text Embeddings v2 (amazon.titan-embed-text-v2:0): 1024 default dims (also 512 and 256 via the dimensions parameter), normalized output, multilingual. Default and best general-purpose pick on Bedrock.
- Titan Text Embeddings v1 (amazon.titan-embed-text-v1): 1536 dims. Older but still serviceable.
- Cohere Embed Multilingual (cohere.embed-multilingual-v3): 1024 dims, optimized for >100 languages. Best when the corpus is non-English or mixed-language.
- Cohere Embed English (cohere.embed-english-v3): 1024 dims, English-only.

Pick the embedding model up front and treat it as immutable — switching models means reindexing the entire corpus. Smaller dimensions (Titan v2 at 256d) cut storage and latency by ~4x with a small recall penalty; worth measuring on your data.
import boto3
agent = boto3.client("bedrock-agent", region_name="us-west-2")
kb = agent.create_knowledge_base(
    name="hr-policies",
    description="Internal HR policy documents (US, EMEA, APAC).",
    roleArn="arn:aws:iam::111111111111:role/BedrockKBRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
            "embeddingModelConfiguration": {
                "bedrockEmbeddingModelConfiguration": {
                    "dimensions": 1024,
                    "embeddingDataType": "FLOAT32",
                }
            },
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-west-2:111111111111:collection/abc123",
            "vectorIndexName": "hr-policies-idx",
            "fieldMapping": {
                "vectorField": "embedding",
                "textField": "text",
                "metadataField": "metadata",
            },
        },
    },
)
kb_id = kb["knowledgeBase"]["knowledgeBaseId"]
print("KB:", kb_id)
ds = agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="hr-policies-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::company-hr-docs",
            "inclusionPrefixes": ["policies/"],
            "bucketOwnerAccountId": "111111111111",
        },
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent
                    {"maxTokens": 300},   # child
                ],
                "overlapTokens": 60,
            },
        },
    },
)
ds_id = ds["dataSource"]["dataSourceId"]
print("DS:", ds_id)
Ingestion jobs are async. Trigger one whenever the data source changes; Bedrock detects added/modified/deleted files and updates only the affected chunks (incremental sync).
import time
job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
job_id = job["ingestionJob"]["ingestionJobId"]
while True:
    status = agent.get_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=ds_id, ingestionJobId=job_id,
    )["ingestionJob"]
    state = status["status"]
    print(state, status.get("statistics", {}))
    if state in ("COMPLETE", "FAILED", "STOPPED"):
        break
    time.sleep(10)
The statistics block reports documents scanned, indexed, modified, deleted, and failed — log it to CloudWatch as your ingestion SLO.
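A hedged sketch of turning that block into CloudWatch custom metrics (the namespace is made up, and the loop simply forwards whatever numeric fields the job reports rather than guessing exact field names):

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

stats = status.get("statistics", {})  # last response from the polling loop above
if stats:
    cloudwatch.put_metric_data(
        Namespace="KnowledgeBase/Ingestion",  # illustrative namespace
        MetricData=[
            {"MetricName": name, "Value": float(value)}
            for name, value in stats.items()
            if isinstance(value, (int, float))
        ],
    )

Alarm on the failed-documents metric so a quietly broken source file doesn't erode recall.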
Trigger ingestion automatically by wiring an S3 EventBridge rule on Object Created events to a Lambda that calls start_ingestion_job.
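A minimal sketch of that Lambda (the environment variable names are assumptions):

import os
import boto3

agent = boto3.client("bedrock-agent")

def handler(event, context):
    # Fired by the EventBridge rule on Object Created events; kicks off an
    # incremental sync of the data source that backs the bucket.
    job = agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KB_ID"],         # assumed env var
        dataSourceId=os.environ["DATA_SOURCE_ID"],   # assumed env var
    )
    return {"ingestionJobId": job["ingestionJob"]["ingestionJobId"]}

Bursty uploads can trigger overlapping invocations; if a sync is already running the call may be rejected, so debounce or catch that error rather than retrying blindly.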
retrieve — raw chunks only

Use this when you want to do your own prompting, rerank with a different model, or display raw search results.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")
resp = runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalQuery={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",  # SEMANTIC | HYBRID
            "filter": {
                "andAll": [
                    {"equals": {"key": "department", "value": "HR"}},
                    {"greaterThan": {"key": "year", "value": 2024}},
                ]
            },
        }
    },
)
for r in resp["retrievalResults"]:
    print(round(r["score"], 3), r["location"], r["content"]["text"][:120])
retrieve_and_generate — grounded answer in one call
resp = runtime.retrieve_and_generate(
    input={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 8, "overrideSearchType": "HYBRID"}
            },
            "generationConfiguration": {
                "inferenceConfig": {"textInferenceConfig": {
                    "temperature": 0.0, "maxTokens": 600,
                }},
                "promptTemplate": {"textPromptTemplate": (
                    "You are an HR assistant. Answer using ONLY the search results below. "
                    "If the answer is not present, say 'I don't have that policy on file.'\n\n"
                    "$search_results$\n\nQuestion: $query$"
                )},
            },
        },
    },
)
print(resp["output"]["text"])
Pass sessionId from one call into the next so the KB carries chat context (it rewrites follow-up questions like "what about APAC?" into standalone queries before retrieving).
session_id = resp["sessionId"]
followup = runtime.retrieve_and_generate(
    input={"text": "What about APAC?"},
    sessionId=session_id,
    retrieveAndGenerateConfiguration=resp_config,  # same as above
)
Every retrieve_and_generate response includes a citations array that maps spans of the generated text to specific retrieved chunks. Surface these in the UI to let users verify the answer.
text = resp["output"]["text"]
for cite in resp.get("citations", []):
    span = cite["generatedResponsePart"]["textResponsePart"]["span"]
    quoted = text[span["start"]:span["end"] + 1]
    print(f"---\nCLAIM: {quoted}")
    for ref in cite["retrievedReferences"]:
        loc = ref["location"]
        kind = loc["type"]
        if kind == "S3":
            print(f"  source: {loc['s3Location']['uri']}")
        elif kind == "WEB":
            print(f"  source: {loc['webLocation']['url']}")
        print(f"  chunk: {ref['content']['text'][:160]}...")
Citations are also the raw material for hallucination guardrails — wire them into a contextual-grounding guardrail (see Bedrock Guardrails) to block answers that drift from the cited context.
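If you do attach one, it rides along in the generation configuration of retrieve_and_generate. A sketch, assuming a guardrail with a contextual-grounding policy already exists (the ID and version below are placeholders):

guarded = runtime.retrieve_and_generate(
    input={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
            "generationConfiguration": {
                "guardrailConfiguration": {       # attach the pre-built guardrail
                    "guardrailId": "gr-EXAMPLE123",   # placeholder
                    "guardrailVersion": "1",          # placeholder
                },
            },
        },
    },
)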
Default parsing extracts plain text — fine for prose but loses structure in slide decks, tables, and financial PDFs. Enable advanced parsing to use a foundation model to interpret each page as Markdown, preserving tables, headings, and figure captions.
agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="financial-reports-s3",
    dataSourceConfiguration={"type": "S3", "s3Configuration": {
        "bucketArn": "arn:aws:s3:::company-finance-docs",
    }},
    vectorIngestionConfiguration={
        "parsingConfiguration": {
            "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
            "bedrockFoundationModelConfiguration": {
                "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
                "parsingPrompt": {"parsingPromptText": (
                    "Convert each page to Markdown. Preserve tables as GitHub-flavored "
                    "Markdown tables. Render figures as '![figure: ]'."
                )},
            },
        },
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 500, "overlapPercentage": 15},
        },
    },
)
FM parsing costs more (one model call per page) and slows ingestion materially. Reserve it for documents where layout actually carries meaning — annual reports, scientific papers, regulatory filings.
You can also split the difference: use the KB for retrieval only (retrieve), then run your own prompt construction, reranker, and generation. You get managed ingestion + indexing without giving up prompt control.

Knowledge Bases for Bedrock collapse most of the RAG plumbing into a managed service. The trade-off — as always with managed services — is lower flexibility on the retrieval pipeline. Start with the KB; reach for custom RAG only when an evaluation actually fails because of it.
A few operational notes:
- Tune numberOfResults empirically. The default of 5 is fine for narrow corpora; for broad corpora bump to 8–12 and let the FM filter. Each extra result is more tokens in the prompt and a small latency cost — measure both.
- Filtering on a high-cardinality metadata key (e.g. user_id) on OpenSearch Serverless can fan out into many shards. For tenant isolation prefer one index per tenant or one collection per tenant rather than a metadata filter.
- overrideSearchType=HYBRID blends BM25 with vector similarity. It's almost always a win for queries containing exact identifiers (order numbers, error codes, SKUs) the embedding model has never seen.
- The statistics.documentsFailed count silently grows when source files are corrupt, encrypted, or exceed size limits. Wire a CloudWatch alarm on it.
- retrieve_and_generate sends every retrieved chunk into the FM prompt. Bigger chunks and more results = higher per-query token cost. Watch the $search_results$ token count.

A Knowledge Base is a managed RAG pipeline: it ingests documents from S3 (or web, Confluence, Salesforce, SharePoint), chunks them, embeds with a model like Titan or Cohere, writes vectors to a configured vector store, and exposes Retrieve and RetrieveAndGenerate APIs. AWS handles the ingestion job, retries, status tracking, and incremental sync — you bring the source bucket and pick the embedding model, chunking strategy, and vector store. It eliminates writing your own LangChain ingestion code.
OpenSearch Serverless is the default — fully managed, auto-scales, supports hybrid BM25 + vector, and integrates natively. Aurora PostgreSQL with pgvector is best when you already run Aurora and want SQL joins between vectors and operational data. Pinecone or MongoDB Atlas are options when those are your existing standard. Neptune Analytics fits when retrieval is graph-shaped. For most greenfield workloads, OpenSearch Serverless wins on operational simplicity; for tenant-isolated SaaS, one collection per tenant is usually safer than metadata filtering.
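For the Aurora path, the storage block passed to create_knowledge_base points at the cluster instead of an OpenSearch collection. A hedged sketch (every ARN, name, and column below is a placeholder, and the table with a pgvector column must exist before the KB is created):

rds_storage = {
    "type": "RDS",
    "rdsConfiguration": {
        "resourceArn": "arn:aws:rds:us-west-2:111111111111:cluster:kb-cluster",                # placeholder
        "credentialsSecretArn": "arn:aws:secretsmanager:us-west-2:111111111111:secret:kb-db",  # placeholder
        "databaseName": "kb",
        "tableName": "bedrock_kb_chunks",
        "fieldMapping": {
            "primaryKeyField": "id",
            "vectorField": "embedding",
            "textField": "chunk_text",
            "metadataField": "metadata",
        },
    },
}
# pass storageConfiguration=rds_storage instead of the OPENSEARCH_SERVERLESS block above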
Fixed-size (e.g. 300 tokens with 60-token overlap) is cheapest and the right default for clean prose. Hierarchical chunking embeds both small child chunks (for retrieval precision) and larger parent chunks (returned for context) — better recall on long documents at roughly 2x ingestion cost. Semantic chunking splits at sentence-boundary embedding-similarity drops, preserving topical coherence — most expensive at ingest but best for mixed-topic documents like meeting transcripts or RFCs.
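For reference, a semantic-chunking ingestion block looks roughly like this (values are illustrative; the field names are my reading of the API and worth checking against the current docs):

semantic_ingestion = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                     # hard cap on any single chunk
            "bufferSize": 1,                      # sentences of surrounding context when comparing
            "breakpointPercentileThreshold": 95,  # split where similarity drops below this percentile
        },
    },
}
# pass vectorIngestionConfiguration=semantic_ingestion to create_data_source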
Advanced parsing routes each page through a foundation model (Claude or Titan multimodal) that reads the page as an image and emits structured Markdown — preserving tables, multi-column layouts, equations, and figure captions that plain text extractors mangle. It costs one FM call per page so it can dwarf the embedding bill on large PDF corpora. Reserve it for layout-heavy documents (financial filings, scientific papers, scanned forms); for clean HTML or plain Markdown, the default parser is fine.
Use Retrieve when you want raw chunks plus scores and need to compose your own prompt — for example to mix KB results with tool-call output, to apply a custom system prompt, or to use a model not supported by RetrieveAndGenerate. Use RetrieveAndGenerate for the standard "answer this with citations" path; AWS builds the prompt, calls the FM, and returns the answer with source attributions. RetrieveAndGenerate is fewer lines of code and gives you citations for free; Retrieve gives you control.
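A sketch of the Retrieve-then-compose path, using the Converse API for the generation step (the system prompt is an assumption, and the model ID reuses the one from earlier in this post):

query = "How many weeks of parental leave do EMEA employees get?"
chunks = runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalQuery={"text": query},
)["retrievalResults"]

# Compose the prompt yourself: here, a plain join of the retrieved chunk texts.
context = "\n\n".join(c["content"]["text"] for c in chunks)

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
answer = bedrock.converse(
    modelId="anthropic.claude-opus-4-7",
    system=[{"text": "Answer only from the provided context. If it isn't there, say so."}],
    messages=[{"role": "user", "content": [{"text": f"Context:\n{context}\n\nQuestion: {query}"}]}],
)
print(answer["output"]["message"]["content"][0]["text"])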
Use incremental sync — the ingestion job tracks document checksums in S3 and re-embeds only changed or new files; deleted files are removed from the vector store. Trigger sync on an EventBridge schedule (hourly/daily) or on S3 event notifications via Lambda for near-real-time updates. For high-velocity sources, partition the bucket by recency so each sync scans a smaller prefix. Monitor statistics.documentsFailed and the ingestion job duration in CloudWatch; failed documents silently degrade recall if unwatched.