Knowledge Bases for Amazon Bedrock is a managed retrieval-augmented generation (RAG) service. You point it at a data source (S3, a website, a SaaS connector), pick an embedding model and a vector store, and Bedrock handles ingestion, chunking, embedding, indexing, retrieval, citation tracking, and grounded generation. The result is a pair of API calls — retrieve for raw chunks, retrieve_and_generate for a fully grounded answer — that replaces a meaningful slice of custom RAG plumbing.
A Knowledge Base is a thin orchestrator over four pieces: a data source, an embedding model, a vector store, and (optionally) a generator FM.
An ingestion job walks the data source, parses, chunks, embeds, and writes to the vector store. After ingestion, queries hit the vector store and (optionally) the FM for generation.
A KB can pull from three kinds of data source:
- S3: the standard source; documents plus optional per-file metadata sidecars (e.g. doc.pdf.metadata.json).
- Web crawler: respects robots.txt; rate-limited to avoid hammering origin sites.
- Direct ingestion: call IngestKnowledgeBaseDocuments with documents constructed in your own code — useful when the source is a database query or an in-house API.

Attach metadata to a chunk to enable filtered retrieval (e.g. only "year=2026" docs). Drop a JSON file next to each source file:
{
  "metadataAttributes": {
    "year": { "value": { "type": "NUMBER", "numberValue": 2026 } },
    "department": { "value": { "type": "STRING", "stringValue": "HR" } },
    "tags": { "value": { "type": "STRING_LIST", "stringListValue": ["policy", "leave"] } }
  }
}
Filename convention: if the source is policies/2026-leave.pdf, the sidecar is policies/2026-leave.pdf.metadata.json.
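To generate the sidecar programmatically, write it in the same step as the document upload. A minimal sketch, assuming plain boto3 S3 access (the bucket, key, and upload_with_metadata helper are illustrative, not part of the KB API):

import json
import boto3

s3 = boto3.client("s3")

def upload_with_metadata(bucket, key, body, attrs):
    # Hypothetical helper: upload the document, then its .metadata.json sidecar,
    # so the next ingestion job indexes both together.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    s3.put_object(
        Bucket=bucket,
        Key=f"{key}.metadata.json",
        Body=json.dumps({"metadataAttributes": attrs}).encode("utf-8"),
        ContentType="application/json",
    )

upload_with_metadata(
    "company-hr-docs",
    "policies/2026-leave.pdf",
    open("2026-leave.pdf", "rb").read(),
    {"year": {"value": {"type": "NUMBER", "numberValue": 2026}},
     "department": {"value": {"type": "STRING", "stringValue": "HR"}}},
)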
For each store you must pre-create the collection/database and pass field-mapping hints (vector field, text field, metadata field) when creating the KB.
The embedding model options:
- Titan Text Embeddings v2 (amazon.titan-embed-text-v2:0): 1024 default dims (also 512 and 256 via the dimensions parameter), normalized output, multilingual. Default and best general-purpose pick on Bedrock.
- Titan Text Embeddings v1 (amazon.titan-embed-text-v1): 1536 dims. Older but still serviceable.
- Cohere Embed Multilingual (cohere.embed-multilingual-v3): 1024 dims, optimized for >100 languages. Best when the corpus is non-English or mixed-language.
- Cohere Embed English (cohere.embed-english-v3): 1024 dims, English-only.

Pick the embedding model up front and treat it as immutable — switching models means reindexing the entire corpus. Smaller dimensions (Titan v2 at 256d) cut storage and latency by ~4x with a small recall penalty; worth measuring on your data.
import boto3
agent = boto3.client("bedrock-agent", region_name="us-west-2")
kb = agent.create_knowledge_base(
    name="hr-policies",
    description="Internal HR policy documents (US, EMEA, APAC).",
    roleArn="arn:aws:iam::111111111111:role/BedrockKBRole",
    knowledgeBaseConfiguration={
        "type": "VECTOR",
        "vectorKnowledgeBaseConfiguration": {
            "embeddingModelArn": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
            "embeddingModelConfiguration": {
                "bedrockEmbeddingModelConfiguration": {
                    "dimensions": 1024,
                    "embeddingDataType": "FLOAT32",
                }
            },
        },
    },
    storageConfiguration={
        "type": "OPENSEARCH_SERVERLESS",
        "opensearchServerlessConfiguration": {
            "collectionArn": "arn:aws:aoss:us-west-2:111111111111:collection/abc123",
            "vectorIndexName": "hr-policies-idx",
            "fieldMapping": {
                "vectorField": "embedding",
                "textField": "text",
                "metadataField": "metadata",
            },
        },
    },
)
kb_id = kb["knowledgeBase"]["knowledgeBaseId"]
print("KB:", kb_id)
ds = agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="hr-policies-s3",
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": {
            "bucketArn": "arn:aws:s3:::company-hr-docs",
            "inclusionPrefixes": ["policies/"],
            "bucketOwnerAccountId": "111111111111",
        },
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                "levelConfigurations": [
                    {"maxTokens": 1500},  # parent
                    {"maxTokens": 300},   # child
                ],
                "overlapTokens": 60,
            },
        },
    },
)
ds_id = ds["dataSource"]["dataSourceId"]
print("DS:", ds_id)
Ingestion jobs are async. Trigger one whenever the data source changes; Bedrock detects added/modified/deleted files and updates only the affected chunks (incremental sync).
import time
job = agent.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
job_id = job["ingestionJob"]["ingestionJobId"]
while True:
    status = agent.get_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=ds_id, ingestionJobId=job_id,
    )["ingestionJob"]
    state = status["status"]
    print(state, status.get("statistics", {}))
    if state in ("COMPLETE", "FAILED", "STOPPED"):
        break
    time.sleep(10)
The statistics block reports documents scanned, indexed, modified, deleted, and failed — log it to CloudWatch as your ingestion SLO.
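A hedged sketch of turning that block into CloudWatch custom metrics (the namespace is made up, and the loop simply forwards whatever numeric fields the job reports rather than guessing exact field names):

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

stats = status.get("statistics", {})  # last response from the polling loop above
if stats:
    cloudwatch.put_metric_data(
        Namespace="KnowledgeBase/Ingestion",  # illustrative namespace
        MetricData=[
            {"MetricName": name, "Value": float(value)}
            for name, value in stats.items()
            if isinstance(value, (int, float))
        ],
    )

Alarm on the failed-documents metric so a quietly broken source file doesn't erode recall.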
Trigger ingestion automatically by wiring an S3 EventBridge rule on Object Created events to a Lambda that calls start_ingestion_job.
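A minimal sketch of that Lambda (the environment variable names are assumptions):

import os
import boto3

agent = boto3.client("bedrock-agent")

def handler(event, context):
    # Fired by the EventBridge rule on Object Created events; kicks off an
    # incremental sync of the data source that backs the bucket.
    job = agent.start_ingestion_job(
        knowledgeBaseId=os.environ["KB_ID"],         # assumed env var
        dataSourceId=os.environ["DATA_SOURCE_ID"],   # assumed env var
    )
    return {"ingestionJobId": job["ingestionJob"]["ingestionJobId"]}

Bursty uploads can trigger overlapping invocations; if a sync is already running the call may be rejected, so debounce or catch that error rather than retrying blindly.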
retrieve — raw chunks only

Use this when you want to do your own prompting, rerank with a different model, or display raw search results.
runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")
resp = runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalQuery={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "overrideSearchType": "HYBRID",  # SEMANTIC | HYBRID
            "filter": {
                "andAll": [
                    {"equals": {"key": "department", "value": "HR"}},
                    {"greaterThan": {"key": "year", "value": 2024}},
                ]
            },
        }
    },
)
for r in resp["retrievalResults"]:
    print(round(r["score"], 3), r["location"], r["content"]["text"][:120])
retrieve_and_generate — grounded answer in one call
resp = runtime.retrieve_and_generate(
    input={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {"numberOfResults": 8, "overrideSearchType": "HYBRID"}
            },
            "generationConfiguration": {
                "inferenceConfig": {"textInferenceConfig": {
                    "temperature": 0.0, "maxTokens": 600,
                }},
                "promptTemplate": {"textPromptTemplate": (
                    "You are an HR assistant. Answer using ONLY the search results below. "
                    "If the answer is not present, say 'I don't have that policy on file.'\n\n"
                    "$search_results$\n\nQuestion: $query$"
                )},
            },
        },
    },
)
print(resp["output"]["text"])
Pass sessionId from one call into the next so the KB carries chat context (it rewrites follow-up questions like "what about APAC?" into standalone queries before retrieving).
session_id = resp["sessionId"]
followup = runtime.retrieve_and_generate(
    input={"text": "What about APAC?"},
    sessionId=session_id,
    retrieveAndGenerateConfiguration=resp_config,  # same as above
)
Every retrieve_and_generate response includes a citations array that maps spans of the generated text to specific retrieved chunks. Surface these in the UI to let users verify the answer.
text = resp["output"]["text"]
for cite in resp.get("citations", []):
    span = cite["generatedResponsePart"]["textResponsePart"]["span"]
    quoted = text[span["start"]:span["end"] + 1]
    print(f"---\nCLAIM: {quoted}")
    for ref in cite["retrievedReferences"]:
        loc = ref["location"]
        kind = loc["type"]
        if kind == "S3":
            print(f"  source: {loc['s3Location']['uri']}")
        elif kind == "WEB":
            print(f"  source: {loc['webLocation']['url']}")
        print(f"  chunk: {ref['content']['text'][:160]}...")
Citations are also the raw material for hallucination guardrails — wire them into a contextual-grounding guardrail (see Bedrock Guardrails) to block answers that drift from the cited context.
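If you do attach one, it rides along in the generation configuration of retrieve_and_generate. A sketch, assuming a guardrail with a contextual-grounding policy already exists (the ID and version below are placeholders):

guarded = runtime.retrieve_and_generate(
    input={"text": "How many weeks of parental leave do EMEA employees get?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
            "generationConfiguration": {
                "guardrailConfiguration": {       # attach the pre-built guardrail
                    "guardrailId": "gr-EXAMPLE123",   # placeholder
                    "guardrailVersion": "1",          # placeholder
                },
            },
        },
    },
)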
Default parsing extracts plain text — fine for prose but loses structure in slide decks, tables, and financial PDFs. Enable advanced parsing to use a foundation model to interpret each page as Markdown, preserving tables, headings, and figure captions.
agent.create_data_source(
    knowledgeBaseId=kb_id,
    name="financial-reports-s3",
    dataSourceConfiguration={"type": "S3", "s3Configuration": {
        "bucketArn": "arn:aws:s3:::company-finance-docs",
    }},
    vectorIngestionConfiguration={
        "parsingConfiguration": {
            "parsingStrategy": "BEDROCK_FOUNDATION_MODEL",
            "bedrockFoundationModelConfiguration": {
                "modelArn": "arn:aws:bedrock:us-west-2::foundation-model/anthropic.claude-opus-4-7",
                "parsingPrompt": {"parsingPromptText": (
                    "Convert each page to Markdown. Preserve tables as GitHub-flavored "
                    "Markdown tables. Render figures as '![figure: ]'."
                )},
            },
        },
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {"maxTokens": 500, "overlapPercentage": 15},
        },
    },
)
FM parsing costs more (one model call per page) and slows ingestion materially. Reserve it for documents where layout actually carries meaning — annual reports, scientific papers, regulatory filings.
You can also split the difference: use the KB for retrieval only (retrieve), then run your own prompt construction, reranker, and generation. You get managed ingestion + indexing without giving up prompt control.

Knowledge Bases for Bedrock collapse most of the RAG plumbing into a managed service. The trade-off — as always with managed services — is lower flexibility on the retrieval pipeline. Start with the KB; reach for custom RAG only when an evaluation actually fails because of it.
A few operational notes:
- Tune numberOfResults empirically. The default of 5 is fine for narrow corpora; for broad corpora bump to 8–12 and let the FM filter. Each extra result is more tokens in the prompt and a small latency cost — measure both.
- Filtering on a high-cardinality metadata key (e.g. user_id) on OpenSearch Serverless can fan out into many shards. For tenant isolation prefer one index per tenant or one collection per tenant rather than a metadata filter.
- overrideSearchType=HYBRID blends BM25 with vector similarity. It's almost always a win for queries containing exact identifiers (order numbers, error codes, SKUs) the embedding model has never seen.
- The statistics.documentsFailed count silently grows when source files are corrupt, encrypted, or exceed size limits. Wire a CloudWatch alarm on it.
- retrieve_and_generate sends every retrieved chunk into the FM prompt. Bigger chunks and more results = higher per-query token cost. Watch the $search_results$ token count.

A Knowledge Base is a managed RAG pipeline: it ingests documents from S3 (or web, Confluence, Salesforce, SharePoint), chunks them, embeds with a model like Titan or Cohere, writes vectors to a configured vector store, and exposes Retrieve and RetrieveAndGenerate APIs. AWS handles the ingestion job, retries, status tracking, and incremental sync — you bring the source bucket and pick the embedding model, chunking strategy, and vector store. It eliminates writing your own LangChain ingestion code.
OpenSearch Serverless is the default — fully managed, auto-scales, supports hybrid BM25 + vector, and integrates natively. Aurora PostgreSQL with pgvector is best when you already run Aurora and want SQL joins between vectors and operational data. Pinecone or MongoDB Atlas are options when those are your existing standard. Neptune Analytics fits when retrieval is graph-shaped. For most greenfield workloads, OpenSearch Serverless wins on operational simplicity; for tenant-isolated SaaS, one collection per tenant is usually safer than metadata filtering.
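For the Aurora path, the storage block passed to create_knowledge_base points at the cluster instead of an OpenSearch collection. A hedged sketch (every ARN, name, and column below is a placeholder, and the table with a pgvector column must exist before the KB is created):

rds_storage = {
    "type": "RDS",
    "rdsConfiguration": {
        "resourceArn": "arn:aws:rds:us-west-2:111111111111:cluster:kb-cluster",                # placeholder
        "credentialsSecretArn": "arn:aws:secretsmanager:us-west-2:111111111111:secret:kb-db",  # placeholder
        "databaseName": "kb",
        "tableName": "bedrock_kb_chunks",
        "fieldMapping": {
            "primaryKeyField": "id",
            "vectorField": "embedding",
            "textField": "chunk_text",
            "metadataField": "metadata",
        },
    },
}
# pass storageConfiguration=rds_storage instead of the OPENSEARCH_SERVERLESS block above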
Fixed-size (e.g. 300 tokens with 60-token overlap) is cheapest and the right default for clean prose. Hierarchical chunking embeds both small child chunks (for retrieval precision) and larger parent chunks (returned for context) — better recall on long documents at roughly 2x ingestion cost. Semantic chunking splits at sentence-boundary embedding-similarity drops, preserving topical coherence — most expensive at ingest but best for mixed-topic documents like meeting transcripts or RFCs.
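For reference, a semantic-chunking ingestion block looks roughly like this (values are illustrative; the field names are my reading of the API and worth checking against the current docs):

semantic_ingestion = {
    "chunkingConfiguration": {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": 300,                     # hard cap on any single chunk
            "bufferSize": 1,                      # sentences of surrounding context when comparing
            "breakpointPercentileThreshold": 95,  # split where similarity drops below this percentile
        },
    },
}
# pass vectorIngestionConfiguration=semantic_ingestion to create_data_source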
Advanced parsing routes each page through a foundation model (Claude or Titan multimodal) that reads the page as an image and emits structured Markdown — preserving tables, multi-column layouts, equations, and figure captions that plain text extractors mangle. It costs one FM call per page so it can dwarf the embedding bill on large PDF corpora. Reserve it for layout-heavy documents (financial filings, scientific papers, scanned forms); for clean HTML or plain Markdown, the default parser is fine.
Use Retrieve when you want raw chunks plus scores and need to compose your own prompt — for example to mix KB results with tool-call output, to apply a custom system prompt, or to use a model not supported by RetrieveAndGenerate. Use RetrieveAndGenerate for the standard "answer this with citations" path; AWS builds the prompt, calls the FM, and returns the answer with source attributions. RetrieveAndGenerate is fewer lines of code and gives you citations for free; Retrieve gives you control.
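A sketch of the Retrieve-then-compose path, using the Converse API for the generation step (the system prompt is an assumption, and the model ID reuses the one from earlier in this post):

query = "How many weeks of parental leave do EMEA employees get?"
chunks = runtime.retrieve(
    knowledgeBaseId=kb_id,
    retrievalQuery={"text": query},
)["retrievalResults"]

# Compose the prompt yourself: here, a plain join of the retrieved chunk texts.
context = "\n\n".join(c["content"]["text"] for c in chunks)

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
answer = bedrock.converse(
    modelId="anthropic.claude-opus-4-7",
    system=[{"text": "Answer only from the provided context. If it isn't there, say so."}],
    messages=[{"role": "user", "content": [{"text": f"Context:\n{context}\n\nQuestion: {query}"}]}],
)
print(answer["output"]["message"]["content"][0]["text"])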
Use incremental sync — the ingestion job tracks document checksums in S3 and re-embeds only changed or new files; deleted files are removed from the vector store. Trigger sync on an EventBridge schedule (hourly/daily) or on S3 event notifications via Lambda for near-real-time updates. For high-velocity sources, partition the bucket by recency so each sync scans a smaller prefix. Monitor statistics.documentsFailed and the ingestion job duration in CloudWatch; failed documents silently degrade recall if unwatched.