Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI companies available through a single API. It lets you build and scale generative AI applications without managing infrastructure, choosing among models from Anthropic (Claude), Meta (Llama), Mistral AI, Cohere, AI21 Labs, Stability AI, and Amazon's own Titan and Nova families.
This page is the overview. The deeper-dive topics live on dedicated pages: Guardrails (including the standalone ApplyGuardrail API) and the InvokeModel / Converse runtime APIs, which let you benchmark and swap models without rewriting application code.
Bedrock does not host OpenAI models (GPT-4, GPT-4o, o1); those are available only through OpenAI's own API or Azure OpenAI Service. The example below uses Claude on Bedrock, and the OpenAI comparison that follows shows the equivalent call against OpenAI's own SDK.
The Converse API normalizes messages across providers so the same client code works for Claude, Llama, Mistral, Cohere, and Nova — swap the modelId without rewriting the call.
import boto3

# Runtime client for inference; Claude 3.5 Sonnet is available in us-west-2.
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Summarize Q3 sales trends in 3 bullets."}]}
    ],
    system=[{"text": "You are a concise financial analyst."}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
For reference — if you need GPT-4o or o1, call OpenAI or Azure OpenAI directly. Note the different SDK and message shape.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise financial analyst."},
        {"role": "user", "content": "Summarize Q3 sales trends in 3 bullets."},
    ],
    max_tokens=512,
    temperature=0.2,
)
print(resp.choices[0].message.content)
Many teams run a multi-provider stack: Bedrock for Claude/Llama/Mistral/Titan inside the AWS boundary (VPC, IAM, KMS, CloudTrail), and OpenAI/Azure OpenAI for GPT when a specific capability is needed. Libraries like LiteLLM or LangChain abstract the two behind a shared interface. To enforce the same safety policy across both providers, use the standalone ApplyGuardrail API from Bedrock Guardrails.
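A sketch of that cross-provider enforcement: ApplyGuardrail evaluates plain text against an existing guardrail, independently of which model produced it. The guardrail identifier and version below are hypothetical placeholders; substitute values from your own account.

import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

def passes_guardrail(text: str) -> bool:
    """Return True if the guardrail allows the text, False if it intervened."""
    resp = runtime.apply_guardrail(
        guardrailIdentifier="gr-example123",  # placeholder: your guardrail ID
        guardrailVersion="1",                 # placeholder: your guardrail version
        source="OUTPUT",  # evaluating model output; use "INPUT" for user prompts
        content=[{"text": {"text": text}}],
    )
    return resp["action"] != "GUARDRAIL_INTERVENED"

# Works on a completion from Bedrock or OpenAI alike, since it only sees text:
completion_text = "...text returned by Claude or GPT-4o..."
if not passes_guardrail(completion_text):
    print("Blocked by guardrail policy")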
Amazon Bedrock is the primary AWS entry point for generative AI — it collapses model selection, RAG, agents, and safety into a single service so teams can focus on the application rather than the ML platform.
Bedrock is a managed multi-model gateway that exposes foundation models from Anthropic, Meta, Mistral, Cohere, AI21, Stability, and Amazon Titan behind a single SDK with IAM authentication, VPC endpoints, KMS encryption, and CloudTrail logging. Unlike calling a vendor API directly, requests never leave your AWS account boundary, billing is unified on the AWS invoice, and the same IAM policies that protect S3 or DynamoDB also gate model access. You also get bundled features — Knowledge Bases, Agents, Guardrails — without integrating multiple vendors.
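A quick way to see that multi-provider catalog behind the single SDK is the control-plane client's list_foundation_models call (output varies by region and by which models your account has been granted access to):

import boto3

# "bedrock" is the control-plane client (catalog, jobs);
# "bedrock-runtime" is the one used for inference.
bedrock = boto3.client("bedrock", region_name="us-west-2")

for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["providerName"], model["modelId"])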
Converse is a model-agnostic chat interface — the same request shape works across Claude, Llama, Mistral, and Titan, so you can swap models with a single string change. InvokeModel requires you to format the body in each provider's native schema (Anthropic Messages format, Llama prompt template, etc.). Converse also standardizes tool use, system prompts, streaming, and multi-turn history. Use Converse for new applications; reach for InvokeModel only when you need a provider-specific feature Converse hasn't yet exposed.
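For contrast, here is roughly the same Claude call through InvokeModel, where you hand-build the provider-native request body yourself (Anthropic's Messages schema, including the Bedrock-specific anthropic_version field):

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

# Provider-native body: this shape only works for Anthropic models.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "system": "You are a concise financial analyst.",
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "Summarize Q3 sales trends in 3 bullets."}]}
    ],
}
resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(body),
)
# Response parsing is also provider-specific.
print(json.loads(resp["body"].read())["content"][0]["text"])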
Choose Bedrock when the workload must stay inside an AWS account for compliance (HIPAA, FedRAMP, GovCloud), when you need Claude or Llama rather than GPT, when IAM/KMS/VPC integration matters, or when you want one bill for compute, storage, and inference. Choose OpenAI/Azure when you specifically need GPT-4o, o-series reasoning models, the Assistants API, or Azure-native enterprise features. Many production stacks run both behind a LiteLLM or LangChain abstraction and route by capability and cost.
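A minimal sketch of that capability-based routing with LiteLLM (model IDs are illustrative; LiteLLM reads AWS and OpenAI credentials from the environment):

from litellm import completion  # pip install litellm

def ask(prompt: str, need_gpt: bool = False) -> str:
    # Route by capability: GPT-4o via OpenAI, otherwise Claude via Bedrock.
    model = "gpt-4o" if need_gpt else "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0"
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(ask("Summarize Q3 sales trends in 3 bullets."))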
On-demand pricing charges per input/output token with no commitment but is subject to per-account quotas (TPM and RPM) that can throttle bursty workloads. Provisioned Throughput reserves dedicated model capacity in Model Units billed hourly with a 1-month or 6-month commitment — useful for high-volume production traffic, custom-fine-tuned models (which require provisioned throughput), or when you need predictable latency. The break-even is typically when sustained traffic approaches the throughput a single MU delivers; below that, on-demand is cheaper.
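To make the break-even concrete, a back-of-envelope comparison; every price below is a hypothetical placeholder, so substitute the published rates for your model, region, and commitment term:

# Hypothetical prices -- substitute real published rates.
INPUT_PER_1K = 0.003   # $ per 1K input tokens, on-demand
OUTPUT_PER_1K = 0.015  # $ per 1K output tokens, on-demand
MU_HOURLY = 40.0       # $ per Model Unit per hour, with commitment

def on_demand_monthly(input_tok_per_min: float, output_tok_per_min: float) -> float:
    minutes = 60 * 24 * 30
    return minutes * (input_tok_per_min / 1000 * INPUT_PER_1K
                      + output_tok_per_min / 1000 * OUTPUT_PER_1K)

provisioned_monthly = MU_HOURLY * 24 * 30  # one MU, always on

# Sustained 20K input + 5K output tokens/min: on-demand still wins here,
# so provisioned only pays off at much higher sustained throughput.
print(f"on-demand: ${on_demand_monthly(20_000, 5_000):,.0f}/mo "
      f"vs provisioned: ${provisioned_monthly:,.0f}/mo")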
Each model has per-region TPM and RPM quotas that you raise through Service Quotas. Not every model is in every region; Claude Sonnet 4 may only be in us-east-1, us-west-2, and a few EU regions, so multi-region failover requires checking the model availability matrix. Use cross-region inference profiles (CRIS) to spread load across regions transparently and absorb single-region throttling. Monitor the InvocationThrottles metric in CloudWatch and handle ThrottlingException responses with exponential backoff plus jitter, as sketched below; for hard SLAs, switch the affected route to provisioned throughput.
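A minimal retry sketch for that backoff pattern, wrapping the Converse call from earlier (the attempt limit and sleep bounds are illustrative):

import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def converse_with_backoff(max_attempts: int = 5, **kwargs):
    for attempt in range(max_attempts):
        try:
            return bedrock.converse(**kwargs)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # only retry throttling, not auth/validation errors
            # Exponential backoff with full jitter: sleep 0..2^attempt seconds.
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("Throttled on every attempt; consider provisioned throughput")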
Tag every invocation with the tenant ID via STS session tags on the assumed role, and enable model invocation logging to S3/CloudWatch so each request is auditable. Use a thin gateway Lambda or ECS service that injects tenant context, enforces per-tenant rate limits in DynamoDB or ElastiCache, and routes to the appropriate guardrail. For cost attribution, parse the invocation logs (input/output token counts) into a usage table and apply the published per-token price; AWS Cost Explorer alone won't split Bedrock spend by tenant. Pair with a separate Knowledge Base per tenant or a metadata-filtered single KB, depending on data-isolation requirements.
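A sketch of that cost-attribution step, assuming the gateway has already flattened invocation-log records into dicts with a tenant ID and token counts; the field names and prices are assumptions to verify against your own logs and rate card:

from collections import defaultdict

# Hypothetical per-1K-token prices; substitute published rates per model.
PRICES = {"anthropic.claude-3-5-sonnet-20240620-v1:0": (0.003, 0.015)}

def tenant_costs(records):
    """Aggregate spend per tenant from pre-parsed usage records."""
    totals = defaultdict(float)
    for r in records:
        in_price, out_price = PRICES[r["modelId"]]
        totals[r["tenantId"]] += (r["inputTokenCount"] / 1000 * in_price
                                  + r["outputTokenCount"] / 1000 * out_price)
    return dict(totals)

print(tenant_costs([
    {"tenantId": "acme", "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
     "inputTokenCount": 1200, "outputTokenCount": 300},
]))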