Amazon Bedrock is a fully managed service that makes foundation models (FMs) from leading AI companies available through a single API. It lets you build and scale generative AI applications without managing infrastructure, choosing among models from Anthropic (Claude), Meta (Llama), Mistral AI, Cohere, AI21 Labs, Stability AI, and Amazon's own Titan and Nova families.
This page is the overview. The deeper-dive topics live on dedicated pages: Guardrails (including the standalone ApplyGuardrail API) and the InvokeModel / Converse runtime APIs, which let you benchmark and swap models without rewriting application code.
Bedrock does not host OpenAI models (GPT-4, GPT-4o, o1); those are available only through OpenAI's own API or Azure OpenAI Service. The example below uses Claude on Bedrock, and the OpenAI comparison that follows shows the equivalent call against OpenAI's own SDK.
The Converse API normalizes messages across providers so the same client code works for Claude, Llama, Mistral, Cohere, and Nova — swap the modelId without rewriting the call.
import boto3

# Runtime client for inference; Claude 3.5 Sonnet is available in us-west-2.
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Summarize Q3 sales trends in 3 bullets."}]}
    ],
    system=[{"text": "You are a concise financial analyst."}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
For reference — if you need GPT-4o or o1, call OpenAI or Azure OpenAI directly. Note the different SDK and message shape.
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise financial analyst."},
        {"role": "user", "content": "Summarize Q3 sales trends in 3 bullets."},
    ],
    max_tokens=512,
    temperature=0.2,
)
print(resp.choices[0].message.content)
Many teams run a multi-provider stack: Bedrock for Claude/Llama/Mistral/Titan inside the AWS boundary (VPC, IAM, KMS, CloudTrail), and OpenAI/Azure OpenAI for GPT when a specific capability is needed. Libraries like LiteLLM or LangChain abstract the two behind a shared interface. To enforce the same safety policy across both providers, use the standalone ApplyGuardrail API from Bedrock Guardrails.
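A sketch of that cross-provider enforcement: ApplyGuardrail evaluates plain text against an existing guardrail, independently of which model produced it. The guardrail identifier and version below are hypothetical placeholders; substitute values from your own account.

import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

def passes_guardrail(text: str) -> bool:
    """Return True if the guardrail allows the text, False if it intervened."""
    resp = runtime.apply_guardrail(
        guardrailIdentifier="gr-example123",  # placeholder: your guardrail ID
        guardrailVersion="1",                 # placeholder: your guardrail version
        source="OUTPUT",  # evaluating model output; use "INPUT" for user prompts
        content=[{"text": {"text": text}}],
    )
    return resp["action"] != "GUARDRAIL_INTERVENED"

# Works on a completion from Bedrock or OpenAI alike, since it only sees text:
completion_text = "...text returned by Claude or GPT-4o..."
if not passes_guardrail(completion_text):
    print("Blocked by guardrail policy")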
Amazon Bedrock is the primary AWS entry point for generative AI — it collapses model selection, RAG, agents, and safety into a single service so teams can focus on the application rather than the ML platform.
Bedrock is a managed multi-model gateway that exposes foundation models from Anthropic, Meta, Mistral, Cohere, AI21, Stability, and Amazon Titan behind a single SDK with IAM authentication, VPC endpoints, KMS encryption, and CloudTrail logging. Unlike calling a vendor API directly, requests never leave your AWS account boundary, billing is unified on the AWS invoice, and the same IAM policies that protect S3 or DynamoDB also gate model access. You also get bundled features — Knowledge Bases, Agents, Guardrails — without integrating multiple vendors.
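A quick way to see that multi-provider catalog behind the single SDK is the control-plane client's list_foundation_models call (output varies by region and by which models your account has been granted access to):

import boto3

# "bedrock" is the control-plane client (catalog, jobs);
# "bedrock-runtime" is the one used for inference.
bedrock = boto3.client("bedrock", region_name="us-west-2")

for model in bedrock.list_foundation_models()["modelSummaries"]:
    print(model["providerName"], model["modelId"])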
Converse is a model-agnostic chat interface — the same request shape works across Claude, Llama, Mistral, and Titan, so you can swap models with a single string change. InvokeModel requires you to format the body in each provider's native schema (Anthropic Messages format, Llama prompt template, etc.). Converse also standardizes tool use, system prompts, streaming, and multi-turn history. Use Converse for new applications; reach for InvokeModel only when you need a provider-specific feature Converse hasn't yet exposed.
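For contrast, here is roughly the same Claude call through InvokeModel, where you hand-build the provider-native request body yourself (Anthropic's Messages schema, including the Bedrock-specific anthropic_version field):

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

# Provider-native body: this shape only works for Anthropic models.
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "system": "You are a concise financial analyst.",
    "messages": [
        {"role": "user",
         "content": [{"type": "text", "text": "Summarize Q3 sales trends in 3 bullets."}]}
    ],
}
resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps(body),
)
# Response parsing is also provider-specific.
print(json.loads(resp["body"].read())["content"][0]["text"])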
Choose Bedrock when the workload must stay inside an AWS account for compliance (HIPAA, FedRAMP, GovCloud), when you need Claude or Llama rather than GPT, when IAM/KMS/VPC integration matters, or when you want one bill for compute, storage, and inference. Choose OpenAI/Azure when you specifically need GPT-4o, o-series reasoning models, the Assistants API, or Azure-native enterprise features. Many production stacks run both behind a LiteLLM or LangChain abstraction and route by capability and cost.
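A minimal sketch of that capability-based routing with LiteLLM (model IDs are illustrative; LiteLLM reads AWS and OpenAI credentials from the environment):

from litellm import completion  # pip install litellm

def ask(prompt: str, need_gpt: bool = False) -> str:
    # Route by capability: GPT-4o via OpenAI, otherwise Claude via Bedrock.
    model = "gpt-4o" if need_gpt else "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0"
    resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

print(ask("Summarize Q3 sales trends in 3 bullets."))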
On-demand pricing charges per input/output token with no commitment but is subject to per-account quotas (TPM and RPM) that can throttle bursty workloads. Provisioned Throughput reserves dedicated model capacity in Model Units billed hourly with a 1-month or 6-month commitment — useful for high-volume production traffic, custom-fine-tuned models (which require provisioned throughput), or when you need predictable latency. The break-even is typically when sustained traffic approaches the throughput a single MU delivers; below that, on-demand is cheaper.
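To make the break-even concrete, a back-of-envelope comparison; every price below is a hypothetical placeholder, so substitute the published rates for your model, region, and commitment term:

# Hypothetical prices -- substitute real published rates.
INPUT_PER_1K = 0.003   # $ per 1K input tokens, on-demand
OUTPUT_PER_1K = 0.015  # $ per 1K output tokens, on-demand
MU_HOURLY = 40.0       # $ per Model Unit per hour, with commitment

def on_demand_monthly(input_tok_per_min: float, output_tok_per_min: float) -> float:
    minutes = 60 * 24 * 30
    return minutes * (input_tok_per_min / 1000 * INPUT_PER_1K
                      + output_tok_per_min / 1000 * OUTPUT_PER_1K)

provisioned_monthly = MU_HOURLY * 24 * 30  # one MU, always on

# Sustained 20K input + 5K output tokens/min: on-demand still wins here,
# so provisioned only pays off at much higher sustained throughput.
print(f"on-demand: ${on_demand_monthly(20_000, 5_000):,.0f}/mo "
      f"vs provisioned: ${provisioned_monthly:,.0f}/mo")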
Each model has per-region TPM and RPM quotas that you raise through Service Quotas. Not every model is in every region; Claude Sonnet 4 may only be in us-east-1, us-west-2, and a few EU regions, so multi-region failover requires checking the model availability matrix. Use cross-region inference profiles (CRIS) to spread load across regions transparently and absorb single-region throttling. Monitor the InvocationThrottles metric in CloudWatch and handle ThrottlingException responses with exponential backoff plus jitter, as sketched below; for hard SLAs, switch the affected route to provisioned throughput.
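A minimal retry sketch for that backoff pattern, wrapping the Converse call from earlier (the attempt limit and sleep bounds are illustrative):

import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

def converse_with_backoff(max_attempts: int = 5, **kwargs):
    for attempt in range(max_attempts):
        try:
            return bedrock.converse(**kwargs)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # only retry throttling, not auth/validation errors
            # Exponential backoff with full jitter: sleep 0..2^attempt seconds.
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError("Throttled on every attempt; consider provisioned throughput")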
Tag every invocation with the tenant ID via STS session tags on the assumed role, and enable model invocation logging to S3/CloudWatch so each request is auditable. Use a thin gateway Lambda or ECS service that injects tenant context, enforces per-tenant rate limits in DynamoDB or ElastiCache, and routes to the appropriate guardrail. For cost attribution, parse the invocation logs (input/output token counts) into a usage table and apply the published per-token price; AWS Cost Explorer alone won't split Bedrock spend by tenant. Pair with a separate Knowledge Base per tenant or a metadata-filtered single KB, depending on data-isolation requirements.
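A sketch of that cost-attribution step, assuming the gateway has already flattened invocation-log records into dicts with a tenant ID and token counts; the field names and prices are assumptions to verify against your own logs and rate card:

from collections import defaultdict

# Hypothetical per-1K-token prices; substitute published rates per model.
PRICES = {"anthropic.claude-3-5-sonnet-20240620-v1:0": (0.003, 0.015)}

def tenant_costs(records):
    """Aggregate spend per tenant from pre-parsed usage records."""
    totals = defaultdict(float)
    for r in records:
        in_price, out_price = PRICES[r["modelId"]]
        totals[r["tenantId"]] += (r["inputTokenCount"] / 1000 * in_price
                                  + r["outputTokenCount"] / 1000 * out_price)
    return dict(totals)

print(tenant_costs([
    {"tenantId": "acme", "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
     "inputTokenCount": 1200, "outputTokenCount": 300},
]))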