Guardrails for Amazon Bedrock is a policy layer that screens both the user prompt and the model completion against rules you define — denied topics, content categories, banned words, sensitive information, and contextual grounding for RAG. A guardrail can be attached to any Bedrock model invocation, or invoked standalone via ApplyGuardrail against any text — including completions from non-Bedrock providers such as OpenAI. The point is to define "what the assistant must never do" once, centrally, instead of re-implementing it in every prompt.
A guardrail bundles five filter families. Each is independently configurable; you can mix and match. All filters apply to both prompts (input) and responses (output) by default — disable per direction when it makes sense.
Block entire conversational topics defined in natural language plus example phrases. The classifier is a small dedicated model — far more flexible than a regex.
"topicPolicyConfig": {
"topicsConfig": [
{
"name": "Investment Advice",
"definition": "Personalized recommendations on what securities, funds, or "
"crypto assets a user should buy, sell, or hold.",
"examples": [
"Should I sell my AAPL shares?",
"What's a good ETF to buy right now?",
"Is bitcoin a good long-term hold?",
],
"type": "DENY",
}
]
}
Six harm categories, each with a strength dial: NONE, LOW, MEDIUM, HIGH. Stronger settings catch more borderline cases at the cost of more false positives.
"contentPolicyConfig": {
"filtersConfig": [
{"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "INSULTS", "inputStrength": "MEDIUM", "outputStrength": "MEDIUM"},
{"type": "SEXUAL", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "MISCONDUCT", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
]
}
Two layers: a managed Profanity list and a custom word list. The custom list is the right place for brand-safety terms (competitor names, internal codenames, product names you don't want hallucinated).
"wordPolicyConfig": {
"wordsConfig": [{"text": "ProjectAtlas"}, {"text": "CompetitorCorp"}],
"managedWordListsConfig": [{"type": "PROFANITY"}],
}
Detects 30+ entity types (SSN, credit card, phone, email, names, addresses, IP addresses, plus AWS-specific ones like access keys). Each entity gets an action: BLOCK (refuse the request outright) or ANONYMIZE (replace the value with a placeholder like {EMAIL} before it reaches the model or the user).
"sensitiveInformationPolicyConfig": {
"piiEntitiesConfig": [
{"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
{"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
{"type": "EMAIL", "action": "ANONYMIZE"},
{"type": "PHONE", "action": "ANONYMIZE"},
{"type": "NAME", "action": "ANONYMIZE"},
{"type": "AWS_ACCESS_KEY", "action": "BLOCK"},
],
"regexesConfig": [{
"name": "InternalTicketId",
"description": "Internal ticket identifiers like TKT-12345.",
"pattern": "TKT-\\d{4,8}",
"action": "ANONYMIZE",
}],
}
Use ANONYMIZE when the model still needs the structure of the message but not the literal value (e.g. "send an email to {EMAIL}" still parses). Use BLOCK for things that should never reach the model at all.
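As a concrete illustration of the difference, here is a minimal sketch using the ApplyGuardrail API (covered in detail later in this section) against a guardrail that anonymizes EMAIL; the guardrail ID and version are placeholders, and the comments describe the expected behavior under that assumption.
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-west-2")

# Hypothetical guardrail with EMAIL -> ANONYMIZE and US_SOCIAL_SECURITY_NUMBER -> BLOCK.
check = runtime.apply_guardrail(
    guardrailIdentifier="a1b2c3d4e5f6",  # placeholder ID
    guardrailVersion="1",
    source="INPUT",
    content=[{"text": {"text": "Please email jane.doe@example.com the updated invoice."}}],
)
print(check["action"])              # "GUARDRAIL_INTERVENED" when a filter fires
print(check["outputs"][0]["text"])  # masked text, e.g. "Please email {EMAIL} the updated invoice."
An SSN in the same message would trip the BLOCK action instead, and outputs would carry the blockedInputMessaging string rather than a masked rewrite.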
For RAG flows: after generation, the guardrail scores how well the answer is grounded in the retrieved context and how relevant it is to the user query. Below either threshold, the response is blocked.
"contextualGroundingPolicyConfig": {
"filtersConfig": [
{"type": "GROUNDING", "threshold": 0.75}, # answer must be supported by context
{"type": "RELEVANCE", "threshold": 0.70}, # answer must address the query
]
}
To use this, pass the retrieved context to the Converse call as guardContent blocks; the guardrail compares the response against those blocks. Pair this with Knowledge Bases for end-to-end hallucination control.
import boto3
bedrock = boto3.client("bedrock", region_name="us-west-2")
resp = bedrock.create_guardrail(
name="support-bot-guardrail",
description="Default guardrail for the customer-support assistant.",
blockedInputMessaging ="I can't help with that request.",
blockedOutputsMessaging="I can't share that information.",
topicPolicyConfig={
"topicsConfig": [{
"name": "Investment Advice",
"definition": "Personalized recommendations on securities, funds, or crypto.",
"examples": ["Should I buy AAPL?", "Is BTC a good long hold?"],
"type": "DENY",
}]
},
contentPolicyConfig={"filtersConfig": [
{"type": "HATE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "VIOLENCE", "inputStrength": "HIGH", "outputStrength": "HIGH"},
{"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"},
]},
sensitiveInformationPolicyConfig={"piiEntitiesConfig": [
{"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
{"type": "CREDIT_DEBIT_CARD_NUMBER", "action": "BLOCK"},
{"type": "EMAIL", "action": "ANONYMIZE"},
]},
contextualGroundingPolicyConfig={"filtersConfig": [
{"type": "GROUNDING", "threshold": 0.75},
{"type": "RELEVANCE", "threshold": 0.70},
]},
)
guardrail_id = resp["guardrailId"]
print("Created", guardrail_id, "version", resp["version"])
Pass guardrailConfig to any Bedrock runtime call. The guardrail runs on the input first; if blocked, the model is never called. It runs again on the output before returning.
runtime = boto3.client("bedrock-runtime", region_name="us-west-2")
resp = runtime.converse(
modelId="anthropic.claude-opus-4-7",
messages=[{"role": "user", "content": [{"text": "My SSN is 123-45-6789, can you help with my account?"}]}],
guardrailConfig={
"guardrailIdentifier": "gr-pii-strict",
"guardrailVersion": "3",
"trace": "enabled",
},
)
print("Stop:", resp["stopReason"]) # 'guardrail_intervened' on a block
print(resp["output"]["message"]["content"][0]["text"])
Wrap the retrieved context in guardContent blocks so the contextual-grounding filter knows what the answer is supposed to be grounded in:
runtime.converse(
modelId="anthropic.claude-opus-4-7",
messages=[{"role": "user", "content": [
{"guardContent": {"text": {
"text": "Source: 2026 leave policy. EMEA staff receive 18 weeks of paid parental leave.",
"qualifiers": ["grounding_source"],
}}},
{"text": "How many weeks of parental leave do EMEA employees get?"},
]}],
guardrailConfig={
"guardrailIdentifier": guardrail_id,
"guardrailVersion": "DRAFT",
"trace": "enabled",
},
)
ApplyGuardrail runs a guardrail against arbitrary text without invoking a model. Use it to screen completions from non-Bedrock models (OpenAI, Azure, a self-hosted Llama), to validate user-generated content before storage, or to gate any text crossing a trust boundary.
import boto3
from openai import OpenAI
openai = OpenAI()
runtime = boto3.client("bedrock-runtime", region_name="us-west-2")
# 1. Screen the user prompt against the guardrail
prompt = "Tell me how to bypass my company's expense-policy approvals."
input_check = runtime.apply_guardrail(
guardrailIdentifier=guardrail_id,
guardrailVersion="3",
source="INPUT",
content=[{"text": {"text": prompt}}],
)
if input_check["action"] == "GUARDRAIL_INTERVENED":
print("Blocked at input:", input_check["assessments"])
raise SystemExit
# 2. Call OpenAI (or any non-Bedrock model)
gpt = openai.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
# 3. Screen the model output before returning to the user
output_check = runtime.apply_guardrail(
guardrailIdentifier=guardrail_id,
guardrailVersion="3",
source="OUTPUT",
content=[{"text": {"text": gpt}}],
)
if output_check["action"] == "GUARDRAIL_INTERVENED":
final = output_check["outputs"][0]["text"] # the blockedOutputsMessaging string
else:
final = gpt
print(final)
This is the building block for "BYO model + AWS-native safety": the model runs anywhere; the guardrail runs in your AWS account with a complete CloudTrail audit log of every intervention.
With trace: "enabled", the response includes assessments showing exactly which filter triggered and at what confidence. Log these for tuning and incident review.
trace = resp.get("trace", {}).get("guardrail", {})
# inputAssessment maps guardrail ID -> assessment;
# outputAssessments maps guardrail ID -> list of assessments.
assessments = list(trace.get("inputAssessment", {}).values())
for per_guardrail in trace.get("outputAssessments", {}).values():
    assessments.extend(per_guardrail)
for asm in assessments:
    # Topic policy
    for t in asm.get("topicPolicy", {}).get("topics", []):
        print(f"TOPIC {t['name']}: {t['action']}")
    # Content policy
    for f in asm.get("contentPolicy", {}).get("filters", []):
        print(f"CONTENT {f['type']}: {f['action']} confidence={f['confidence']}")
    # PII
    for p in asm.get("sensitiveInformationPolicy", {}).get("piiEntities", []):
        print(f"PII {p['type']}: {p['action']} match='{p['match']}'")
    # Grounding
    for g in asm.get("contextualGroundingPolicy", {}).get("filters", []):
        print(f"GROUNDING {g['type']}: {g['action']} score={g['score']} threshold={g['threshold']}")
Guardrails are versioned the same way as Lambda: a DRAFT you can mutate, plus immutable numbered versions you can pin in production. Always pin a numeric version in production callers — never reference DRAFT from a live application, or a guardrail-author's edit can ship to prod accidentally.
# Edit the draft
bedrock.update_guardrail(guardrailIdentifier=guardrail_id, name="support-bot-guardrail", ...)
# Cut a new immutable version
v = bedrock.create_guardrail_version(
guardrailIdentifier=guardrail_id,
description="Tightened HATE input strength to HIGH after 2026-Q2 incident review.",
)
print("Pin this in callers:", v["version"])
On the Bedrock side, the differentiators are the standalone ApplyGuardrail API, native CloudTrail and CloudWatch integration, and pinning via versioned IDs; it also ties cleanly into Bedrock IAM policies. OpenAI's Moderation endpoint, by contrast, is classification only (hate/threatening, sexual/minors sub-categories): no PII redaction, no grounding check, no denied-topic concept — best as a cheap first line of defense, not a full guardrail layer. If you're multi-cloud, the practical pattern is: pick one guardrail surface as the canonical one (whichever cloud hosts most of your inference), then apply it to all model output via the standalone endpoints (ApplyGuardrail, Azure Content Safety REST). That way one team owns "the policy" and every channel enforces the same rules.
guardrailVersion: "3", not "DRAFT". Cut a new version (with a description that explains why) every time you change behavior, the same way you'd cut a release.assessments block to CloudWatch Logs Insights or a separate "guardrail events" S3 prefix. False positives are how you tune; false negatives are how you patch policy gaps.blockedInputMessaging and blockedOutputsMessaging to something product-appropriate, possibly with a deflection ("...try the support form at /help").ApplyGuardrail has its own per-second TPS quota separate from model TPS — track it independently.Six categories: content filters (hate, insults, sexual, violence, misconduct, prompt-attack) with NONE/LOW/MEDIUM/HIGH thresholds for input and output independently; denied topics defined by name plus natural-language definition and few-shot examples; word filters with the managed Profanity list and a custom-words list; sensitive-information filters for PII (BLOCK or MASK) plus custom regex; contextual grounding for RAG that scores answer faithfulness against retrieved sources; and image content filters for multimodal inputs.
ApplyGuardrail is a standalone synchronous API that evaluates arbitrary text against a guardrail without invoking a model. It lets you enforce the same policy across non-Bedrock providers (OpenAI, Azure OpenAI, self-hosted vLLM), pre-screen inputs before paying for an expensive model call, or post-screen outputs from any source. This is the key to running a multi-provider stack with a single AWS-native policy — the guardrail becomes a portable contract instead of being locked to InvokeModel.
Mask (action ANONYMIZE) when the downstream task can still proceed with a redacted value — e.g., summarizing a support ticket where the email address can be replaced with {EMAIL}. Block (action BLOCK) when the presence of PII means the request shouldn't be served at all — e.g., a public chatbot receiving an SSN should refuse and log the event for review. A common pattern is mask on input (let the model work) and block on output (never let real PII leak back to a user).
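One way to implement that split with nothing but the documented BLOCK/ANONYMIZE actions is two guardrails — a masking one screening inputs and a blocking one screening outputs — applied at each boundary via ApplyGuardrail. The IDs below are hypothetical and the runtime client is the one created earlier; this is a sketch, not the only way to get the effect.
# Hypothetical guardrail IDs: the first masks PII, the second blocks it.
INPUT_GUARDRAIL = ("in1234abcd", "1")    # EMAIL/NAME/PHONE -> ANONYMIZE
OUTPUT_GUARDRAIL = ("out5678efgh", "1")  # same entities -> BLOCK

def screen_input(text):
    gid, ver = INPUT_GUARDRAIL
    r = runtime.apply_guardrail(guardrailIdentifier=gid, guardrailVersion=ver,
                                source="INPUT", content=[{"text": {"text": text}}])
    # If the guardrail rewrote the text (masking), send the rewritten version to the model.
    if r["action"] == "GUARDRAIL_INTERVENED" and r.get("outputs"):
        return r["outputs"][0]["text"]
    return text

def screen_output(text):
    gid, ver = OUTPUT_GUARDRAIL
    r = runtime.apply_guardrail(guardrailIdentifier=gid, guardrailVersion=ver,
                                source="OUTPUT", content=[{"text": {"text": text}}])
    # On a block, outputs[0] holds the blockedOutputsMessaging string.
    if r["action"] == "GUARDRAIL_INTERVENED":
        return r["outputs"][0]["text"]
    return text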
Contextual grounding scores each generated response against the retrieved source documents on two axes: grounding (is every claim supported by the sources?) and relevance (does the answer address the user's question?). You set a 0–1 threshold; responses below it are blocked or flagged. It's specifically a RAG safety net that catches hallucinations the content filters wouldn't — the model can produce a polite, non-toxic, totally fabricated answer, and only grounding will catch that. Pair it with KB retrieval and a score-threshold guard for layered defense.
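The same check can be run standalone, without Converse, by qualifying ApplyGuardrail content blocks — grounding_source for the retrieved passages, query for the user question, guard_content for the candidate answer. The sketch below reuses guardrail_id and runtime from earlier; the grounding_source qualifier appears in the Converse example above, but treat the other qualifier names as assumptions to confirm against the current API reference.
grounding_check = runtime.apply_guardrail(
    guardrailIdentifier=guardrail_id,
    guardrailVersion="DRAFT",
    source="OUTPUT",
    content=[
        # Retrieved passage(s) the answer must be supported by.
        {"text": {"text": "Source: 2026 leave policy. EMEA staff receive 18 weeks of paid parental leave.",
                  "qualifiers": ["grounding_source"]}},
        # The user question the answer must stay relevant to.
        {"text": {"text": "How many weeks of parental leave do EMEA employees get?",
                  "qualifiers": ["query"]}},
        # The candidate answer (from any model) being scored.
        {"text": {"text": "EMEA employees get 26 weeks of parental leave.",
                  "qualifiers": ["guard_content"]}},
    ],
)
for asm in grounding_check["assessments"]:
    for g in asm.get("contextualGroundingPolicy", {}).get("filters", []):
        print(g["type"], "score:", g["score"], "action:", g["action"])  # a low GROUNDING score triggers BLOCKED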
All three cover hate, sexual, violence, and self-harm classification. Bedrock adds first-class denied-topic definitions, PII redaction, contextual grounding for RAG, and the standalone ApplyGuardrail API; Azure AI Content Safety adds Prompt Shields and Protected Material detection; OpenAI Moderation is free and minimal — just classification. For an AWS-centric stack, Bedrock Guardrails plus a single policy is simplest. For multi-cloud, run Azure Content Safety and Bedrock Guardrails in parallel via a thin gateway and union the results.
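A thin gateway that unions the two verdicts could look like the sketch below; the Azure side is stubbed out because the exact Content Safety client code depends on your SDK, and all function names here are hypothetical.
def azure_flags(text: str) -> bool:
    """Placeholder: call Azure AI Content Safety text analysis here and return True
    if any category exceeds your severity threshold."""
    raise NotImplementedError  # hypothetical stub

def bedrock_flags(text: str, source: str = "OUTPUT") -> bool:
    r = runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id, guardrailVersion="3",
        source=source, content=[{"text": {"text": text}}],
    )
    return r["action"] == "GUARDRAIL_INTERVENED"

def allowed(text: str) -> bool:
    # Union the verdicts: either provider flagging the text is enough to stop it.
    return not (bedrock_flags(text) or azure_flags(text))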
Treat guardrails as code — store the JSON definition in git, deploy via Terraform or CDK, and version each guardrail (DRAFT plus immutable numbered versions). Maintain a fixture set of "must block" and "must pass" prompts per surface and run them in CI against the candidate version before promoting; treat false-positive and false-negative rates as SLOs. Stream InvokeModel CloudWatch metrics by guardrail outcome (blocked input, blocked output, intervention count) and review weekly; tune topic examples first (high impact, low collateral damage) and content-filter strength last (blunter instrument). Different surfaces (kids' app vs. internal coding assistant) get different guardrail aliases.
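A minimal CI gate along those lines — assuming pytest and a "must block"/"must pass" fixtures file you maintain yourself; the IDs, version, and path are illustrative — might be:
import json
import boto3
import pytest

runtime = boto3.client("bedrock-runtime", region_name="us-west-2")
GUARDRAIL_ID = "a1b2c3d4e5f6"   # hypothetical; injected from CI in practice
CANDIDATE_VERSION = "4"         # the version under evaluation, not yet pinned

with open("guardrail_fixtures.json") as f:  # {"must_block": [...], "must_pass": [...]}
    FIXTURES = json.load(f)

def _intervened(text, source="INPUT"):
    r = runtime.apply_guardrail(
        guardrailIdentifier=GUARDRAIL_ID, guardrailVersion=CANDIDATE_VERSION,
        source=source, content=[{"text": {"text": text}}],
    )
    return r["action"] == "GUARDRAIL_INTERVENED"

@pytest.mark.parametrize("prompt", FIXTURES["must_block"])
def test_must_block(prompt):
    assert _intervened(prompt), f"false negative: {prompt!r} was not blocked"

@pytest.mark.parametrize("prompt", FIXTURES["must_pass"])
def test_must_pass(prompt):
    assert not _intervened(prompt), f"false positive: {prompt!r} was blocked"
Promote the candidate version (and pin it in callers) only when both fixture suites pass and the observed false-positive rate stays inside your SLO.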