Azure OpenAI Service is Microsoft's hosted offering of OpenAI's models — GPT-4o, GPT-4o mini, GPT-4.1, o1, o3-mini, DALL·E 3, Whisper, and the text-embedding-3 family — with enterprise controls on top. It is the recommended path for running OpenAI models when you need Azure-native identity (Entra ID), private networking (VNet, Private Endpoints), regional data residency, and contractual guarantees around data handling and content filtering.
You don't call models directly; you create a deployment of a model in your resource and address it by the deployment name (e.g., gpt-4o-prod mapped to gpt-4o-2024-11-20). The REST endpoint has the form https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-10-21. The official openai SDK has an AzureOpenAI client that points at your Azure endpoint.
from openai import AzureOpenAI
import os
client = AzureOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], # https://myco.openai.azure.com/
api_key=os.environ["AZURE_OPENAI_API_KEY"],
api_version="2024-10-21",
)
resp = client.chat.completions.create(
model="gpt-4o-prod", # your DEPLOYMENT name, not the model name
messages=[
{"role": "system", "content": "You are a concise financial analyst."},
{"role": "user", "content": "Summarize Q3 sales trends in 3 bullets."},
],
max_tokens=512,
temperature=0.2,
)
print(resp.choices[0].message.content)
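The REST endpoint shown earlier also works without the SDK. A minimal stdlib-only sketch; the resource name, deployment name, and key handling here are placeholders, and the URL shape follows the pattern above:

```python
import json
import urllib.request

def build_chat_url(resource: str, deployment: str, api_version: str) -> str:
    """Assemble the Azure OpenAI chat-completions URL for one deployment."""
    return (f"https://{resource}.openai.azure.com/openai/"
            f"deployments/{deployment}/chat/completions"
            f"?api-version={api_version}")

def chat_via_rest(resource: str, deployment: str, api_key: str, messages: list) -> dict:
    """POST a chat request; Azure authenticates key calls via the api-key header."""
    req = urllib.request.Request(
        build_chat_url(resource, deployment, "2024-10-21"),
        data=json.dumps({"messages": messages}).encode(),
        headers={"Content-Type": "application/json", "api-key": api_key},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

Note that the deployment name lives in the URL path, not in the request body, which is why the SDK's model= argument takes the deployment name.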
stream = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": "Explain CAP theorem to a new engineer."}],
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
tools = [{
"type": "function",
"function": {
"name": "get_order_status",
"description": "Look up the shipping status of a customer order by ID.",
"parameters": {
"type": "object",
"properties": {"order_id": {"type": "string"}},
"required": ["order_id"],
},
},
}]
messages = [{"role": "user", "content": "Where is order A-482?"}]
resp = client.chat.completions.create(model="gpt-4o-prod", messages=messages, tools=tools)
msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    # Pretend this calls your order system
    tool_result = {"order_id": "A-482", "status": "In transit, ETA Fri"}
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": str(tool_result),
    })
final = client.chat.completions.create(model="gpt-4o-prod", messages=messages, tools=tools)
print(final.choices[0].message.content)
import base64
with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
resp = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": [
{"type": "text", "text": "What trend does this chart show? Return one sentence."},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
]}],
max_tokens=200,
)
print(resp.choices[0].message.content)
The cleanest production pattern: drop API keys entirely and authenticate with the caller's Azure identity.
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider = get_bearer_token_provider(
DefaultAzureCredential(),
"https://cognitiveservices.azure.com/.default",
)
client = AzureOpenAI(
azure_endpoint="https://myco.openai.azure.com/",
azure_ad_token_provider=token_provider,
api_version="2024-10-21",
)
resp = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": "Ping"}],
)
print(resp.choices[0].message.content)
Reasoning models spend extra internal "thinking" tokens before responding. They don't accept system messages (use developer role) and use max_completion_tokens instead of max_tokens.
resp = client.chat.completions.create(
model="o3-mini-prod",
messages=[
{"role": "developer", "content": "Think step by step."},
{"role": "user", "content": "A train leaves Chicago at 2pm at 60mph. Another leaves NY at 3pm at 80mph going the opposite way. Distance is 800 miles. When do they meet?"},
],
reasoning_effort="medium", # low | medium | high
max_completion_tokens=4000,
)
print(resp.choices[0].message.content)
vec = client.embeddings.create(
model="embedding-3-large-prod",
input=["Azure OpenAI Service hosts OpenAI models in your Azure tenant."],
dimensions=1536, # optional truncation; default is 3072
).data[0].embedding
print(len(vec), vec[:5])
Azure OpenAI can run RAG against Azure AI Search, Azure Cosmos DB for MongoDB vCore, Azure Blob Storage, or Elasticsearch without you building the retrieval loop.
completion = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": "What is our 2026 parental-leave policy?"}],
extra_body={
"data_sources": [{
"type": "azure_search",
"parameters": {
"endpoint": "https://myco-search.search.windows.net",
"index_name": "hr-policies",
"authentication": {"type": "system_assigned_managed_identity"},
"query_type": "vector_semantic_hybrid",
"embedding_dependency": {
"type": "deployment_name",
"deployment_name": "embedding-3-large-prod",
},
"semantic_configuration": "default",
},
}],
},
)
print(completion.choices[0].message.content)
for ctx in completion.choices[0].message.context.get("citations", []):
print("-", ctx["title"], ctx.get("url"))
img = client.images.generate(
model="dalle3-prod",
prompt="A watercolor illustration of a quiet mountain lake at sunrise.",
size="1024x1024",
quality="hd",
n=1,
)
print(img.data[0].url)
with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-prod",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )
print(transcript.text)
Submit a JSONL file of requests; Azure processes them asynchronously at half price.
# 1) Upload the JSONL file
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
# 2) Create the batch job
batch = client.batches.create(
input_file_id=batch_file.id,
endpoint="/chat/completions",
completion_window="24h",
)
print("batch id:", batch.id, "status:", batch.status)
# 3) Poll (or event-drive); when completed, download the output_file_id
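Each line of requests.jsonl is a self-contained request. A sketch of a builder, assuming the Global Batch request shape (custom_id / method / url / body) and a hypothetical batch deployment named gpt-4o-batch:

```python
import json

def batch_line(custom_id: str, deployment: str, user_prompt: str) -> str:
    """One JSONL line in the batch request format. custom_id ties the
    output row back to this request; model is the DEPLOYMENT name."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": deployment,
            "messages": [{"role": "user", "content": user_prompt}],
        },
    })

with open("requests.jsonl", "w") as f:
    for i, q in enumerate(["Summarize doc 1", "Summarize doc 2"]):
        f.write(batch_line(f"task-{i}", "gpt-4o-batch", q) + "\n")
```

Results arrive in the same one-JSON-object-per-line shape, keyed by custom_id, so stable IDs make joining output back to your source data trivial.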
Every call runs through Azure's content filter. Check resp.prompt_filter_results and choices[0].content_filter_results to see per-category scores. Enable Prompt Shields on a deployment to block jailbreak attempts and indirect prompt injections from retrieved documents. Enable Groundedness Detection on RAG responses to flag hallucinations.
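A sketch of pulling flagged categories out of those results; the sample payload below is illustrative (a real one comes from choices[0].content_filter_results, e.g. via resp.model_dump()), and the exact category set depends on your filter configuration:

```python
def flagged_categories(content_filter_results: dict) -> list:
    """Return the categories the filter actually blocked (filtered=True)."""
    return [cat for cat, r in content_filter_results.items()
            if isinstance(r, dict) and r.get("filtered")]

# Illustrative payload in the documented per-category shape.
sample = {
    "hate":      {"filtered": False, "severity": "safe"},
    "self_harm": {"filtered": False, "severity": "safe"},
    "violence":  {"filtered": True,  "severity": "medium"},
}
print(flagged_categories(sample))  # ['violence']
```

Logging this per request gives you an audit trail of what the filter touched, which is useful when tuning severity thresholds on a deployment.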
Prefer text-embedding-3-large with truncated dimensions (e.g., 1024) over the older text-embedding-ada-002; it delivers better retrieval quality per dollar.
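If you truncate embeddings client-side instead of passing the dimensions parameter, re-normalize before computing cosine similarity, since slicing a unit vector breaks its unit norm. A stdlib-only sketch:

```python
import math

def truncate_and_normalize(vec: list, dims: int) -> list:
    """Slice an embedding to dims and rescale to unit L2 norm, mirroring
    what the service does when you pass the dimensions parameter."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a: list, b: list) -> float:
    # For unit vectors, the dot product IS the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

v = truncate_and_normalize([0.6, 0.8, 0.1, 0.05], 2)
print(cosine(v, v))  # ≈ 1.0
```

Stored truncated vectors remain comparable to fresh ones requested with the same dimensions value, which is what makes shrinking an existing index safe.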