Azure OpenAI Service is Microsoft's hosted offering of OpenAI's models — GPT-4o, GPT-4o mini, GPT-4.1, o1, o3-mini, DALL·E 3, Whisper, and the text-embedding-3 family — with enterprise controls on top. It is the recommended path for running OpenAI models when you need Azure-native identity (Entra ID), private networking (VNet, Private Endpoints), regional data residency, and contractual guarantees around data handling and content filtering.
You don't call models directly; you create a deployment of a model in your resource and address it by the deployment name (e.g., gpt-4o-prod mapped to gpt-4o-2024-11-20). The REST endpoint has the form https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=2024-10-21. The official openai SDK has an AzureOpenAI client that points at your Azure endpoint.
from openai import AzureOpenAI
import os
client = AzureOpenAI(
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], # https://myco.openai.azure.com/
api_key=os.environ["AZURE_OPENAI_API_KEY"],
api_version="2024-10-21",
)
resp = client.chat.completions.create(
model="gpt-4o-prod", # your DEPLOYMENT name, not the model name
messages=[
{"role": "system", "content": "You are a concise financial analyst."},
{"role": "user", "content": "Summarize Q3 sales trends in 3 bullets."},
],
max_tokens=512,
temperature=0.2,
)
print(resp.choices[0].message.content)
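The REST endpoint shown earlier also works without the SDK. A minimal stdlib-only sketch; the resource name, deployment name, and key handling here are placeholders, and the URL shape follows the pattern above:

```python
import json
import urllib.request

def build_chat_url(resource: str, deployment: str, api_version: str) -> str:
    """Assemble the Azure OpenAI chat-completions URL for one deployment."""
    return (f"https://{resource}.openai.azure.com/openai/"
            f"deployments/{deployment}/chat/completions"
            f"?api-version={api_version}")

def chat_via_rest(resource: str, deployment: str, api_key: str, messages: list) -> dict:
    """POST a chat request; Azure authenticates key calls via the api-key header."""
    req = urllib.request.Request(
        build_chat_url(resource, deployment, "2024-10-21"),
        data=json.dumps({"messages": messages}).encode(),
        headers={"Content-Type": "application/json", "api-key": api_key},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)
```

Note that the deployment name lives in the URL path, not in the request body, which is why the SDK's model= argument takes the deployment name.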
stream = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": "Explain CAP theorem to a new engineer."}],
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)
tools = [{
"type": "function",
"function": {
"name": "get_order_status",
"description": "Look up the shipping status of a customer order by ID.",
"parameters": {
"type": "object",
"properties": {"order_id": {"type": "string"}},
"required": ["order_id"],
},
},
}]
messages = [{"role": "user", "content": "Where is order A-482?"}]
resp = client.chat.completions.create(model="gpt-4o-prod", messages=messages, tools=tools)
msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    # Pretend this calls your order system
    tool_result = {"order_id": "A-482", "status": "In transit, ETA Fri"}
    messages.append(msg)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": str(tool_result),
    })
final = client.chat.completions.create(model="gpt-4o-prod", messages=messages, tools=tools)
print(final.choices[0].message.content)
import base64
with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
resp = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": [
{"type": "text", "text": "What trend does this chart show? Return one sentence."},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
]}],
max_tokens=200,
)
print(resp.choices[0].message.content)
The cleanest production pattern: drop API keys entirely and authenticate with the caller's Azure identity.
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
token_provider = get_bearer_token_provider(
DefaultAzureCredential(),
"https://cognitiveservices.azure.com/.default",
)
client = AzureOpenAI(
azure_endpoint="https://myco.openai.azure.com/",
azure_ad_token_provider=token_provider,
api_version="2024-10-21",
)
resp = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": "Ping"}],
)
print(resp.choices[0].message.content)
Reasoning models spend extra internal "thinking" tokens before responding. They don't accept system messages (use developer role) and use max_completion_tokens instead of max_tokens.
resp = client.chat.completions.create(
model="o3-mini-prod",
messages=[
{"role": "developer", "content": "Think step by step."},
{"role": "user", "content": "A train leaves Chicago at 2pm at 60mph. Another leaves NY at 3pm at 80mph going the opposite way. Distance is 800 miles. When do they meet?"},
],
reasoning_effort="medium", # low | medium | high
max_completion_tokens=4000,
)
print(resp.choices[0].message.content)
vec = client.embeddings.create(
model="embedding-3-large-prod",
input=["Azure OpenAI Service hosts OpenAI models in your Azure tenant."],
dimensions=1536, # optional truncation; default is 3072
).data[0].embedding
print(len(vec), vec[:5])
Azure OpenAI can run RAG against Azure AI Search, Azure Cosmos DB for MongoDB vCore, Azure Blob Storage, or Elasticsearch without you building the retrieval loop.
completion = client.chat.completions.create(
model="gpt-4o-prod",
messages=[{"role": "user", "content": "What is our 2026 parental-leave policy?"}],
extra_body={
"data_sources": [{
"type": "azure_search",
"parameters": {
"endpoint": "https://myco-search.search.windows.net",
"index_name": "hr-policies",
"authentication": {"type": "system_assigned_managed_identity"},
"query_type": "vector_semantic_hybrid",
"embedding_dependency": {
"type": "deployment_name",
"deployment_name": "embedding-3-large-prod",
},
"semantic_configuration": "default",
},
}],
},
)
print(completion.choices[0].message.content)
for ctx in completion.choices[0].message.context.get("citations", []):
print("-", ctx["title"], ctx.get("url"))
img = client.images.generate(
model="dalle3-prod",
prompt="A watercolor illustration of a quiet mountain lake at sunrise.",
size="1024x1024",
quality="hd",
n=1,
)
print(img.data[0].url)
with open("meeting.m4a", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-prod",
        file=f,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )
print(transcript.text)
Submit a JSONL file of requests; Azure processes them asynchronously at half price.
# 1) Upload the JSONL file
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")
# 2) Create the batch job
batch = client.batches.create(
input_file_id=batch_file.id,
endpoint="/chat/completions",
completion_window="24h",
)
print("batch id:", batch.id, "status:", batch.status)
# 3) Poll (or event-drive); when completed, download the output_file_id
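Each line of requests.jsonl is a self-contained request. A sketch of a builder, assuming the Global Batch request shape (custom_id / method / url / body) and a hypothetical batch deployment named gpt-4o-batch:

```python
import json

def batch_line(custom_id: str, deployment: str, user_prompt: str) -> str:
    """One JSONL line in the batch request format. custom_id ties the
    output row back to this request; model is the DEPLOYMENT name."""
    return json.dumps({
        "custom_id": custom_id,
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": deployment,
            "messages": [{"role": "user", "content": user_prompt}],
        },
    })

with open("requests.jsonl", "w") as f:
    for i, q in enumerate(["Summarize doc 1", "Summarize doc 2"]):
        f.write(batch_line(f"task-{i}", "gpt-4o-batch", q) + "\n")
```

Results arrive in the same one-JSON-object-per-line shape, keyed by custom_id, so stable IDs make joining output back to your source data trivial.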
Every call runs through Azure's content filter. Check resp.prompt_filter_results and choices[0].content_filter_results to see per-category scores. Enable Prompt Shields on a deployment to block jailbreak attempts and indirect prompt injections from retrieved documents. Enable Groundedness Detection on RAG responses to flag hallucinations.
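A sketch of pulling flagged categories out of those results; the sample payload below is illustrative (a real one comes from choices[0].content_filter_results, e.g. via resp.model_dump()), and the exact category set depends on your filter configuration:

```python
def flagged_categories(content_filter_results: dict) -> list:
    """Return the categories the filter actually blocked (filtered=True)."""
    return [cat for cat, r in content_filter_results.items()
            if isinstance(r, dict) and r.get("filtered")]

# Illustrative payload in the documented per-category shape.
sample = {
    "hate":      {"filtered": False, "severity": "safe"},
    "self_harm": {"filtered": False, "severity": "safe"},
    "violence":  {"filtered": True,  "severity": "medium"},
}
print(flagged_categories(sample))  # ['violence']
```

Logging this per request gives you an audit trail of what the filter touched, which is useful when tuning severity thresholds on a deployment.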
Prefer text-embedding-3-large with truncated dimensions (e.g., 1024) over the older text-embedding-ada-002; it delivers better retrieval quality per dollar.
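If you truncate embeddings client-side instead of passing the dimensions parameter, re-normalize before computing cosine similarity, since slicing a unit vector breaks its unit norm. A stdlib-only sketch:

```python
import math

def truncate_and_normalize(vec: list, dims: int) -> list:
    """Slice an embedding to dims and rescale to unit L2 norm, mirroring
    what the service does when you pass the dimensions parameter."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def cosine(a: list, b: list) -> float:
    # For unit vectors, the dot product IS the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

v = truncate_and_normalize([0.6, 0.8, 0.1, 0.05], 2)
print(cosine(v, v))  # ≈ 1.0
```

Stored truncated vectors remain comparable to fresh ones requested with the same dimensions value, which is what makes shrinking an existing index safe.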