Flask's built-in flask run / app.run() server is the
Werkzeug development server. It is a single-process server with no
supervision or automatic restarts, is neither hardened nor tuned for sustained
production load, and explicitly prints a warning telling you not to use it in
production. A real production deployment separates concerns into four layers: a WSGI
server that runs the app, a reverse proxy in front of it, a container image that
packages it, and an orchestrator that schedules and scales it.
A typical production stack looks like: client → ALB (TLS) → Nginx
→ Gunicorn (unix socket) → Flask, with Postgres and Redis behind the
app, everything containerized and scheduled by Kubernetes.
Choice of WSGI server is largely driven by workload shape (CPU-bound vs I/O-bound) and whether you need async.
| Server | Worker Model | Best For | Notes |
|---|---|---|---|
| Gunicorn | Pre-fork (sync / gthread / gevent / eventlet) | General-purpose Flask/Django on Linux | De facto Python standard; simple config; excellent signal handling. |
| uWSGI | Pre-fork + threads; emperor mode | Multi-app hosting, tight resource caps | Very feature-rich (200+ options); steep learning curve; dev pace has slowed. |
| Waitress | Single process, thread pool | Windows deployments, simple internal tools | Pure Python, no C deps; cross-platform; lower throughput than Gunicorn. |
| Hypercorn | ASGI (asyncio / trio / uvloop) | HTTP/2, WebSockets, async-native services | Not required for Flask 2+ async views (those run under WSGI); needs an ASGI adapter around Flask; supports HTTP/3 (experimental). |
| mod_wsgi | Embedded in Apache httpd | Legacy Apache shops | Rarely the right choice for new deployments; couples app to Apache lifecycle. |
Rule of thumb: start with Gunicorn + gthread workers.
Switch to gevent only if profiling shows I/O-bound bottlenecks (lots of
downstream HTTP calls, slow DB queries). Reach for Hypercorn only if the codebase is
genuinely async-native.
Drop a gunicorn.conf.py next to your app. Keeping it as Python
(rather than CLI flags) makes it version-controllable and lets you compute values at
startup.
# gunicorn.conf.py
import multiprocessing
import os
# Server socket
bind = os.environ.get("GUNICORN_BIND", "unix:/run/gunicorn/app.sock")
backlog = 2048
# Worker processes
# Rule of thumb: (2 x $num_cores) + 1 for sync / gthread workers.
workers = int(os.environ.get(
"GUNICORN_WORKERS",
(multiprocessing.cpu_count() * 2) + 1,
))
worker_class = os.environ.get("GUNICORN_WORKER_CLASS", "gthread")
threads = int(os.environ.get("GUNICORN_THREADS", 4))
worker_connections = 1000
# Timeouts
timeout = 30 # kill workers that block for >30s
graceful_timeout = 30 # drain window on SIGTERM
keepalive = 5 # behind nginx this is fine; bump to 75 behind an ALB
# Recycle workers to contain slow memory leaks / fragmentation
max_requests = 1000
max_requests_jitter = 100
# Load the app before forking workers so shared code lives in
# copy-on-write memory (saves RAM with many workers).
preload_app = True
# Logging — send everything to stdout/stderr; let the container
# runtime / systemd ship logs to the aggregator.
accesslog = "-"
errorlog = "-"
loglevel = os.environ.get("GUNICORN_LOGLEVEL", "info")
access_log_format = (
'%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s '
'"%(f)s" "%(a)s" %(L)s %({x-request-id}i)s'
)
# Process naming (easier to spot in `ps` / `top`)
proc_name = "flask-app"
# Lifecycle hooks
def post_fork(server, worker):
server.log.info("Worker spawned (pid: %s)", worker.pid)
def worker_int(worker):
worker.log.info("Worker received INT or QUIT signal")
def on_exit(server):
server.log.info("Shutting down master")
Worker class guidance:
- sync — one request per worker at a time. CPU-bound work, ML inference, anything that holds the GIL. Simplest, most predictable.
- gthread — thread pool per worker. Good default for mixed workloads; 4 threads per worker is a reasonable start.
- gevent — cooperative greenlets; huge concurrency for I/O-bound code. Requires monkey-patching and DB drivers that cooperate. Great for proxy-style services, dangerous for CPU-heavy or C-extension code that doesn't release the GIL.
- uvicorn.workers.UvicornWorker — runs an ASGI app under Gunicorn; only relevant if the Flask app is wrapped in an ASGI adapter (plain async def views already work under sync/gthread workers).
Start Gunicorn with:
gunicorn --config gunicorn.conf.py "myapp:create_app()"
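The "myapp:create_app()" target assumes an application-factory layout. A minimal sketch; the module name, layout, and route are illustrative, not prescribed by the rest of this setup:
# myapp/__init__.py (illustrative layout)
from flask import Flask

def create_app() -> Flask:
    app = Flask(__name__)
    # Config, extensions, and blueprints get wired up here.

    @app.get("/healthz")
    def healthz():
        return {"status": "ok"}

    return app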
Nginx terminates TLS, buffers slow clients so Gunicorn workers don't stall, serves static assets directly, and adds security headers. The Flask app never faces the public internet.
# /etc/nginx/sites-available/flask-app
upstream flask_app {
server unix:/run/gunicorn/app.sock fail_timeout=0;
keepalive 32;
}
# HTTP → HTTPS redirect
server {
listen 80;
listen [::]:80;
server_name api.example.com;
return 301 https://$host$request_uri;
}
server {
    listen 443 ssl;
    listen [::]:443 ssl;
    http2 on;
server_name api.example.com;
# TLS
ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
# Request size — tune per endpoint; default low for safety
client_max_body_size 10m;
client_body_timeout 30s;
client_header_timeout 10s;
send_timeout 30s;
# Buffers (protect upstream from slow clients / header floods)
client_body_buffer_size 128k;
client_header_buffer_size 4k;
large_client_header_buffers 4 16k;
# Gzip (brotli is better if module available)
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_proxied any;
gzip_comp_level 6;
gzip_types text/plain text/css application/json application/javascript
text/xml application/xml application/xml+rss text/javascript;
# Security headers
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;
add_header Content-Security-Policy "default-src 'self'; frame-ancestors 'none'" always;
# Static files served directly by nginx
location /static/ {
alias /srv/flask-app/static/;
expires 30d;
add_header Cache-Control "public, immutable";
access_log off;
}
# Health check — do not log to keep access log clean
location = /healthz {
proxy_pass http://flask_app;
access_log off;
}
# Everything else → Gunicorn
location / {
proxy_pass http://flask_app;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Request-ID $request_id;
proxy_set_header Connection "";
proxy_connect_timeout 5s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
proxy_buffering on;
proxy_buffer_size 8k;
proxy_buffers 8 16k;
proxy_busy_buffers_size 32k;
proxy_redirect off;
}
}
Behind Nginx, remember to enable ProxyFix in Flask so
request.remote_addr reflects X-Forwarded-For. Each numeric argument is the number of
trusted proxy hops for that header; with only Nginx in front, 1 is correct:
from werkzeug.middleware.proxy_fix import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1, x_host=1, x_prefix=1)
Multi-stage builds keep the runtime image small and free of compilers. Run as a
non-root user, add a HEALTHCHECK, and order layers for cache hits.
# Dockerfile
# ---- Stage 1: builder ------------------------------------------------
FROM python:3.11-slim AS builder
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential gcc libpq-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
# ---- Stage 2: runtime ------------------------------------------------
FROM python:3.11-slim AS runtime
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PATH="/install/bin:$PATH" \
PYTHONPATH="/install/lib/python3.11/site-packages"
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 curl \
&& rm -rf /var/lib/apt/lists/* \
&& groupadd --system app && useradd --system --gid app --home /app app
COPY --from=builder /install /install
WORKDIR /app
COPY --chown=app:app . .
USER app
# Default to a TCP bind inside the container so the HEALTHCHECK below works;
# the unix-socket default in gunicorn.conf.py is for host installs behind a local Nginx.
ENV GUNICORN_BIND="0.0.0.0:8000"
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD curl -fsS http://127.0.0.1:8000/healthz || exit 1
CMD ["gunicorn", "--config", "gunicorn.conf.py", "myapp:create_app()"]
Companion .dockerignore — cuts build context and prevents secrets
from leaking into layers:
.git
.gitignore
.env
.env.*
.venv
__pycache__
*.pyc
*.pyo
.pytest_cache
.mypy_cache
.coverage
htmlcov/
tests/
docs/
*.md
Dockerfile
docker-compose*.yml
.github/
.vscode/
.idea/
Compose file for dev/staging parity — identical image, real Nginx, real Postgres, real Redis. Good enough to catch 90% of "works on my laptop" issues.
# docker-compose.yml
version: "3.9"
services:
app:
build: .
image: flask-app:local
environment:
FLASK_ENV: production
DATABASE_URL: postgresql://app:app@postgres:5432/app
REDIS_URL: redis://redis:6379/0
SECRET_KEY: ${SECRET_KEY:?SECRET_KEY required}
GUNICORN_BIND: "0.0.0.0:8000"
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
expose:
- "8000"
restart: unless-stopped
nginx:
image: nginx:1.27-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./deploy/nginx.conf:/etc/nginx/conf.d/default.conf:ro
- ./deploy/certs:/etc/letsencrypt:ro
- static:/srv/flask-app/static:ro
depends_on:
- app
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: app
POSTGRES_PASSWORD: app
POSTGRES_DB: app
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app"]
interval: 5s
timeout: 3s
retries: 10
restart: unless-stopped
redis:
image: redis:7-alpine
command: ["redis-server", "--save", "60", "1", "--loglevel", "warning"]
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 10
restart: unless-stopped
volumes:
pgdata:
redisdata:
static:
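Bring the stack up with docker compose up --build; the ${SECRET_KEY:?...} interpolation makes Compose fail fast if the variable is missing from your shell or .env file.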
For anything running at real scale, Kubernetes is the target. Key objects: a
Deployment with rolling update + probes, a Service, an
Ingress, and a HorizontalPodAutoscaler.
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: flask-app
labels: { app: flask-app }
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # zero-downtime; always keep 3 healthy
selector:
matchLabels: { app: flask-app }
template:
metadata:
labels: { app: flask-app }
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
terminationGracePeriodSeconds: 45 # > gunicorn graceful_timeout
containers:
- name: app
image: ghcr.io/acme/flask-app:1.42.0
imagePullPolicy: IfNotPresent
ports:
- containerPort: 8000
name: http
env:
- name: GUNICORN_BIND
value: "0.0.0.0:8000"
- name: DATABASE_URL
valueFrom:
secretKeyRef: { name: flask-app-secrets, key: database_url }
- name: SECRET_KEY
valueFrom:
secretKeyRef: { name: flask-app-secrets, key: secret_key }
resources:
requests: { cpu: "250m", memory: "256Mi" }
limits: { cpu: "1", memory: "512Mi" }
readinessProbe:
httpGet: { path: /healthz/ready, port: http }
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet: { path: /healthz/live, port: http }
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
lifecycle:
preStop:
exec:
# Let the service endpoints controller remove this pod
# from rotation before gunicorn starts shutting down.
command: ["sh", "-c", "sleep 10"]
---
apiVersion: v1
kind: Service
metadata:
name: flask-app
spec:
selector: { app: flask-app }
ports:
- port: 80
targetPort: http
name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: flask-app
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
ingressClassName: nginx
tls:
- hosts: [ api.example.com ]
secretName: flask-app-tls
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: flask-app
port: { name: http }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: flask-app
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: flask-app
minReplicas: 3
maxReplicas: 30
metrics:
- type: Resource
resource:
name: cpu
target: { type: Utilization, averageUtilization: 70 }
- type: Resource
resource:
name: memory
target: { type: Utilization, averageUtilization: 80 }
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 60
Two probes with different semantics matter: /healthz/live
returns 200 while the process is alive (used to restart deadlocked pods), and
/healthz/ready returns 200 only when DB + Redis + downstream deps are
reachable (used to gate traffic).
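A hedged sketch of those two endpoints as a blueprint; myapp.extensions, db, and redis_client are assumed names for handles your project would already have, not something defined elsewhere in this article:
# health.py (sketch)
from flask import Blueprint
from sqlalchemy import text

from myapp.extensions import db, redis_client  # hypothetical app-level handles

bp = Blueprint("health", __name__)

@bp.get("/healthz/live")
def live():
    # Liveness: the worker can execute Python and return a response at all.
    return {"status": "alive"}, 200

@bp.get("/healthz/ready")
def ready():
    # Readiness: hard dependencies are reachable; fail closed on any error
    # so Kubernetes stops routing traffic here until they recover.
    try:
        db.session.execute(text("SELECT 1"))
        redis_client.ping()
    except Exception:
        return {"status": "not ready"}, 503
    return {"status": "ready"}, 200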
Follow the 12-factor rule: all config via environment. Flask config classes keep per-environment defaults versioned; secrets stay out of git.
# config.py
import os
class BaseConfig:
SECRET_KEY = os.environ["SECRET_KEY"] # required
SQLALCHEMY_DATABASE_URI = os.environ["DATABASE_URL"]
SQLALCHEMY_TRACK_MODIFICATIONS = False
SQLALCHEMY_ENGINE_OPTIONS = {
"pool_size": 10,
"max_overflow": 20,
"pool_pre_ping": True,
"pool_recycle": 1800,
}
REDIS_URL = os.environ["REDIS_URL"]
SESSION_COOKIE_SECURE = True
SESSION_COOKIE_HTTPONLY = True
SESSION_COOKIE_SAMESITE = "Lax"
PREFERRED_URL_SCHEME = "https"
class DevConfig(BaseConfig):
DEBUG = True
SESSION_COOKIE_SECURE = False # http on localhost
class ProdConfig(BaseConfig):
DEBUG = False
TESTING = False
def load(app):
env = os.environ.get("FLASK_ENV", "production")
app.config.from_object({"development": DevConfig,
"production": ProdConfig}[env])
Secrets precedence (prod):
- Secret objects (e.g. Kubernetes Secrets) — encrypted at rest with KMS.
- Never: .env files baked into the image, hard-coded constants, or secrets in Git.
- For local dev only: python-dotenv loads .env, which is listed in both .gitignore and .dockerignore.
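A minimal sketch of that dev-only load, assuming python-dotenv is installed and FLASK_ENV distinguishes environments as in config.py above:
# wsgi.py (or the top of create_app)
import os

if os.environ.get("FLASK_ENV", "production") == "development":
    from dotenv import load_dotenv
    load_dotenv()  # reads .env from the working directory; no-op if the file is absent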
Containers log to stdout/stderr as a single JSON object per line. The platform (Docker, Kubernetes, ECS) ships them to CloudWatch / ELK / Loki. Correlation IDs let you stitch a single request across Nginx, Flask, and downstream services.
# logging_setup.py
import logging
import os
import time
import uuid
from flask import g, request
from pythonjsonlogger import jsonlogger
def configure_logging(app):
handler = logging.StreamHandler()
fmt = jsonlogger.JsonFormatter(
"%(asctime)s %(levelname)s %(name)s %(message)s "
"%(request_id)s %(user_id)s %(path)s %(status)s %(duration_ms)s",
rename_fields={"asctime": "ts", "levelname": "level"},
)
handler.setFormatter(fmt)
root = logging.getLogger()
root.handlers = [handler]
root.setLevel(os.environ.get("LOG_LEVEL", "INFO"))
# Quiet noisy libs
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
@app.before_request
def _req_start():
g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
g._t0 = time.monotonic()
@app.after_request
def _req_end(resp):
dur_ms = int((time.monotonic() - g._t0) * 1000)
app.logger.info(
"request",
extra={
"request_id": g.request_id,
"user_id": getattr(g, "user_id", None),
"path": request.path,
"method": request.method,
"status": resp.status_code,
"duration_ms": dur_ms,
},
)
resp.headers["X-Request-ID"] = g.request_id
return resp
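With this wiring, each completed request lands in the aggregator as one JSON object; a rough illustration of the shape (timestamp, logger name, and exact field order are illustrative and depend on the formatter):
{"ts": "2024-05-01 12:00:00,123", "level": "INFO", "name": "myapp",
 "message": "request", "request_id": "8a6f0c0e-...", "user_id": null,
 "path": "/api/orders", "method": "GET", "status": 200, "duration_ms": 12}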
Three pillars, three tools:
- prometheus_flask_exporter exposes /metrics with per-endpoint latency histograms, status-code counters, and in-flight gauges. Scraped by Prometheus, visualized in Grafana.
- OpenTelemetry traces requests through Flask and SQLAlchemy.
- Sentry captures unhandled exceptions, tagged with release and environment.
import os

from prometheus_flask_exporter import PrometheusMetrics
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration
def configure_observability(app):
metrics = PrometheusMetrics(app, group_by="endpoint")
metrics.info("app_info", "Flask app", version=os.environ.get("APP_VERSION", "dev"))
    FlaskInstrumentor().instrument_app(app)
    # Flask-SQLAlchemy 3.x only exposes the engine inside an app context.
    with app.app_context():
        SQLAlchemyInstrumentor().instrument(
            engine=app.extensions["sqlalchemy"].engine
        )
sentry_sdk.init(
dsn=os.environ.get("SENTRY_DSN"),
integrations=[FlaskIntegration()],
traces_sample_rate=0.05,
profiles_sample_rate=0.01,
environment=os.environ.get("FLASK_ENV", "production"),
release=os.environ.get("APP_VERSION"),
)
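The conservative traces_sample_rate and profiles_sample_rate keep tracing and profiling overhead (and Sentry ingestion costs) low; raise them temporarily when chasing a specific latency or error pattern.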
- SECRET_KEY — 32+ random bytes, rotated at least annually. Rotation invalidates sessions; use the SECRET_KEY_FALLBACKS list (Flask 2.3+) to roll over without locking everyone out.
- CSRFProtect for any cookie-authenticated form endpoint. Not needed for pure JSON APIs using bearer tokens.
- SESSION_COOKIE_SECURE=True, HTTPONLY=True, SAMESITE="Lax" (or "Strict").
- pip-audit and safety in CI; fail the build on known CVEs. Add Dependabot / Renovate for automated PRs.

Flask-Migrate wraps Alembic. The hard part is not the tool — it's making migrations safe to run while the old version of the app is still serving traffic.
Expand-contract (three deploys per breaking change):
1. Expand: add the new column or table (nullable=True, no default for large tables — fill via backfill job). Deploy app that writes to both old and new.
2. Migrate: backfill the remaining rows, then deploy app that reads from the new column while still writing both.
3. Contract: once nothing references the old column, drop it in a final migration and deploy.

For Postgres: use CREATE INDEX CONCURRENTLY, avoid ALTER TABLE ... ADD COLUMN NOT NULL DEFAULT on large tables (before Postgres 11 this rewrites the whole table), and set lock_timeout on migration sessions so a stuck migration doesn't freeze production.
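A hedged Alembic sketch applying both precautions; the revision identifiers, table, and index names are purely illustrative:
# versions/a1b2c3d4e5f6_add_orders_email_idx.py (illustrative revision)
from alembic import op

revision = "a1b2c3d4e5f6"  # illustrative
down_revision = None

def upgrade():
    # Fail fast instead of queueing behind long-running transactions.
    op.execute("SET lock_timeout = '5s'")
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
    with op.get_context().autocommit_block():
        op.create_index(
            "ix_orders_email",
            "orders",
            ["email"],
            postgresql_concurrently=True,
        )

def downgrade():
    with op.get_context().autocommit_block():
        op.drop_index(
            "ix_orders_email",
            table_name="orders",
            postgresql_concurrently=True,
        )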
Gunicorn handles SIGTERM by closing the listening socket, stopping new request acceptance, and giving workers graceful_timeout seconds to finish in-flight requests. Getting this right in Kubernetes requires coordinating:
- terminationGracePeriodSeconds (pod) > graceful_timeout (gunicorn) > p99 request duration. E.g. 45s / 30s / 10s.
- A preStop hook that sleeps long enough for the endpoints controller to remove the pod from the Service's endpoint list before Gunicorn starts rejecting connections. 10–15 seconds is usually sufficient.
- Drain windows at each hop (ALB deregistration_delay, Nginx worker_shutdown_timeout).

preload_app=True tradeoff — saves memory via copy-on-write but means workers share module-level state. Anything that opens a network connection at import time (DB pools, Kafka producers) must be re-opened in post_fork hooks, otherwise forked workers share the parent's socket and you get bizarre cross-talk bugs.

Vertical — more CPU / RAM per pod, more Gunicorn workers. Hits diminishing returns: the GIL, DB connection pool exhaustion, and NUMA effects all cap per-pod throughput. Rule of thumb: 2–4 CPU per pod, then scale out.
Horizontal — more pods behind the Service. Scales linearly until the database becomes the bottleneck. Plan for this: read replicas, connection pooler (PgBouncer in transaction mode), cache-aside with Redis, materialized views for heavy reads.
Sessions — do not use Flask's default client-side signed
cookie for sessions of any real size. Move to server-side sessions backed by Redis
via Flask-Session; pods become stateless and any pod can handle any
request (no sticky sessions required).
import os
from datetime import timedelta

import redis
from flask_session import Session
app.config.update(
SESSION_TYPE="redis",
SESSION_REDIS=redis.from_url(os.environ["REDIS_URL"]),
SESSION_USE_SIGNER=True,
SESSION_KEY_PREFIX="sess:",
PERMANENT_SESSION_LIFETIME=timedelta(hours=8),
)
Session(app)
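With this in place the cookie carries only the signed session id; the session payload lives in Redis under the sess: prefix, so any pod can serve any request and sessions survive pod restarts for as long as Redis retains them.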
Key tuning knobs:
- Worker count: start at (2 × CPU) + 1; load-test with wrk or k6 and tune. Too few → queueing latency; too many → context-switch overhead and memory bloat. Watch gunicorn_requests_duration_seconds p95 and in-flight saturation.
- Worker class and threads: gthread with 4–8 threads gives cheap concurrency for CPU-lite endpoints without the debugging complexity of gevent's monkey-patching.
- Database connections: pool_size × worker count must not exceed Postgres max_connections. With 4 pods × 9 workers × 10 pool = 360 connections — put PgBouncer in front if max_connections is tight.
- Keepalive: set keepalive_timeout higher than the ALB's idle timeout (default 60s) to avoid 502s from race conditions on connection close.

A well-tuned Flask pod on 1 vCPU / 512MB with gthread workers can comfortably handle 200–500 req/s at sub-50ms p95 for DB-backed JSON endpoints. When you need more, scale horizontally first — it is almost always cheaper than chasing micro-optimizations.