Flask's built-in flask run / app.run() server is the
Werkzeug development server. It is a single-process server with no
supervision or automatic restarts, is neither hardened nor tuned for sustained
production load, and explicitly prints a warning telling you not to use it in
production. A real production deployment separates concerns into four layers: a WSGI
server that runs the app, a reverse proxy in front of it, a container image that
packages it, and an orchestrator that schedules and scales it.
A typical production stack looks like: client → ALB (TLS) → Nginx
→ Gunicorn (unix socket) → Flask, with Postgres and Redis behind the
app, everything containerized and scheduled by Kubernetes.
Choice of WSGI server is largely driven by workload shape (CPU-bound vs I/O-bound) and whether you need async.
| Server | Worker Model | Best For | Notes |
|---|---|---|---|
| Gunicorn | Pre-fork (sync / gthread / gevent / eventlet) | General-purpose Flask/Django on Linux | De facto Python standard; simple config; excellent signal handling. |
| uWSGI | Pre-fork + threads; emperor mode | Multi-app hosting, tight resource caps | Very feature-rich (200+ options); steep learning curve; dev pace has slowed. |
| Waitress | Single process, thread pool | Windows deployments, simple internal tools | Pure Python, no C deps; cross-platform; lower throughput than Gunicorn. |
| Hypercorn | ASGI (asyncio / trio / uvloop) | HTTP/2, WebSockets, async-native services | Not required for Flask 2+ async views (those run under WSGI); needs an ASGI adapter around Flask; supports HTTP/3 (experimental). |
| mod_wsgi | Embedded in Apache httpd | Legacy Apache shops | Rarely the right choice for new deployments; couples app to Apache lifecycle. |
Rule of thumb: start with Gunicorn + gthread workers.
Switch to gevent only if profiling shows I/O-bound bottlenecks (lots of
downstream HTTP calls, slow DB queries). Reach for Hypercorn only if the codebase is
genuinely async-native.
Drop a gunicorn.conf.py next to your app. Keeping it as Python
(rather than CLI flags) makes it version-controllable and lets you compute values at
startup.
# gunicorn.conf.py
import multiprocessing
import os
# Server socket
bind = os.environ.get("GUNICORN_BIND", "unix:/run/gunicorn/app.sock")
backlog = 2048
# Worker processes
# Rule of thumb: (2 x $num_cores) + 1 for sync / gthread workers.
workers = int(os.environ.get(
"GUNICORN_WORKERS",
(multiprocessing.cpu_count() * 2) + 1,
))
worker_class = os.environ.get("GUNICORN_WORKER_CLASS", "gthread")
threads = int(os.environ.get("GUNICORN_THREADS", 4))
worker_connections = 1000
# Timeouts
timeout = 30 # kill workers that block for >30s
graceful_timeout = 30 # drain window on SIGTERM
keepalive = 5 # behind nginx this is fine; bump to 75 behind an ALB
# Recycle workers to contain slow memory leaks / fragmentation
max_requests = 1000
max_requests_jitter = 100
# Load the app before forking workers so shared code lives in
# copy-on-write memory (saves RAM with many workers).
preload_app = True
# Logging — send everything to stdout/stderr; let the container
# runtime / systemd ship logs to the aggregator.
accesslog = "-"
errorlog = "-"
loglevel = os.environ.get("GUNICORN_LOGLEVEL", "info")
access_log_format = (
'%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s '
'"%(f)s" "%(a)s" %(L)s %({x-request-id}i)s'
)
# Process naming (easier to spot in `ps` / `top`)
proc_name = "flask-app"
# Lifecycle hooks
def post_fork(server, worker):
server.log.info("Worker spawned (pid: %s)", worker.pid)
def worker_int(worker):
worker.log.info("Worker received INT or QUIT signal")
def on_exit(server):
server.log.info("Shutting down master")
Worker class guidance:
- sync — one request per worker at a time. CPU-bound work, ML inference, anything that holds the GIL. Simplest, most predictable.
- gthread — thread pool per worker. Good default for mixed workloads; 4 threads per worker is a reasonable start.
- gevent — cooperative greenlets; huge concurrency for I/O-bound code. Requires monkey-patching and DB drivers that cooperate. Great for proxy-style services, dangerous for CPU-heavy or C-extension code that doesn't release the GIL.
- uvicorn.workers.UvicornWorker — runs an ASGI app under Gunicorn; only relevant if the Flask app is wrapped in an ASGI adapter (plain async def views already work under sync/gthread workers).
Start Gunicorn with:
gunicorn --config gunicorn.conf.py "myapp:create_app()"
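The "myapp:create_app()" target assumes an application-factory layout. A minimal sketch; the module name, layout, and route are illustrative, not prescribed by the rest of this setup:
# myapp/__init__.py (illustrative layout)
from flask import Flask

def create_app() -> Flask:
    app = Flask(__name__)
    # Config, extensions, and blueprints get wired up here.

    @app.get("/healthz")
    def healthz():
        return {"status": "ok"}

    return app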
Nginx terminates TLS, buffers slow clients so Gunicorn workers don't stall, serves static assets directly, and adds security headers. The Flask app never faces the public internet.
# /etc/nginx/sites-available/flask-app
upstream flask_app {
server unix:/run/gunicorn/app.sock fail_timeout=0;
keepalive 32;
}
# HTTP → HTTPS redirect
server {
listen 80;
listen [::]:80;
server_name api.example.com;
return 301 https://$host$request_uri;
}
server {
    listen 443 ssl;
    listen [::]:443 ssl;
    http2 on;
server_name api.example.com;
# TLS
ssl_certificate /etc/letsencrypt/live/api.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/api.example.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 10m;
# Request size — tune per endpoint; default low for safety
client_max_body_size 10m;
client_body_timeout 30s;
client_header_timeout 10s;
send_timeout 30s;
# Buffers (protect upstream from slow clients / header floods)
client_body_buffer_size 128k;
client_header_buffer_size 4k;
large_client_header_buffers 4 16k;
# Gzip (brotli is better if module available)
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_proxied any;
gzip_comp_level 6;
gzip_types text/plain text/css application/json application/javascript
text/xml application/xml application/xml+rss text/javascript;
# Security headers
add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "geolocation=(), microphone=(), camera=()" always;
add_header Content-Security-Policy "default-src 'self'; frame-ancestors 'none'" always;
# Static files served directly by nginx
location /static/ {
alias /srv/flask-app/static/;
expires 30d;
add_header Cache-Control "public, immutable";
access_log off;
}
# Health check — do not log to keep access log clean
location = /healthz {
proxy_pass http://flask_app;
access_log off;
}
# Everything else → Gunicorn
location / {
proxy_pass http://flask_app;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header X-Request-ID $request_id;
proxy_set_header Connection "";
proxy_connect_timeout 5s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
proxy_buffering on;
proxy_buffer_size 8k;
proxy_buffers 8 16k;
proxy_busy_buffers_size 32k;
proxy_redirect off;
}
}
Behind Nginx, remember to enable ProxyFix in Flask so
request.remote_addr reflects X-Forwarded-For. Each numeric argument is the number of
trusted proxy hops for that header; with only Nginx in front, 1 is correct:
from werkzeug.middleware.proxy_fix import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app, x_for=1, x_proto=1, x_host=1, x_prefix=1)
Multi-stage builds keep the runtime image small and free of compilers. Run as a
non-root user, add a HEALTHCHECK, and order layers for cache hits.
# Dockerfile
# ---- Stage 1: builder ------------------------------------------------
FROM python:3.11-slim AS builder
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential gcc libpq-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt
# ---- Stage 2: runtime ------------------------------------------------
FROM python:3.11-slim AS runtime
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PATH="/install/bin:$PATH" \
PYTHONPATH="/install/lib/python3.11/site-packages"
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 curl \
&& rm -rf /var/lib/apt/lists/* \
&& groupadd --system app && useradd --system --gid app --home /app app
COPY --from=builder /install /install
WORKDIR /app
COPY --chown=app:app . .
USER app
# Default to a TCP bind inside the container so the HEALTHCHECK below works;
# the unix-socket default in gunicorn.conf.py is for host installs behind a local Nginx.
ENV GUNICORN_BIND="0.0.0.0:8000"
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD curl -fsS http://127.0.0.1:8000/healthz || exit 1
CMD ["gunicorn", "--config", "gunicorn.conf.py", "myapp:create_app()"]
Companion .dockerignore — cuts build context and prevents secrets
from leaking into layers:
.git
.gitignore
.env
.env.*
.venv
__pycache__
*.pyc
*.pyo
.pytest_cache
.mypy_cache
.coverage
htmlcov/
tests/
docs/
*.md
Dockerfile
docker-compose*.yml
.github/
.vscode/
.idea/
Compose file for dev/staging parity — identical image, real Nginx, real Postgres, real Redis. Good enough to catch 90% of "works on my laptop" issues.
# docker-compose.yml
version: "3.9"
services:
app:
build: .
image: flask-app:local
environment:
FLASK_ENV: production
DATABASE_URL: postgresql://app:app@postgres:5432/app
REDIS_URL: redis://redis:6379/0
SECRET_KEY: ${SECRET_KEY:?SECRET_KEY required}
GUNICORN_BIND: "0.0.0.0:8000"
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_healthy
expose:
- "8000"
restart: unless-stopped
nginx:
image: nginx:1.27-alpine
ports:
- "80:80"
- "443:443"
volumes:
- ./deploy/nginx.conf:/etc/nginx/conf.d/default.conf:ro
- ./deploy/certs:/etc/letsencrypt:ro
- static:/srv/flask-app/static:ro
depends_on:
- app
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_USER: app
POSTGRES_PASSWORD: app
POSTGRES_DB: app
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U app"]
interval: 5s
timeout: 3s
retries: 10
restart: unless-stopped
redis:
image: redis:7-alpine
command: ["redis-server", "--save", "60", "1", "--loglevel", "warning"]
volumes:
- redisdata:/data
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 3s
retries: 10
restart: unless-stopped
volumes:
pgdata:
redisdata:
static:
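Bring the stack up with docker compose up --build; the ${SECRET_KEY:?...} interpolation makes Compose fail fast if the variable is missing from your shell or .env file.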
For anything running at real scale, Kubernetes is the target. Key objects: a
Deployment with rolling update + probes, a Service, an
Ingress, and a HorizontalPodAutoscaler.
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: flask-app
labels: { app: flask-app }
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # zero-downtime; always keep 3 healthy
selector:
matchLabels: { app: flask-app }
template:
metadata:
labels: { app: flask-app }
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8000"
prometheus.io/path: "/metrics"
spec:
terminationGracePeriodSeconds: 45 # > gunicorn graceful_timeout
containers:
- name: app
image: ghcr.io/acme/flask-app:1.42.0
imagePullPolicy: IfNotPresent
ports:
- containerPort: 8000
name: http
env:
- name: GUNICORN_BIND
value: "0.0.0.0:8000"
- name: DATABASE_URL
valueFrom:
secretKeyRef: { name: flask-app-secrets, key: database_url }
- name: SECRET_KEY
valueFrom:
secretKeyRef: { name: flask-app-secrets, key: secret_key }
resources:
requests: { cpu: "250m", memory: "256Mi" }
limits: { cpu: "1", memory: "512Mi" }
readinessProbe:
httpGet: { path: /healthz/ready, port: http }
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet: { path: /healthz/live, port: http }
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
lifecycle:
preStop:
exec:
# Let the service endpoints controller remove this pod
# from rotation before gunicorn starts shutting down.
command: ["sh", "-c", "sleep 10"]
---
apiVersion: v1
kind: Service
metadata:
name: flask-app
spec:
selector: { app: flask-app }
ports:
- port: 80
targetPort: http
name: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: flask-app
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
ingressClassName: nginx
tls:
- hosts: [ api.example.com ]
secretName: flask-app-tls
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: flask-app
port: { name: http }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: flask-app
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: flask-app
minReplicas: 3
maxReplicas: 30
metrics:
- type: Resource
resource:
name: cpu
target: { type: Utilization, averageUtilization: 70 }
- type: Resource
resource:
name: memory
target: { type: Utilization, averageUtilization: 80 }
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 60
Two probes with different semantics matter: /healthz/live
returns 200 while the process is alive (used to restart deadlocked pods), and
/healthz/ready returns 200 only when DB + Redis + downstream deps are
reachable (used to gate traffic).
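A hedged sketch of those two endpoints as a blueprint; myapp.extensions, db, and redis_client are assumed names for handles your project would already have, not something defined elsewhere in this article:
# health.py (sketch)
from flask import Blueprint
from sqlalchemy import text

from myapp.extensions import db, redis_client  # hypothetical app-level handles

bp = Blueprint("health", __name__)

@bp.get("/healthz/live")
def live():
    # Liveness: the worker can execute Python and return a response at all.
    return {"status": "alive"}, 200

@bp.get("/healthz/ready")
def ready():
    # Readiness: hard dependencies are reachable; fail closed on any error
    # so Kubernetes stops routing traffic here until they recover.
    try:
        db.session.execute(text("SELECT 1"))
        redis_client.ping()
    except Exception:
        return {"status": "not ready"}, 503
    return {"status": "ready"}, 200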
Follow the 12-factor rule: all config via environment. Flask config classes keep per-environment defaults versioned; secrets stay out of git.
# config.py
import os
class BaseConfig:
SECRET_KEY = os.environ["SECRET_KEY"] # required
SQLALCHEMY_DATABASE_URI = os.environ["DATABASE_URL"]
SQLALCHEMY_TRACK_MODIFICATIONS = False
SQLALCHEMY_ENGINE_OPTIONS = {
"pool_size": 10,
"max_overflow": 20,
"pool_pre_ping": True,
"pool_recycle": 1800,
}
REDIS_URL = os.environ["REDIS_URL"]
SESSION_COOKIE_SECURE = True
SESSION_COOKIE_HTTPONLY = True
SESSION_COOKIE_SAMESITE = "Lax"
PREFERRED_URL_SCHEME = "https"
class DevConfig(BaseConfig):
DEBUG = True
SESSION_COOKIE_SECURE = False # http on localhost
class ProdConfig(BaseConfig):
DEBUG = False
TESTING = False
def load(app):
env = os.environ.get("FLASK_ENV", "production")
app.config.from_object({"development": DevConfig,
"production": ProdConfig}[env])
Secrets precedence (prod):
- Secret objects (e.g. Kubernetes Secrets) — encrypted at rest with KMS.
- Never: .env files baked into the image, hard-coded constants, or secrets in Git.
- For local dev only: python-dotenv loads .env, which is listed in both .gitignore and .dockerignore.
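A minimal sketch of that dev-only load, assuming python-dotenv is installed and FLASK_ENV distinguishes environments as in config.py above:
# wsgi.py (or the top of create_app)
import os

if os.environ.get("FLASK_ENV", "production") == "development":
    from dotenv import load_dotenv
    load_dotenv()  # reads .env from the working directory; no-op if the file is absent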
Containers log to stdout/stderr as a single JSON object per line. The platform (Docker, Kubernetes, ECS) ships them to CloudWatch / ELK / Loki. Correlation IDs let you stitch a single request across Nginx, Flask, and downstream services.
# logging_setup.py
import logging
import os
import time
import uuid
from flask import g, request
from pythonjsonlogger import jsonlogger
def configure_logging(app):
handler = logging.StreamHandler()
fmt = jsonlogger.JsonFormatter(
"%(asctime)s %(levelname)s %(name)s %(message)s "
"%(request_id)s %(user_id)s %(path)s %(status)s %(duration_ms)s",
rename_fields={"asctime": "ts", "levelname": "level"},
)
handler.setFormatter(fmt)
root = logging.getLogger()
root.handlers = [handler]
root.setLevel(os.environ.get("LOG_LEVEL", "INFO"))
# Quiet noisy libs
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("botocore").setLevel(logging.WARNING)
@app.before_request
def _req_start():
g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
g._t0 = time.monotonic()
@app.after_request
def _req_end(resp):
dur_ms = int((time.monotonic() - g._t0) * 1000)
app.logger.info(
"request",
extra={
"request_id": g.request_id,
"user_id": getattr(g, "user_id", None),
"path": request.path,
"method": request.method,
"status": resp.status_code,
"duration_ms": dur_ms,
},
)
resp.headers["X-Request-ID"] = g.request_id
return resp
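With this wiring, each completed request lands in the aggregator as one JSON object; a rough illustration of the shape (timestamp, logger name, and exact field order are illustrative and depend on the formatter):
{"ts": "2024-05-01 12:00:00,123", "level": "INFO", "name": "myapp",
 "message": "request", "request_id": "8a6f0c0e-...", "user_id": null,
 "path": "/api/orders", "method": "GET", "status": 200, "duration_ms": 12}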
Three pillars, three tools:
- prometheus_flask_exporter exposes /metrics with per-endpoint latency histograms, status-code counters, and in-flight gauges. Scraped by Prometheus, visualized in Grafana.
- OpenTelemetry traces requests through Flask and SQLAlchemy.
- Sentry captures unhandled exceptions, tagged with release and environment.
import os

from prometheus_flask_exporter import PrometheusMetrics
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import sentry_sdk
from sentry_sdk.integrations.flask import FlaskIntegration
def configure_observability(app):
metrics = PrometheusMetrics(app, group_by="endpoint")
metrics.info("app_info", "Flask app", version=os.environ.get("APP_VERSION", "dev"))
    FlaskInstrumentor().instrument_app(app)
    # Flask-SQLAlchemy 3.x only exposes the engine inside an app context.
    with app.app_context():
        SQLAlchemyInstrumentor().instrument(
            engine=app.extensions["sqlalchemy"].engine
        )
sentry_sdk.init(
dsn=os.environ.get("SENTRY_DSN"),
integrations=[FlaskIntegration()],
traces_sample_rate=0.05,
profiles_sample_rate=0.01,
environment=os.environ.get("FLASK_ENV", "production"),
release=os.environ.get("APP_VERSION"),
)
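The conservative traces_sample_rate and profiles_sample_rate keep tracing and profiling overhead (and Sentry ingestion costs) low; raise them temporarily when chasing a specific latency or error pattern.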
- SECRET_KEY — 32+ random bytes, rotated at least annually. Rotation invalidates sessions; use the SECRET_KEY_FALLBACKS list (Flask 2.3+) to roll over without locking everyone out.
- CSRFProtect for any cookie-authenticated form endpoint. Not needed for pure JSON APIs using bearer tokens.
- SESSION_COOKIE_SECURE=True, HTTPONLY=True, SAMESITE="Lax" (or "Strict").
- pip-audit and safety in CI; fail the build on known CVEs. Add Dependabot / Renovate for automated PRs.

Flask-Migrate wraps Alembic. The hard part is not the tool — it's making migrations safe to run while the old version of the app is still serving traffic.
Expand-contract (three deploys per breaking change):
1. Expand: add the new column or table (nullable=True, no default for large tables — fill via backfill job). Deploy app that writes to both old and new.
2. Migrate: backfill the remaining rows, then deploy app that reads from the new column while still writing both.
3. Contract: once nothing references the old column, drop it in a final migration and deploy.

For Postgres: use CREATE INDEX CONCURRENTLY, avoid ALTER TABLE ... ADD COLUMN NOT NULL DEFAULT on large tables (before Postgres 11 this rewrites the whole table), and set lock_timeout on migration sessions so a stuck migration doesn't freeze production.
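A hedged Alembic sketch applying both precautions; the revision identifiers, table, and index names are purely illustrative:
# versions/a1b2c3d4e5f6_add_orders_email_idx.py (illustrative revision)
from alembic import op

revision = "a1b2c3d4e5f6"  # illustrative
down_revision = None

def upgrade():
    # Fail fast instead of queueing behind long-running transactions.
    op.execute("SET lock_timeout = '5s'")
    # CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
    with op.get_context().autocommit_block():
        op.create_index(
            "ix_orders_email",
            "orders",
            ["email"],
            postgresql_concurrently=True,
        )

def downgrade():
    with op.get_context().autocommit_block():
        op.drop_index(
            "ix_orders_email",
            table_name="orders",
            postgresql_concurrently=True,
        )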
Gunicorn handles SIGTERM by closing the listening socket, stopping new request acceptance, and giving workers graceful_timeout seconds to finish in-flight requests. Getting this right in Kubernetes requires coordinating:
- terminationGracePeriodSeconds (pod) > graceful_timeout (gunicorn) > p99 request duration. E.g. 45s / 30s / 10s.
- A preStop hook that sleeps long enough for the endpoints controller to remove the pod from the Service's endpoint list before Gunicorn starts rejecting connections. 10–15 seconds is usually sufficient.
- Drain windows at each hop (ALB deregistration_delay, Nginx worker_shutdown_timeout).

preload_app=True tradeoff — saves memory via copy-on-write but means workers share module-level state. Anything that opens a network connection at import time (DB pools, Kafka producers) must be re-opened in post_fork hooks, otherwise forked workers share the parent's socket and you get bizarre cross-talk bugs.

Vertical — more CPU / RAM per pod, more Gunicorn workers. Hits diminishing returns: the GIL, DB connection pool exhaustion, and NUMA effects all cap per-pod throughput. Rule of thumb: 2–4 CPU per pod, then scale out.
Horizontal — more pods behind the Service. Scales linearly until the database becomes the bottleneck. Plan for this: read replicas, connection pooler (PgBouncer in transaction mode), cache-aside with Redis, materialized views for heavy reads.
Sessions — do not use Flask's default client-side signed
cookie for sessions of any real size. Move to server-side sessions backed by Redis
via Flask-Session; pods become stateless and any pod can handle any
request (no sticky sessions required).
import os
from datetime import timedelta

import redis
from flask_session import Session
app.config.update(
SESSION_TYPE="redis",
SESSION_REDIS=redis.from_url(os.environ["REDIS_URL"]),
SESSION_USE_SIGNER=True,
SESSION_KEY_PREFIX="sess:",
PERMANENT_SESSION_LIFETIME=timedelta(hours=8),
)
Session(app)
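With this in place the cookie carries only the signed session id; the session payload lives in Redis under the sess: prefix, so any pod can serve any request and sessions survive pod restarts for as long as Redis retains them.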
Key tuning knobs:
- Worker count: start at (2 × CPU) + 1; load-test with wrk or k6 and tune. Too few → queueing latency; too many → context-switch overhead and memory bloat. Watch gunicorn_requests_duration_seconds p95 and in-flight saturation.
- Worker class and threads: gthread with 4–8 threads gives cheap concurrency for CPU-lite endpoints without the debugging complexity of gevent's monkey-patching.
- Database connections: pool_size × worker count must not exceed Postgres max_connections. With 4 pods × 9 workers × 10 pool = 360 connections — put PgBouncer in front if max_connections is tight.
- Keepalive: set keepalive_timeout higher than the ALB's idle timeout (default 60s) to avoid 502s from race conditions on connection close.

A well-tuned Flask pod on 1 vCPU / 512MB with gthread workers can comfortably handle 200–500 req/s at sub-50ms p95 for DB-backed JSON endpoints. When you need more, scale horizontally first — it is almost always cheaper than chasing micro-optimizations.