AQE re-plans a Spark query between stages using the actual shuffle statistics rather than the compile-time estimates. It coalesces small post-shuffle partitions into larger ones, converts sort-merge joins to broadcast joins when one side turns out to fit in memory, and splits skewed partitions into sub-partitions so a single hot key cannot stall the stage. It is on by default in DBR 7.3+ and is almost always a free win — disable it only when reproducing a deterministic plan for debugging.
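The relevant session confs, for verifying or pinning AQE explicitly (a sketch; `spark` is the ambient Databricks session, and all of these default to on in recent DBR versions):

```python
# AQE master switch plus the two sub-features described above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

# Only for reproducing a deterministic plan while debugging:
# spark.conf.set("spark.sql.adaptive.enabled", "false")
```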
For Delta data files, 128MB to 1GB is the sweet spot — small enough to parallelize, large enough that per-file overhead does not dominate. Delta auto-tunes via delta.tuneFileSizesForRewrites on tables that see frequent MERGE/UPDATE; for append-only tables, enable optimized writes and auto-compaction, and run periodic OPTIMIZE on tables with steady streaming writes. The classic anti-pattern is thousands of 5MB files from streaming jobs without compaction; those tables become slow from metadata overhead long before they are large.
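These knobs are table properties. A minimal sketch on a hypothetical Delta table `events` (`spark` is the ambient session; property names per the Delta docs):

```python
# Enable rewrite tuning, optimized writes, and auto-compaction on one table.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.tuneFileSizesForRewrites' = 'true',    -- smaller target files on MERGE-heavy tables
      'delta.autoOptimize.optimizeWrite' = 'true',  -- coalesce writes into fewer, larger files
      'delta.autoOptimize.autoCompact' = 'true'     -- compact small files after writes
    )
""")

# Periodic compaction for steady streaming appends:
spark.sql("OPTIMIZE events")
```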
Photon is Databricks' vectorized C++ execution engine that replaces parts of the Spark JVM execution layer for SQL and DataFrame workloads. It dramatically speeds up scans, joins, aggregations, and writes on Delta — typical 2-3x speedup on aggregation-heavy SQL. It does nothing for Python UDFs (which still bounce through the JVM-Python bridge) and adds no value for tiny queries where engine overhead is irrelevant. Use it for SQL warehouses, ETL on large Delta tables, and BI workloads; skip it for clusters dominated by Python UDF logic.
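Photon itself is a cluster/warehouse setting, not a code API, but you can check whether it actually covered a query: on a Photon-enabled cluster, Photon operators show up in the physical plan with a Photon prefix. A sketch, assuming a hypothetical `sales` table and a live session:

```python
# Requires a running, Photon-enabled Databricks cluster; `spark` is the ambient session.
df = spark.sql("SELECT region, count(*) AS cnt FROM sales GROUP BY region")
df.explain(mode="formatted")
# Photon-prefixed operators (e.g. PhotonGroupingAgg) in the output mean Photon ran
# that step; plain Spark operator names mean it fell back to the JVM engine —
# the usual culprit being a Python UDF in the pipeline.
```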
Force the broadcast when you know one side fits in executor memory but Catalyst will not pick it — usually because table statistics are stale, the data comes from a UDF that does not expose row counts, or the threshold is conservative. Use broadcast(df) in PySpark or /*+ BROADCAST(t) */ in SQL. The risk is that "fits in memory" is wrong: if the small side actually has 5GB of data, the broadcast OOMs the driver during collection or the executors during distribution. Always check actual row counts first; do not broadcast based on assumptions.
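A sketch of the check-then-hint pattern; `facts` and `dims` are hypothetical DataFrames in a live Spark session, and the row-count guardrail is an illustrative number, not an official threshold:

```python
from pyspark.sql.functions import broadcast

# Verify the small side really is small before forcing the broadcast.
dim_rows = dims.count()
if dim_rows < 10_000_000:
    joined = facts.join(broadcast(dims), "customer_id")   # force broadcast-hash join
else:
    joined = facts.join(dims, "customer_id")              # let Catalyst pick sort-merge

# SQL equivalent of the hint:
# spark.sql("SELECT /*+ BROADCAST(d) */ * FROM facts f JOIN dims d USING (customer_id)")
```

A row count is a coarse proxy for byte size; for wide rows, check the on-disk size of the table as well before trusting "it fits".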
In the Spark UI, look at the stage's task duration distribution — if the max task is 10x the median, you have skew. Confirm with shuffle read sizes per task. The first fix is to enable AQE skew handling (spark.sql.adaptive.skewJoin.enabled), which auto-splits the heavy partitions. If that is not enough, salt the skewed key (append a random suffix to the heavy side, explode the other side over the suffix range and join) or split the hot keys into a separate broadcast join and union the result. Salting is heavyweight; reserve it for the rare cases AQE cannot resolve.
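The salting mechanics are easy to get backwards, so here is the bookkeeping in plain Python (no Spark needed, names are illustrative): salt the heavy side with a random suffix, replicate the other side across the full suffix range, and join on the composite (key, salt):

```python
import random

SALTS = 8  # number of sub-partitions to split each hot key into

# Heavy side: most rows share the hot key "acme".
heavy = [("acme", v) for v in range(1000)] + [("tiny", 1)]
# Other side: one row per key; it must be exploded over every salt value
# so each salted heavy row still finds its match.
other = [("acme", "Acme Corp"), ("tiny", "Tiny LLC")]

salted_heavy = [((k, random.randrange(SALTS)), v) for k, v in heavy]
exploded_other = [((k, s), name) for k, name in other for s in range(SALTS)]

# Joining on (key, salt) gives the same rows as the plain join on key,
# but the "acme" rows are now spread across up to SALTS partitions.
lookup = dict(exploded_other)
joined = [(k, v, lookup[(k, s)]) for (k, s), v in salted_heavy]

assert len(joined) == len(heavy)  # no rows lost or duplicated
```

The cost is visible here: the other side grows by a factor of SALTS, which is why salting is reserved for keys AQE cannot split.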
In order: enable Photon on the SQL warehouse or all-purpose cluster; turn on AQE (default in modern DBR but worth verifying); ensure Delta auto-compaction and optimized writes are on for tables that get incremental writes; add Liquid Clustering or Z-Order on the columns that drive WHERE and JOIN predicates; right-size the cluster (over-provisioning hides skew, under-provisioning starves parallelism). Only after these should you reach for query rewrites, broadcast hints, or salting. Most "slow Databricks" tickets resolve from this list before any code changes.