Performance Optimization — Databricks & Tableau

What steps would you take to optimize the performance of a Databricks job?
How would you handle data skew in a Databricks job?
What are the best practices for writing efficient ETL pipelines in Databricks?
How would you optimize a Tableau dashboard for performance when dealing with large datasets?


Common Interview Questions:

What does Adaptive Query Execution actually do?

AQE re-plans a Spark query between stages using the actual shuffle statistics rather than the compile-time estimates. It coalesces small post-shuffle partitions into larger ones, converts sort-merge joins to broadcast joins when one side turns out to fit in memory, and splits skewed partitions into sub-partitions so a single hot key cannot stall the stage. It is on by default in DBR 7.3+ and is almost always a free win — disable it only when reproducing a deterministic plan for debugging.
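
A minimal PySpark sketch of the relevant session settings (these are the standard Spark 3.x config names; the values shown are illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # AQE master switch, already true by default on DBR 7.3+ / Spark 3.2+
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Coalesce many small post-shuffle partitions into fewer, larger ones
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split oversized (skewed) shuffle partitions so one hot key cannot stall the stage
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    # Advisory size AQE aims for when coalescing or splitting partitions
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")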

What is the right target file size for a Delta table and how do you reach it?

128 MB to 1 GB per file is the sweet spot — small enough to parallelize, large enough that per-file overhead does not dominate. Delta auto-tunes file sizes via delta.tuneFileSizesForRewrites on tables that see frequent MERGE/UPDATE; for append-only tables, enable optimized writes and auto compaction, and run periodic OPTIMIZE on tables with steady streaming writes. The classic anti-pattern is thousands of 5 MB files from a streaming job without compaction; those tables become slow from metadata overhead long before they are large.
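
A sketch of the table-level knobs, assuming a hypothetical table named events and the ambient spark session of a Databricks notebook (the properties are the documented Delta table properties):

    # "events" is a placeholder table name.
    spark.sql("""
        ALTER TABLE events SET TBLPROPERTIES (
            'delta.autoOptimize.optimizeWrite' = 'true',  -- bin-pack writes into larger files
            'delta.autoOptimize.autoCompact'   = 'true',  -- compact small files after writes
            'delta.tuneFileSizesForRewrites'   = 'true'   -- smaller targets on MERGE-heavy tables
        )
    """)

    # Periodic compaction for tables fed by steady streaming appends
    spark.sql("OPTIMIZE events")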

What is Photon and when does it help?

Photon is Databricks' vectorized C++ execution engine that replaces parts of the Spark JVM execution layer for SQL and DataFrame workloads. It dramatically speeds up scans, joins, aggregations, and writes on Delta — typical 2-3x speedup on aggregation-heavy SQL. It does nothing for Python UDFs (which still bounce through the JVM-Python bridge) and adds no value for tiny queries where engine overhead is irrelevant. Use it for SQL warehouses, ETL on large Delta tables, and BI workloads; skip it for clusters dominated by Python UDF logic.
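
Photon itself is a cluster/warehouse setting rather than code, but whether it can help often comes down to avoiding Python UDFs. A hedged sketch of the kind of rewrite that keeps work inside the engine (df and its column names are made up for illustration):

    from pyspark.sql import functions as F

    # UDF version: every row crosses the JVM <-> Python boundary, so Photon cannot cover it.
    @F.udf("double")
    def net_amount(amount, discount):
        return amount * (1.0 - discount)

    df_udf = df.withColumn("net", net_amount("amount", "discount"))

    # Built-in expression version: stays inside the engine and is Photon-eligible.
    df_native = df.withColumn("net", F.col("amount") * (1.0 - F.col("discount")))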

When do you force a broadcast join with a hint?

When you know one side fits in executor memory but Catalyst will not pick the broadcast — usually because table statistics are stale, the data comes from a UDF that does not expose row counts, or the threshold is conservative. Use broadcast(df) in PySpark or /*+ BROADCAST(t) */ in SQL. The risk is that "fits in memory" turns out to be wrong: if the "small" side actually holds 5 GB, the broadcast OOMs the driver during collection or the executors during distribution. Always check actual row counts first; do not broadcast based on assumptions.
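
A short sketch of the check-then-hint pattern (facts, dim_customers, and customer_id are placeholder names):

    from pyspark.sql import functions as F

    dim = spark.table("dim_customers")

    # Check the real size of the "small" side before forcing anything.
    print(dim.count())

    # Force the broadcast only once you know it fits comfortably in memory.
    joined = facts.join(F.broadcast(dim), "customer_id")

    # SQL form of the same hint:
    # SELECT /*+ BROADCAST(d) */ *
    # FROM facts f JOIN dim_customers d ON f.customer_id = d.customer_id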

How do you detect and fix data skew in a Spark job?

In the Spark UI, look at the stage's task duration distribution — if the max task is 10x the median, you have skew. Confirm with shuffle read sizes per task. The first fix is to enable AQE skew handling (spark.sql.adaptive.skewJoin.enabled), which auto-splits the heavy partitions. If that is not enough, salt the skewed key (append a random suffix to the heavy side, explode the other side over the suffix range and join) or split the hot keys into a separate broadcast join and union the result. Salting is heavyweight; reserve it for the rare cases AQE cannot resolve.
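
A minimal salting sketch for the heavyweight case, assuming placeholder DataFrames facts (skewed side) and dim joined on customer_id:

    from pyspark.sql import functions as F

    N = 16  # salt fan-out; size it to the observed skew

    # Heavy (skewed) side: tag each row with a random salt in [0, N).
    facts_salted = facts.withColumn("salt", (F.rand() * N).cast("int"))

    # Other side: replicate each row across all N salt values so every salted
    # key on the heavy side still finds its match after the join.
    dim_salted = dim.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(N)])))

    # Join on (key, salt), then drop the helper column.
    joined = facts_salted.join(dim_salted, ["customer_id", "salt"]).drop("salt")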

What are the highest-leverage optimizations to apply before tuning anything else?

In order: enable Photon on the SQL warehouse or all-purpose cluster; turn on AQE (default in modern DBR but worth verifying); ensure Delta auto-compaction and optimized writes are on for tables that get incremental writes; add Liquid Clustering or Z-Order on the columns that drive WHERE and JOIN predicates; right-size the cluster (over-provisioning hides skew, under-provisioning starves parallelism). Only after these should you reach for query rewrites, broadcast hints, or salting. Most "slow Databricks" tickets resolve from this list before any code changes.
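
A sketch of what checking and applying the notebook-side items looks like (Photon is enabled in the cluster or SQL warehouse configuration, not in code; events, event_date, and customer_id are placeholders):

    # AQE should already be on by default on modern DBR; verify rather than assume.
    print(spark.conf.get("spark.sql.adaptive.enabled"))

    # Session defaults for tables that receive incremental writes.
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

    # Cluster on the columns that drive WHERE and JOIN predicates.
    spark.sql("ALTER TABLE events CLUSTER BY (event_date, customer_id)")   # Liquid Clustering
    # or, on tables not using Liquid Clustering:
    # spark.sql("OPTIMIZE events ZORDER BY (event_date, customer_id)")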