AQE re-plans a Spark query between stages using the actual shuffle statistics rather than the compile-time estimates. It coalesces small post-shuffle partitions into larger ones, converts sort-merge joins to broadcast joins when one side turns out to fit in memory, and splits skewed partitions into sub-partitions so a single hot key cannot stall the stage. It is on by default in DBR 7.3+ and is almost always a free win — disable it only when reproducing a deterministic plan for debugging.
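The relevant session confs, for verifying or pinning AQE explicitly (a sketch; `spark` is the ambient Databricks session, and all of these default to on in recent DBR versions):

```python
# AQE master switch plus the two sub-features described above.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions

# Only for reproducing a deterministic plan while debugging:
# spark.conf.set("spark.sql.adaptive.enabled", "false")
```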
For Delta data files, 128MB to 1GB is the sweet spot — small enough to parallelize, large enough that per-file overhead does not dominate. Delta auto-tunes via delta.tuneFileSizesForRewrites on tables that see frequent MERGE/UPDATE; for append-only tables, enable optimized writes and auto-compaction, and run periodic OPTIMIZE on tables with steady streaming writes. The classic anti-pattern is thousands of 5MB files from streaming jobs without compaction; those tables become slow from metadata overhead long before they are large.
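These knobs are table properties. A minimal sketch on a hypothetical Delta table `events` (`spark` is the ambient session; property names per the Delta docs):

```python
# Enable rewrite tuning, optimized writes, and auto-compaction on one table.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.tuneFileSizesForRewrites' = 'true',    -- smaller target files on MERGE-heavy tables
      'delta.autoOptimize.optimizeWrite' = 'true',  -- coalesce writes into fewer, larger files
      'delta.autoOptimize.autoCompact' = 'true'     -- compact small files after writes
    )
""")

# Periodic compaction for steady streaming appends:
spark.sql("OPTIMIZE events")
```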
Photon is Databricks' vectorized C++ execution engine that replaces parts of the Spark JVM execution layer for SQL and DataFrame workloads. It dramatically speeds up scans, joins, aggregations, and writes on Delta — typical 2-3x speedup on aggregation-heavy SQL. It does nothing for Python UDFs (which still bounce through the JVM-Python bridge) and adds no value for tiny queries where engine overhead is irrelevant. Use it for SQL warehouses, ETL on large Delta tables, and BI workloads; skip it for clusters dominated by Python UDF logic.
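Photon itself is a cluster/warehouse setting, not a code API, but you can check whether it actually covered a query: on a Photon-enabled cluster, Photon operators show up in the physical plan with a Photon prefix. A sketch, assuming a hypothetical `sales` table and a live session:

```python
# Requires a running, Photon-enabled Databricks cluster; `spark` is the ambient session.
df = spark.sql("SELECT region, count(*) AS cnt FROM sales GROUP BY region")
df.explain(mode="formatted")
# Photon-prefixed operators (e.g. PhotonGroupingAgg) in the output mean Photon ran
# that step; plain Spark operator names mean it fell back to the JVM engine —
# the usual culprit being a Python UDF in the pipeline.
```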
Force the broadcast when you know one side fits in executor memory but Catalyst will not pick it — usually because table statistics are stale, the data comes from a UDF that does not expose row counts, or the threshold is conservative. Use broadcast(df) in PySpark or /*+ BROADCAST(t) */ in SQL. The risk is that "fits in memory" is wrong: if the small side actually has 5GB of data, the broadcast OOMs the driver during collection or the executors during distribution. Always check actual row counts first; do not broadcast based on assumptions.
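A sketch of the check-then-hint pattern; `facts` and `dims` are hypothetical DataFrames in a live Spark session, and the row-count guardrail is an illustrative number, not an official threshold:

```python
from pyspark.sql.functions import broadcast

# Verify the small side really is small before forcing the broadcast.
dim_rows = dims.count()
if dim_rows < 10_000_000:
    joined = facts.join(broadcast(dims), "customer_id")   # force broadcast-hash join
else:
    joined = facts.join(dims, "customer_id")              # let Catalyst pick sort-merge

# SQL equivalent of the hint:
# spark.sql("SELECT /*+ BROADCAST(d) */ * FROM facts f JOIN dims d USING (customer_id)")
```

A row count is a coarse proxy for byte size; for wide rows, check the on-disk size of the table as well before trusting "it fits".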
In the Spark UI, look at the stage's task duration distribution — if the max task is 10x the median, you have skew. Confirm with shuffle read sizes per task. The first fix is to enable AQE skew handling (spark.sql.adaptive.skewJoin.enabled), which auto-splits the heavy partitions. If that is not enough, salt the skewed key (append a random suffix to the heavy side, explode the other side over the suffix range and join) or split the hot keys into a separate broadcast join and union the result. Salting is heavyweight; reserve it for the rare cases AQE cannot resolve.
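The salting mechanics are easy to get backwards, so here is the bookkeeping in plain Python (no Spark needed, names are illustrative): salt the heavy side with a random suffix, replicate the other side across the full suffix range, and join on the composite (key, salt):

```python
import random

SALTS = 8  # number of sub-partitions to split each hot key into

# Heavy side: most rows share the hot key "acme".
heavy = [("acme", v) for v in range(1000)] + [("tiny", 1)]
# Other side: one row per key; it must be exploded over every salt value
# so each salted heavy row still finds its match.
other = [("acme", "Acme Corp"), ("tiny", "Tiny LLC")]

salted_heavy = [((k, random.randrange(SALTS)), v) for k, v in heavy]
exploded_other = [((k, s), name) for k, name in other for s in range(SALTS)]

# Joining on (key, salt) gives the same rows as the plain join on key,
# but the "acme" rows are now spread across up to SALTS partitions.
lookup = dict(exploded_other)
joined = [(k, v, lookup[(k, s)]) for (k, s), v in salted_heavy]

assert len(joined) == len(heavy)  # no rows lost or duplicated
```

The cost is visible here: the other side grows by a factor of SALTS, which is why salting is reserved for keys AQE cannot split.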
In order: enable Photon on the SQL warehouse or all-purpose cluster; turn on AQE (default in modern DBR but worth verifying); ensure Delta auto-compaction and optimized writes are on for tables that get incremental writes; add Liquid Clustering or Z-Order on the columns that drive WHERE and JOIN predicates; right-size the cluster (over-provisioning hides skew, under-provisioning starves parallelism). Only after these should you reach for query rewrites, broadcast hints, or salting. Most "slow Databricks" tickets resolve from this list before any code changes.