Databricks Optimization Guide

Understanding Databricks Performance

Databricks is a unified analytics platform built on Apache Spark that provides a collaborative environment for data engineering, data science, and machine learning workloads. Performance in Databricks depends on several interconnected factors including cluster configuration, data layout, query optimization, and proper use of platform-specific features.

The platform offers both automatic optimizations (enabled by default in Databricks Runtime 10.4 LTS and above) and manual tuning options. Key performance drivers include:

Cluster Resources: CPU, memory, and storage allocation across worker nodes
Data Layout: How data is physically organized in Delta Lake tables
Query Execution: How Spark processes and optimizes queries
Caching: Storing frequently accessed data closer to compute
Parallelism: Efficient distribution of work across the cluster

Key Performance Challenges

Data Skew: Uneven data distribution causing some partitions to be significantly larger than others, leading to resource imbalances
Small Files Problem: Too many tiny files create overhead from opening/closing files and metadata management
Inefficient Joins: Poorly optimized join operations causing excessive data shuffling
Under/Over-Provisioned Clusters: Mismatched resources leading to wasted spend or slow performance
Suboptimal Query Plans: Queries not leveraging available optimizations

Strategies to Increase Databricks Performance

1. Cluster Configuration Optimization

Proper cluster sizing is the foundation of performance optimization:

Right-size your cluster: Match worker count and instance types to workload requirements
- Use compute-optimized instances (e.g., AWS C5) for CPU-intensive ETL pipelines
- Use memory-optimized instances (e.g., R5) for ML workloads with large in-memory datasets
Enable Autoscaling: Set minimum and maximum worker counts to dynamically adjust to workload demands
- Prevents over-provisioning during low-activity periods
- Ensures adequate resources during peak times
Use Databricks Pools: Pre-allocate instances to reduce cluster start and autoscaling times
- Idle instances in pools only incur VM costs, not DBU costs
- Recommended for workloads with tight SLAs
Leverage Spot Instances: Use for non-critical jobs to save 70-90% on compute costs
Use Latest Databricks Runtime: Always use the newest LTS version for performance enhancements

2. Enable Photon Engine

Photon is Databricks' next-generation vectorized query engine built in C++ that accelerates SQL and DataFrame workloads:

Performance Gains: Delivers 2-10x speedups for analytical queries, with some ETL workloads running up to 15x faster
How to Enable:
- Check "Use Photon Acceleration" in cluster configuration UI
- Use Databricks Runtime 9.1 LTS or above
- For API: set "runtime_engine": "PHOTON"
Best Use Cases:
- ETL pipelines
- Large-scale analytical queries
- BI dashboards using SQL
- Feature engineering jobs
Limitations:
- Does not accelerate Python UDFs
- Limited benefit for iterative ML training loops

3. Delta Lake Optimizations

File Size Optimization

Target file size: Between 16MB and 1GB for optimal query performance
OPTIMIZE command: Compacts small files into larger ones (up to 1GB)
```
OPTIMIZE table_name;
```

Auto Optimize: Enable automatic compaction during writes

-- Table properties
ALTER TABLE table_name SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);

Unity Catalog Auto-tuning: Databricks Runtime 11.3+ automatically optimizes file sizes for managed tables

Z-Ordering

What it does: Physically reorganizes data within a Delta table based on specified columns, co-locating related data together
When to use: When queries frequently filter on specific columns

Syntax:

OPTIMIZE table_name ZORDER BY (column1, column2);

Best practices:
- Choose high-cardinality columns used in WHERE clauses
- Limit to 2-4 columns for best results
- Run regularly as part of maintenance jobs

Liquid Clustering (Recommended for New Tables)

What it does: Dynamically and continuously reorganizes data based on clustering keys without static partitions
Advantages over Z-Ordering:
- Incremental optimization - only reorganizes unclustered data
- Adapts to changing query patterns
- More efficient writes
- Works across entire table dynamically

Syntax:

-- Create table with liquid clustering
CREATE TABLE table_name CLUSTER BY (column1, column2);

-- Add to existing table
ALTER TABLE table_name CLUSTER BY (column1, column2);

-- Run optimization
OPTIMIZE table_name;

Guidelines:
- Best for tables over 1TB
- Keep clustering keys to 1-4 columns
- Not compatible with partitioning or ZORDER on same table

Table Partitioning

When to use: Tables larger than 1TB with predictable query patterns
Best practices:
- Do NOT partition tables under 1TB
- Choose low-cardinality columns (e.g., date, region)
- Apply filters on partition columns early in queries

Syntax:

CREATE TABLE table_name
PARTITIONED BY (date_column)
AS SELECT * FROM source_table;

4. Caching Strategies

Disk Cache (Delta Cache)

What it does: Stores frequently accessed data on local SSDs of worker nodes
Benefits: Reduces latency by avoiding repeated remote file reads

Enable:

SET spark.databricks.io.cache.enabled = true;

Best for: Repeated reads of Parquet/Delta files

Spark Cache

What it does: Stores DataFrames in memory for reuse
When to use: Iterative algorithms or when same data is accessed multiple times

Syntax:

df.cache()
df.persist(StorageLevel.MEMORY_AND_DISK)

Important: Unpersist when no longer needed to free memory

5. Query Optimization

Adaptive Query Execution (AQE)

What it does: Automatically optimizes query plans at runtime based on actual data statistics
Features:
- Dynamically coalesces shuffle partitions
- Converts sort-merge joins to broadcast joins when beneficial
- Optimizes skewed joins
Enable (on by default in Spark 3.0+):
```
SET spark.sql.adaptive.enabled = true;
```

Cost-Based Optimizer (CBO)

What it does: Uses table statistics to choose optimal query execution plans

Collect statistics:

ANALYZE TABLE table_name COMPUTE STATISTICS FOR ALL COLUMNS;

Best practices:
- Run ANALYZE TABLE regularly, especially after large data changes
- Run as a separate maintenance job, not during ETL
- Use EXPLAIN to verify optimizer is using statistics

Predicate Pushdown and Partition Pruning

Apply filters early: Place filter conditions immediately after reading the table

Filter on partition columns: Enables partition pruning to skip irrelevant data

-- Good: Filter immediately after read
SELECT * FROM table 
WHERE partition_col = 'value' 
  AND other_col > 100;

-- Apply filters before joins
SELECT * FROM table_a a
JOIN table_b b ON a.id = b.id
WHERE a.partition_col = 'value';

Dynamic File Pruning

What it does: Skips directories that don't contain matching data files
Enabled by default: In Databricks Runtime 10.4 LTS and above

6. Join Optimization

Broadcast Joins: Use for small tables (less than 10MB by default)

-- Hint to broadcast smaller table
SELECT /*+ BROADCAST(small_table) */ *
FROM large_table
JOIN small_table ON large_table.id = small_table.id;

Avoid Cartesian Products: Always use explicit join conditions
Handle Data Skew:
- Use AQE's skew join optimization
- Salting technique for severely skewed keys
- Filter out null keys before joining if not needed
Range Join Optimization: Manually tune settings for range joins using bin size configuration

7. Shuffle Optimization

Tune shuffle partitions: Default is 200, but optimal value depends on data size
```
SET spark.sql.shuffle.partitions = 400; -- Adjust based on workload
```
Low Shuffle Merge: Reduces files rewritten during MERGE operations (enabled by default)
Minimize shuffles:
- Use broadcast joins when possible
- Partition data by join keys
- Avoid unnecessary repartitioning

8. Code-Level Best Practices

Avoid Python/Scala UDFs when native functions exist:
- UDFs require serialization between Python and Spark
- Use built-in Spark SQL functions instead
- Use higher-order functions for array operations
Use DataFrame/SQL APIs over RDDs: Enables Catalyst optimizer to work effectively
Prefer Managed Tables: Unity Catalog managed tables get automatic predictive optimization
Use Delta Lake format: Provides ACID transactions, time travel, and optimization features

9. Predictive Optimization (Unity Catalog)

What it does: Automatically identifies and runs maintenance operations on tables
Benefits:
- Eliminates manual maintenance scheduling
- Optimizes based on actual query patterns
- Typically provides significant performance improvements
Enable: At account, catalog, or schema level in Unity Catalog settings

10. Regular Maintenance

OPTIMIZE: Run regularly to compact files and maintain clustering
```
OPTIMIZE table_name;
```

VACUUM: Remove old data files no longer needed

VACUUM table_name RETAIN 168 HOURS; -- 7 days

Schedule maintenance:
- Run during off-peak hours
- Use separate dedicated clusters for maintenance jobs
- Automate with Databricks workflows

Performance Optimization Checklist

Cluster Setup
- Use latest Databricks Runtime LTS
- Enable Photon for analytical workloads
- Configure appropriate autoscaling
- Select right instance types for workload
Data Layout
- Use Delta Lake format
- Implement Liquid Clustering for new tables
- Or use Z-Ordering for existing tables
- Maintain optimal file sizes (16MB-1GB)
Query Design
- Apply filters early
- Use native functions over UDFs
- Leverage broadcast joins for small tables
- Keep table statistics updated
Caching
- Enable disk cache for frequently accessed data
- Use Spark cache for iterative processing
- Monitor cache hit rates
Maintenance
- Schedule regular OPTIMIZE jobs
- Run VACUUM to clean up old files
- Update statistics with ANALYZE TABLE
- Enable Predictive Optimization if using Unity Catalog

Monitoring and Troubleshooting

Use Spark UI: Analyze query plans, stage details, and task distribution
Check for data skew: Look for uneven task durations in Spark UI
Review EXPLAIN output: Verify optimizer is making good choices
```
EXPLAIN EXTENDED SELECT * FROM table WHERE condition;
```
Monitor cluster metrics: CPU, memory, and I/O utilization
Enable pipeline logging: Track performance over time to identify bottlenecks
Use cluster tagging: Map resource usage to teams and workloads for cost tracking

Expected Performance Improvements

When properly implementing these optimization techniques, organizations typically see:

300-800% performance improvements through intelligent cluster sizing, data skew prevention, and query optimization
40-60% reduction in compute costs through proper resource allocation and autoscaling
2-10x faster queries with Photon engine for analytical workloads
Significant I/O reduction through Z-Ordering and Liquid Clustering