Databricks Optimization Guide
Understanding Databricks Performance
Databricks is a unified analytics platform built on Apache Spark that provides a collaborative environment for data engineering, data science, and machine learning workloads. Performance in Databricks depends on several interconnected factors including cluster configuration, data layout, query optimization, and proper use of platform-specific features.
The platform offers both automatic optimizations (enabled by default in Databricks Runtime 10.4 LTS and above) and manual tuning options. Key performance drivers include:
- Cluster Resources: CPU, memory, and storage allocation across worker nodes
- Data Layout: How data is physically organized in Delta Lake tables
- Query Execution: How Spark processes and optimizes queries
- Caching: Storing frequently accessed data closer to compute
- Parallelism: Efficient distribution of work across the cluster
Key Performance Challenges
- Data Skew: Uneven data distribution causing some partitions to be significantly larger than others, leading to resource imbalances
- Small Files Problem: Too many tiny files create overhead from opening/closing files and metadata management
- Inefficient Joins: Poorly optimized join operations causing excessive data shuffling
- Under/Over-Provisioned Clusters: Mismatched resources leading to wasted spend or slow performance
- Suboptimal Query Plans: Queries not leveraging available optimizations
Strategies to Increase Databricks Performance
1. Cluster Configuration Optimization
Proper cluster sizing is the foundation of performance optimization:
- Right-size your cluster: Match worker count and instance types to workload requirements
  - Use compute-optimized instances (e.g., AWS C5) for CPU-intensive ETL pipelines
  - Use memory-optimized instances (e.g., R5) for ML workloads with large in-memory datasets
- Enable Autoscaling: Set minimum and maximum worker counts to dynamically adjust to workload demands
  - Prevents over-provisioning during low-activity periods
  - Ensures adequate resources during peak times
- Use Databricks Pools: Pre-allocate instances to reduce cluster start and autoscaling times
  - Idle instances in pools incur only VM costs, not DBU costs
  - Recommended for workloads with tight SLAs
- Leverage Spot Instances: Use them for non-critical jobs to save 70-90% on compute costs
- Use the Latest Databricks Runtime: New LTS releases ship ongoing performance enhancements
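To make this concrete, here is a minimal sketch (not a definitive recipe) that creates an autoscaling cluster through the Databricks Clusters REST API. The workspace URL, token, runtime version, and instance type are placeholder assumptions to adapt; the spec also sets the runtime engine to Photon, covered in the next section.

```python
# Minimal sketch: create an autoscaling, Photon-enabled cluster via the
# Databricks Clusters API. Workspace URL, token, runtime version, and
# instance type are placeholders -- substitute values for your environment.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                                # placeholder

cluster_spec = {
    "cluster_name": "etl-autoscaling",
    "spark_version": "14.3.x-scala2.12",        # example LTS runtime
    "node_type_id": "c5.2xlarge",               # compute-optimized for ETL
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "runtime_engine": "PHOTON",                 # enables Photon (see section 2)
    # "instance_pool_id": "<pool-id>",          # optional: draw nodes from a pool
    "aws_attributes": {
        "first_on_demand": 1,                   # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK",   # run workers on spot to cut costs
    },
    "autotermination_minutes": 30,              # avoid paying for idle clusters
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```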
2. Enable Photon Engine
Photon is Databricks' next-generation vectorized query engine built in C++ that accelerates SQL and DataFrame workloads:
- Performance Gains: Delivers 2-10x speedups for analytical queries, with some ETL workloads running up to 15x faster
- How to Enable:
  - Check "Use Photon Acceleration" in the cluster configuration UI
  - Use Databricks Runtime 9.1 LTS or above
  - For the API: set "runtime_engine": "PHOTON" in the cluster spec, as in the cluster-creation sketch above
- Best Use Cases:
  - ETL pipelines
  - Large-scale analytical queries
  - BI dashboards using SQL
  - Feature engineering jobs
- Limitations:
  - Does not accelerate Python UDFs
  - Limited benefit for iterative ML training loops
3. Delta Lake Optimizations
File Size Optimization
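Compacting many small files into fewer large ones (the checklist below targets roughly 16MB-1GB) reduces file-open and metadata overhead. A minimal sketch, assuming the notebook-provided spark session and a hypothetical sales table; the 128mb target and the auto-optimize properties are illustrative choices:

```python
# Minimal sketch: set a target file size, enable write-time bin-packing and
# auto-compaction, then compact existing files. Table name is illustrative.
spark.sql("""
    ALTER TABLE sales SET TBLPROPERTIES (
        'delta.targetFileSize' = '128mb',            -- target size for OPTIMIZE
        'delta.autoOptimize.optimizeWrite' = 'true', -- bin-pack files at write time
        'delta.autoOptimize.autoCompact'   = 'true'  -- compact small files after writes
    )
""")
spark.sql("OPTIMIZE sales")  # rewrites small files into larger ones
```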
Z-Ordering
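Z-Ordering co-locates related values in the same files so that queries filtering on those columns can skip more data. A minimal sketch, assuming a hypothetical events table and typical filter columns:

```python
# Minimal sketch: Z-order an existing Delta table on common filter columns.
# Table and column names are illustrative; keep the column list short.
spark.sql("OPTIMIZE events ZORDER BY (user_id, event_date)")
```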
Liquid Clustering (Recommended for New Tables)
- What it does: Dynamically and continuously reorganizes data based on clustering keys, without static partitions
- Advantages over Z-Ordering:
  - Incremental optimization - only reorganizes unclustered data
  - Adapts to changing query patterns
  - More efficient writes
  - Works across the entire table dynamically
- Syntax:
  -- Create a table with liquid clustering
  CREATE TABLE table_name (column1 INT, column2 STRING) CLUSTER BY (column1, column2);
  -- Add clustering to an existing table
  ALTER TABLE table_name CLUSTER BY (column1, column2);
  -- Run optimization
  OPTIMIZE table_name;
- Guidelines:
  - Best for tables over 1TB
  - Keep clustering keys to 1-4 columns
  - Not compatible with partitioning or ZORDER on the same table
Table Partitioning
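Partitioning splits a table into directories by column value; it suits very large tables with low-cardinality keys and, per the guideline above, cannot be combined with liquid clustering on the same table. A minimal sketch with illustrative table and column names:

```python
# Minimal sketch: write a Delta table partitioned by a low-cardinality
# date column. Source table and names are illustrative.
df = spark.table("events")
(df.write
   .format("delta")
   .partitionBy("event_date")   # choose a low-cardinality column
   .mode("overwrite")
   .saveAsTable("events_partitioned"))
```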
4. Caching Strategies
Disk Cache (Delta Cache)
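The disk cache keeps local copies of remote Parquet/Delta data on workers' SSDs; on many instance types it is on by default. A minimal sketch to enable it explicitly and optionally pre-warm it (table name is illustrative):

```python
# Minimal sketch: enable the Databricks disk cache for this session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optionally pre-warm the cache for a hot table (Databricks SQL extension).
spark.sql("CACHE SELECT * FROM sales")
```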
Spark Cache
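Spark's own cache materializes a DataFrame in executor memory, which pays off when the same intermediate result feeds several actions. A minimal sketch with illustrative names:

```python
# Minimal sketch: cache an intermediate DataFrame that is reused across
# several actions (e.g., iterative processing), then release it.
features = spark.table("events").groupBy("user_id").count()
features.cache()       # marked for caching; materialized lazily
features.count()       # first action triggers the actual caching
# ... reuse `features` in multiple downstream computations ...
features.unpersist()   # free executor memory when done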
5. Query Optimization
Adaptive Query Execution (AQE)
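AQE re-optimizes query plans at runtime using shuffle statistics and is enabled by default in recent runtimes. A minimal sketch that makes the relevant settings explicit:

```python
# Minimal sketch: AQE is on by default in recent runtimes; these settings
# spell out the knobs most relevant to this guide.
spark.conf.set("spark.sql.adaptive.enabled", "true")                     # re-optimize plans at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") # merge small shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")           # split skewed partitions in joins
```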
Cost-Based Optimizer (CBO)
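The CBO relies on table and column statistics to choose join strategies and orders, so keeping them fresh matters. A minimal sketch with an illustrative table name:

```python
# Minimal sketch: refresh statistics so the cost-based optimizer can pick
# better join orders and strategies. Table name is illustrative.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")
```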
Predicate Pushdown and Partition Pruning
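Filtering early on partition columns lets Spark prune entire partitions, and selecting only the needed columns pushes column pruning into the scan. A minimal sketch reusing the illustrative partitioned table from above:

```python
# Minimal sketch: filter on the partition column as early as possible so
# Spark prunes partitions and pushes the predicate into the file scan.
recent = (spark.table("events_partitioned")
          .filter("event_date >= '2024-01-01'")  # prunes whole partitions
          .select("user_id", "event_date"))      # column pruning: read fewer columns
```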
Dynamic File Pruning
- What it does: Skips data files that cannot contain matching rows, using filters derived at runtime (for example, from the keys of a selective join)
- Enabled by default: In Databricks Runtime 10.4 LTS and above
6. Join Optimization
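One core technique here, echoed in the checklist below, is broadcasting a small dimension table so the large side of the join is never shuffled. A minimal sketch with illustrative table names:

```python
from pyspark.sql import functions as F

# Minimal sketch: broadcast the small dimension table so the large fact
# table joins without a shuffle. Table names are illustrative.
facts = spark.table("sales")    # large fact table
dims = spark.table("stores")    # small dimension table
joined = facts.join(F.broadcast(dims), on="store_id", how="left")
```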
7. Shuffle Optimization
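A frequently tuned knob is the shuffle partition count; a common rule of thumb is a small multiple of the total worker cores. A minimal sketch (the value 400 is an illustrative assumption):

```python
# Minimal sketch: match shuffle parallelism to cluster size (a common rule
# of thumb is ~2-3x the total worker cores).
spark.conf.set("spark.sql.shuffle.partitions", "400")
# On Databricks with AQE enabled, "auto" lets the platform choose instead:
# spark.conf.set("spark.sql.shuffle.partitions", "auto")
```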
8. Code-Level Best Practices
- Avoid Python/Scala UDFs when native functions exist (see the sketch after this list):
  - Python UDFs require row-by-row serialization between the JVM and the Python worker, and are opaque to the Catalyst optimizer
  - Use built-in Spark SQL functions instead
  - Use higher-order functions for array operations
- Use DataFrame/SQL APIs over RDDs: Enables Catalyst optimizer to work effectively
- Prefer Managed Tables: Unity Catalog managed tables get automatic predictive optimization
- Use Delta Lake format: Provides ACID transactions, time travel, and optimization features
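To make the UDF guidance concrete, the sketch below contrasts a Python UDF with the equivalent built-in function and a higher-order array function; the table and column names are illustrative:

```python
from pyspark.sql import functions as F

# Illustrative table with a string `name` column and an array `scores` column.
df = spark.table("events")

# Slow: a Python UDF serializes every row between the JVM and the Python
# worker, and is opaque to the Catalyst optimizer (and to Photon).
to_upper_udf = F.udf(lambda s: s.upper() if s else None)
slow = df.withColumn("name_upper", to_upper_udf("name"))

# Fast: the equivalent built-in function runs entirely inside the engine.
fast = df.withColumn("name_upper", F.upper("name"))

# Higher-order function on an array column, instead of exploding or a UDF.
bumped = df.withColumn("scores_plus_one", F.transform("scores", lambda x: x + 1))
```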
9. Predictive Optimization (Unity Catalog)
- What it does: Automatically identifies and runs maintenance operations (such as OPTIMIZE and VACUUM) on tables
- Benefits:
  - Eliminates manual maintenance scheduling
  - Optimizes based on actual query patterns
  - Typically provides significant performance improvements
- Enable: At the account, catalog, or schema level in Unity Catalog settings
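A minimal sketch of enabling it in SQL; the catalog and schema names are illustrative, and the statements require the appropriate Unity Catalog privileges:

```python
# Minimal sketch: enable predictive optimization at the catalog or schema
# level. Names are illustrative; requires Unity Catalog privileges.
spark.sql("ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION")
spark.sql("ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION")
```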
10. Regular Maintenance
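A maintenance routine typically chains the operations called out in the checklist that follows; a minimal sketch of a job that could run on a schedule, where the table name and retention window are illustrative assumptions:

```python
# Minimal sketch of a scheduled maintenance routine for a Delta table.
# Table name and retention window are illustrative.
table = "sales"
spark.sql(f"OPTIMIZE {table}")                          # compact / recluster files
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")           # remove files older than 7 days
spark.sql(f"ANALYZE TABLE {table} COMPUTE STATISTICS")  # refresh stats for the CBO
```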
Performance Optimization Checklist
- Cluster Setup
  - Use the latest Databricks Runtime LTS
  - Enable Photon for analytical workloads
  - Configure appropriate autoscaling
  - Select the right instance types for the workload
- Data Layout
  - Use the Delta Lake format
  - Implement Liquid Clustering for new tables
  - Or use Z-Ordering for existing tables
  - Maintain optimal file sizes (16MB-1GB)
- Query Design
  - Apply filters early
  - Use native functions over UDFs
  - Leverage broadcast joins for small tables
  - Keep table statistics updated
- Caching
  - Enable the disk cache for frequently accessed data
  - Use Spark cache for iterative processing
  - Monitor cache hit rates
- Maintenance
  - Schedule regular OPTIMIZE jobs
  - Run VACUUM to clean up old files
  - Update statistics with ANALYZE TABLE
  - Enable Predictive Optimization if using Unity Catalog
Monitoring and Troubleshooting
Expected Performance Improvements
Results vary by workload, but when these techniques are implemented properly, commonly reported outcomes include:
- 300-800% performance improvements through intelligent cluster sizing, data skew prevention, and query optimization
- 40-60% reduction in compute costs through proper resource allocation and autoscaling
- 2-10x faster queries with Photon engine for analytical workloads
- Significant I/O reduction through Z-Ordering and Liquid Clustering