Databricks Asset Bundles are the platform's first-party answer to "how do I treat my workspace like infrastructure?" A bundle is a directory containing a databricks.yml manifest and the source files it references — notebooks, Python wheels, DLT pipeline definitions, ML experiments, dashboards. The Databricks CLI reads the manifest, resolves variables and target overrides, and deploys the whole package to a workspace as a coherent unit. The same bundle definition produces dev, staging, and prod environments with only target-level differences.
Bundles replace the older "deploy notebooks with a Bash script and configure jobs in the UI" workflow with declarative IaC that is checked into source control, code-reviewed, and applied through CI/CD.
A bundle is a directory tree like this:
my_bundle/
├── databricks.yml # the manifest
├── README.md
├── src/
│ ├── ingest_orders.py
│ ├── transform_silver.py
│ └── dlt_pipeline.py
├── notebooks/
│ └── exploration.ipynb
├── resources/
│ ├── orders_job.yml # split out for readability
│ └── orders_pipeline.yml
└── tests/
└── test_transform_silver.py
Anything you can configure in the workspace UI — jobs, DLT pipelines, ML model registrations, model serving endpoints, AI/BI dashboards, SQL warehouses, secret scopes — can be declared in YAML and shipped via a bundle. Workspaces become deterministic and disposable.
databricks.yml Structure
The manifest has four top-level blocks: bundle, workspace, resources, and targets. A minimal manifest looks like this:
bundle:
name: orders_pipeline
workspace:
host: https://adb-1234567890123456.7.azuredatabricks.net
variables:
catalog:
description: Unity Catalog target catalog
default: dev_orders
notifications_email:
default: data-eng@example.com
resources:
jobs:
orders_etl:
name: orders_etl_${bundle.target}
tasks:
- task_key: ingest
notebook_task:
notebook_path: ./src/ingest_orders.py
job_cluster_key: shared_cluster
- task_key: transform
depends_on:
- task_key: ingest
notebook_task:
notebook_path: ./src/transform_silver.py
job_cluster_key: shared_cluster
job_clusters:
- job_cluster_key: shared_cluster
new_cluster:
spark_version: 15.4.x-scala2.12
node_type_id: Standard_D4ds_v5
num_workers: 2
email_notifications:
on_failure: [${var.notifications_email}]
targets:
dev:
mode: development
default: true
workspace:
root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev
variables:
catalog: dev_orders
prod:
mode: production
workspace:
root_path: /Shared/.bundle/${bundle.name}/prod
run_as:
service_principal_name: 0123abcd-...-....
variables:
catalog: prod_orders
notifications_email: oncall@example.com
Key concepts:
- bundle — identity (name, optional git stamp). The CLI uses this to namespace deployed assets.
- workspace — default workspace host and the root path under which the CLI uploads source files.
- variables — typed inputs you can override per target or on the CLI with --var key=value (see the one-liner after this list). Substitute as ${var.name}.
- resources — the actual stuff being deployed: jobs, pipelines, models, model_serving_endpoints, experiments, schemas, volumes, dashboards, etc.
- targets — environments. Each target can override workspace, variables, run-as identity, and any resource field.
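A variable override at deploy time looks like this; the catalog value is just an illustrative scratch name:
# Override a bundle variable for a single deploy without touching any YAML
databricks bundle deploy --target dev --var="catalog=scratch_orders"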
The mode field on a target is the most consequential setting. It changes how the CLI rewrites resources before deploy:
- mode: development — prefixes resource names with [dev your.user], pauses schedules, and sets run_as to the deploying user. Designed for safe iteration in a personal workspace path.
- mode: production — deploys with the configured names, schedules are active, requires a service principal as run_as, and refuses to deploy from a personal user folder. The CLI also warns if you deploy from a dirty git tree.
This makes it easy to run databricks bundle deploy from a laptop into a personal sandbox while the same bundle goes through CI for production.
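If the built-in rewrites need tuning, newer CLI versions accept a presets block on a target (run databricks bundle schema to confirm what your CLI supports); the prefix and limits below are purely illustrative:
targets:
  dev:
    mode: development
    presets:
      name_prefix: "[dev data-eng] "     # replaces the default [dev your.user] prefix
      trigger_pause_status: PAUSED       # keep schedules paused while iterating
      jobs_max_concurrent_runs: 1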
The job resource was already shown in the manifest above; the pattern that matters is declaring job_clusters once and referencing them via job_cluster_key from each task so that all tasks share warm compute. Other resource types follow the same shape; the next three blocks declare a DLT pipeline, a model serving endpoint, and an MLflow experiment.
resources:
pipelines:
orders_dlt:
name: orders_dlt_${bundle.target}
catalog: ${var.catalog}
target: silver
libraries:
- notebook:
path: ./src/dlt_pipeline.py
configuration:
bronze.path: /Volumes/${var.catalog}/raw/orders
clusters:
- label: default
autoscale:
min_workers: 1
max_workers: 4
development: false
photon: true
# serverless: true  # mutually exclusive with the explicit clusters block above; pick one or the other
resources:
model_serving_endpoints:
fraud_scorer:
name: fraud-scorer-${bundle.target}
config:
served_entities:
- entity_name: ${var.catalog}.ml.fraud_model
entity_version: "7"
workload_size: Small
scale_to_zero_enabled: true
traffic_config:
routes:
- served_model_name: fraud_model-7
traffic_percentage: 100
auto_capture_config:
catalog_name: ${var.catalog}
schema_name: inference_logs
table_name_prefix: fraud_scorer
resources:
experiments:
fraud_training:
name: /Shared/experiments/fraud_training_${bundle.target}
tags:
- key: owner
value: ml-platform
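Unity Catalog objects can ride along in the same way; a sketch of a schema and a volume that would back the inference_logs schema and the /Volumes/${var.catalog}/raw/orders path referenced above (names otherwise assumed):
resources:
  schemas:
    inference_logs:
      catalog_name: ${var.catalog}
      name: inference_logs
      comment: Captures inference tables from the fraud scorer endpoint
  volumes:
    raw_orders:
      catalog_name: ${var.catalog}
      schema_name: raw
      name: orders
      volume_type: MANAGED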
The CLI ships as databricks. Bundle commands are under databricks bundle.
# Validate the manifest, resolve variables, show what will be deployed
databricks bundle validate --target dev
# Upload sources and apply resources to the dev target
databricks bundle deploy --target dev
# Trigger a job by its resource key (not its display name)
databricks bundle run orders_etl --target dev
# Start a run without waiting, then look up recent runs via the Jobs API
databricks bundle run orders_etl --target dev --no-wait
databricks jobs list-runs --job-id $(databricks bundle summary --target dev -o json | jq -r '.resources.jobs.orders_etl.id')
# Promote to production
databricks bundle deploy --target prod
# Tear down everything the bundle created in a target
databricks bundle destroy --target dev
The standard pattern is: validate on every PR, deploy to staging on merge to main, deploy to prod on a tagged release. Authenticate with OAuth machine-to-machine (M2M) service principal credentials (client ID and secret) stored as GitHub secrets.
name: deploy-bundle
on:
pull_request:
branches: [main]
push:
branches: [main]
tags: ['v*']
jobs:
validate:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: databricks/setup-cli@main
- run: databricks bundle validate --target staging
env:
DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
deploy-staging:
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: databricks/setup-cli@main
- run: databricks bundle deploy --target staging
env:
DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
deploy-prod:
if: startsWith(github.ref, 'refs/tags/v')
runs-on: ubuntu-latest
environment: production # require manual approval
steps:
- uses: actions/checkout@v4
- uses: databricks/setup-cli@main
- run: databricks bundle deploy --target prod
env:
DATABRICKS_HOST: ${{ secrets.DATABRICKS_PROD_HOST }}
DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_PROD_CLIENT_ID }}
DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_PROD_CLIENT_SECRET }}
The environment: production line ties the prod job to a GitHub Environment, which can require reviewer approval before the step runs — a cheap and effective production guardrail.
For new projects, bootstrap with databricks bundle init, which scaffolds a bundle from a template. Built-in templates exist for default Python jobs, DLT pipelines, MLOps stacks, and dbt projects, and you can author custom templates as Go templates driven by a databricks_template_schema.json file that defines the template's input variables.
# Use a built-in template
databricks bundle init default-python
# Use a custom template from a git repo
databricks bundle init https://github.com/example/my-dab-template
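A custom template repo pairs a databricks_template_schema.json (the prompts shown during init) with Go-templated files; the layout below follows the convention of the built-in templates, and project_name is an example variable:
my-dab-template/
├── databricks_template_schema.json   # input variables prompted for at init time
└── template/
    └── {{.project_name}}/
        ├── databricks.yml.tmpl
        └── src/
            └── main.py.tmpl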
The Terraform Databricks provider has been the IaC tool for Databricks for years. It is broader in scope — it manages workspace creation, networking, accounts, and other control-plane resources. Asset Bundles cover only what lives inside a workspace, but they cover it more idiomatically.
| Concern | Asset Bundles | Terraform Databricks Provider |
|---|---|---|
| Scope | In-workspace assets (jobs, pipelines, dashboards, model serving). | Workspaces, accounts, networking, plus all in-workspace assets. |
| State management | State stored in a workspace folder; no external backend to operate. | Standard Terraform state — you operate S3/Azure Blob/Cloud Storage backends. |
| Source bundling | Native — the CLI uploads notebooks/wheels alongside the resources. | Manual — you script the artifact upload yourself. |
| Dev experience | mode: development auto-namespaces and pauses schedules. | You build the dev/prod separation yourself with workspaces or modules. |
The pragmatic split: Terraform owns the platform (workspace creation, metastore, account-level groups). Bundles own the application (this team's jobs, pipelines, ML models, dashboards). The two compose well — Terraform stands up the workspace, then each application team ships into it with their own bundle and CI/CD.
DABs package the code, configuration, and resource definitions for a Databricks project (jobs, DLT pipelines, ML experiments, dashboards) into a declarative YAML definition, rooted in a databricks.yml manifest, that can be version-controlled, reviewed, and deployed via the CLI. Before bundles, projects were assembled by clicking through the UI or piecing together databricks-cli calls in shell scripts; there was no single source of truth and no way to diff a deployment. DABs make a Databricks project look and behave like a normal application repository.
Terraform is for platform infrastructure: workspaces, metastores, account-level groups, network configuration — things that exist once per environment and rarely change. DABs are for application artifacts: the jobs, pipelines, and notebooks that a data team ships every sprint. Terraform's resource graph is too heavy for daily app deploys, and DABs do not model account-level objects. The clean split is Terraform owns the workspace, DABs own what runs inside it.
Targets (dev, staging, prod) are named overrides that let one bundle deploy to multiple environments with different cluster sizes, storage paths, schedule cadences, and run-as identities. The base configuration sits at the top level; each target overrides only the fields it needs to change, and you select one with databricks bundle deploy -t prod. The pattern keeps the application code identical across environments while letting cost and access policies vary — exactly the same idea as Helm values files.
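As a concrete sketch against the orders_etl job from earlier (the staging catalog and sizing are illustrative), a target only restates the fields it changes:
targets:
  staging:
    variables:
      catalog: staging_orders
    resources:
      jobs:
        orders_etl:
          job_clusters:
            - job_cluster_key: shared_cluster
              new_cluster:
                spark_version: 15.4.x-scala2.12
                node_type_id: Standard_D4ds_v5
                num_workers: 1   # smaller cluster than prod; everything else is inherited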
The standard pipeline: on PR, run databricks bundle validate and unit tests against a dev target; on merge to main, deploy to staging with databricks bundle deploy -t staging and run integration tests via databricks bundle run; on tagged release, deploy to prod. Authenticate using OAuth machine-to-machine credentials (service principals) stored as GitHub Actions secrets — never personal access tokens. Pin the databricks-cli version in the workflow so a CLI release does not silently change deployment behavior.
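A sketch of the staging leg with the CLI pinned and an integration test folded in; the version tag is a placeholder for whichever release you have validated:
deploy-staging:
  runs-on: ubuntu-latest
  env:
    DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
    DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
    DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
  steps:
    - uses: actions/checkout@v4
    - uses: databricks/setup-cli@v0.230.0   # pinned release instead of @main
    - run: databricks bundle deploy --target staging
    - run: databricks bundle run orders_etl --target staging   # integration test: a failed run fails the workflow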
Never put secrets in bundle YAML — even with variable substitution they end up in plaintext in the workspace's deployed artifact. The right pattern is to reference Databricks secret scopes from job configuration ({{secrets/scope/key}}) and have the secrets themselves provisioned out-of-band by Terraform or a dedicated secret-sync job pulling from AWS Secrets Manager / Azure Key Vault. The bundle ships the reference; the platform provides the value.
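For example, a job cluster can surface a secret as an environment variable while the bundle only ever ships the reference; the scope and key names here are hypothetical:
resources:
  jobs:
    orders_etl:
      job_clusters:
        - job_cluster_key: shared_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_D4ds_v5
            num_workers: 2
            spark_env_vars:
              ORDERS_API_TOKEN: "{{secrets/orders_scope/api_token}}"   # resolved by the workspace at cluster start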
Build the shared code as a Python wheel published to a private package index (Artifactory, AWS CodeArtifact, or a Unity Catalog volume), then reference it as a library dependency in each bundle's job clusters. Avoid the temptation to copy a notebook into multiple repos — drift is inevitable. For very early-stage shared code, a UC volume holding the latest wheel works fine; graduate to a versioned package index once more than two teams depend on it.
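On the consuming side, each job task declares the wheel as a library; the volume path and version below are illustrative:
tasks:
  - task_key: transform
    job_cluster_key: shared_cluster
    notebook_task:
      notebook_path: ./src/transform_silver.py
    libraries:
      - whl: /Volumes/${var.catalog}/packages/wheels/shared_utils-1.4.0-py3-none-any.whl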