Unity Catalog Setup

Prerequisites

Before you begin, you need admin access to the Databricks account, a workspace on a pricing tier that supports Unity Catalog (Premium or above), a cloud storage bucket or container to serve as the metastore root, and a cloud identity (IAM role, managed identity, or service account) that Unity Catalog can assume to access that storage.

Step 1: Create a Metastore

The metastore is the cloud-backed central repository for all Unity Catalog metadata.

A metastore is not created with SQL. Create it in the Databricks account console (accounts.cloud.databricks.com for AWS; Azure and GCP have their own console URLs), via the Account REST API, or with Terraform, supplying a name, a region, and optionally a root storage path for managed data (for example, s3://my-bucket/metastore-root). An account can have at most one metastore per region.

Step 2: Assign Metastore to Workspace

A workspace can use only one metastore. All workspaces assigned to the same metastore share catalogs and permissions.

-- Via account console or API
-- Associate the workspace with the metastore

Once assigned, run a test query to verify:

SELECT current_metastore();

Step 3: Create Catalogs and Schemas

Catalogs are top-level containers. Schemas organize tables within catalogs.

-- Create a catalog (requires the CREATE CATALOG privilege on the
-- metastore; metastore admins have it by default)
CREATE CATALOG finance_prod;

-- Create a schema within the catalog
CREATE SCHEMA finance_prod.transactions;

-- Create a table in the schema (3-level path)
CREATE TABLE finance_prod.transactions.orders (
  order_id INT,
  customer_id INT,
  order_date TIMESTAMP,
  amount DECIMAL(10, 2)
);
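
To confirm the hierarchy, the standard SHOW and DESCRIBE commands work at each level:

-- Verify the objects created above
SHOW SCHEMAS IN finance_prod;
SHOW TABLES IN finance_prod.transactions;
DESCRIBE TABLE finance_prod.transactions.orders;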

Step 4: Set Up External Locations and Storage Credentials

Storage credentials and external locations are required before you can create external tables or volumes.

-- Storage credentials hold the cloud identity that Unity Catalog
-- assumes to reach your storage: an IAM role on AWS, a managed
-- identity on Azure, a service account on GCP. On AWS they are
-- created in Catalog Explorer, via the REST API, or with Terraform
-- (not SQL), referencing the role ARN, e.g.
-- arn:aws:iam::123456789012:role/UnityCatalogRole.

-- Create an external location tied to a storage credential
CREATE EXTERNAL LOCATION s3_data_lake
  URL 's3://my-data-bucket/external-data'
  WITH (STORAGE CREDENTIAL s3_credentials);
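
Once the external location exists, external tables and volumes can be created under its URL. A sketch using the names above (the sub-paths are illustrative):

-- External table whose files live under the governed location
CREATE TABLE finance_prod.transactions.raw_orders (
  order_id INT,
  amount DECIMAL(10, 2)
)
LOCATION 's3://my-data-bucket/external-data/raw_orders';

-- External volume for non-tabular files under the same location
CREATE EXTERNAL VOLUME finance_prod.transactions.landing
  LOCATION 's3://my-data-bucket/external-data/landing';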

Step 5: Grant Permissions

Control who can access catalogs, schemas, and tables using the principle of least privilege.

-- Grant USE CATALOG to a group (enables browsing the catalog)
GRANT USE CATALOG ON CATALOG finance_prod TO `data-team@company.com`;

-- Grant USE SCHEMA to access a specific schema
GRANT USE SCHEMA ON SCHEMA finance_prod.transactions TO `data-team@company.com`;

-- Grant SELECT to read table data
GRANT SELECT ON TABLE finance_prod.transactions.orders TO `data-team@company.com`;

-- Grant MODIFY for write access
GRANT MODIFY ON TABLE finance_prod.transactions.orders TO `data-team@company.com`;

-- Grant ALL PRIVILEGES (full control, use sparingly)
GRANT ALL PRIVILEGES ON CATALOG finance_prod TO `admin-group@company.com`;
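
Revoking follows the same shape as granting:

-- Remove write access while keeping read access
REVOKE MODIFY ON TABLE finance_prod.transactions.orders FROM `data-team@company.com`;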

Step 6: View and Audit Permissions

-- Show all grants on a catalog
SHOW GRANTS ON CATALOG finance_prod;

-- Show all grants on a specific schema
SHOW GRANTS ON SCHEMA finance_prod.transactions;

-- Show grants for a specific principal
SHOW GRANTS `data-team@company.com` ON SCHEMA finance_prod.transactions;

-- View who has access to a table
SHOW GRANTS ON TABLE finance_prod.transactions.orders;
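
For programmatic audits, each catalog also exposes grants through information_schema views; a sketch, assuming the standard table_privileges columns:

-- List table-level grants across the schema
SELECT grantee, privilege_type, table_name
FROM finance_prod.information_schema.table_privileges
WHERE table_schema = 'transactions';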

Best Practices

Grant to groups synced from your IdP, never to individual users. Apply least privilege: prefer targeted USE CATALOG, USE SCHEMA, and SELECT grants over ALL PRIVILEGES. Draw environment and domain boundaries at the catalog level (finance_prod, marketing_dev). Default to managed tables, reserving external tables for data that non-Databricks systems must read. Audit grants regularly with SHOW GRANTS.

Common Interview Questions:

What is the three-level namespace in Unity Catalog and why does it matter?

Unity Catalog organizes data as catalog.schema.table instead of the legacy two-level schema.table (Hive metastore) or database.table. The catalog level is where you typically draw the environment or business-domain boundary — finance_prod, marketing_dev — and grants flow down the hierarchy. The benefit is one workspace can host many isolated data products with their own access policies, instead of needing one workspace per environment.
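
In practice the namespace also drives name resolution: fully qualified names work from anywhere, or you can set session defaults so short names resolve:

-- Fully qualified, works regardless of session state
SELECT * FROM finance_prod.transactions.orders;

-- Set defaults so short names resolve
USE CATALOG finance_prod;
USE SCHEMA transactions;
SELECT * FROM orders;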

What is a Unity Catalog metastore and how does it relate to a workspace?

The metastore is an account-level object that holds catalog/schema/table metadata and grants — it lives above any individual workspace. A single metastore is attached to one or more workspaces in the same region, so multiple workspaces share the same data namespace and access policies. This is the architectural shift from the legacy Hive metastore, which was per-workspace; with UC, identity and data permissions are centralized at the account.
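
A quick way to see this from a notebook is to check what the session resolves to:

-- Metastore ID the workspace is attached to, plus session defaults
SELECT current_metastore(), current_catalog(), current_schema();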

What is an external location and when do you use one?

An external location is a Unity Catalog object that pairs a cloud storage URI (e.g., s3://bucket/path/) with a storage credential (an IAM role or service principal). It governs read/write access to that path — UC checks READ FILES/WRITE FILES grants before issuing temporary credentials. Use external locations for data that lives in your own buckets and you want to keep control of (regulatory data, data shared with non-Databricks systems); use managed tables when UC can own the storage layout entirely.
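
The path-level grants mentioned above look like this:

-- Allow a group to read and write files under the location's URL
GRANT READ FILES ON EXTERNAL LOCATION s3_data_lake TO `data-team@company.com`;
GRANT WRITE FILES ON EXTERNAL LOCATION s3_data_lake TO `data-team@company.com`;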

What is the difference between a managed table and an external table in UC?

Managed tables: UC owns the storage location (a metastore-level managed root) and the file lifecycle — DROP TABLE deletes the data. External tables: you specify a path under an external location, UC tracks the metadata but does not delete files on drop. Default to managed tables for everything UC originates; use external tables only when the data must be readable by non-Databricks consumers, or when migrating from an existing data lake without moving files.
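
A minimal contrast, reusing names from the earlier steps (the paths are illustrative):

-- Managed: no LOCATION clause; UC places files under the managed root
CREATE TABLE finance_prod.transactions.daily_totals (d DATE, total DECIMAL(12, 2));

-- External: LOCATION under a governed external location;
-- DROP TABLE leaves these files in place
CREATE TABLE finance_prod.transactions.legacy_orders (order_id INT)
LOCATION 's3://my-data-bucket/external-data/legacy_orders';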

How does the Unity Catalog permission model work?

Permissions are granted to principals (users, groups, service principals) on securables (catalog, schema, table, view, function, volume, external location, model). To query a table you need USE CATALOG on its catalog, USE SCHEMA on its schema, and SELECT on the table itself — the chain enforces visibility at every level. Row filters and column masks attach to tables/views as functions for fine-grained control. Always grant to groups, never individuals; UC integrates with SCIM-provisioned identities from your IdP.
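
A sketch of the row-filter mechanism described above; the function name, group, and region column are illustrative (it assumes orders carries a region column):

-- Members of the admin group see all rows; others see only US rows
CREATE FUNCTION finance_prod.transactions.us_only(region STRING)
RETURNS BOOLEAN
RETURN is_account_group_member('admin-group') OR region = 'US';

ALTER TABLE finance_prod.transactions.orders
  SET ROW FILTER finance_prod.transactions.us_only ON (region);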

How do you migrate from the legacy Hive metastore to Unity Catalog?

Use UCX (the Databricks-provided migration toolkit) to inventory the legacy hive_metastore catalog, generate an assessment report, and plan moves. The mechanical steps are: create the UC metastore and a target catalog, set up storage credentials and external locations covering the source paths, then for each table either run SYNC for in-place upgrade (external tables) or CREATE TABLE ... DEEP CLONE for a copy. Update jobs and notebooks to use three-level names. Run both metastores in parallel during cutover; retire hive_metastore only after every consumer is migrated.
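
The two table-level paths, sketched (the source names are illustrative):

-- External table: upgrade metadata in place, files stay put
-- (DRY RUN previews the result without applying it)
SYNC TABLE finance_prod.transactions.orders
  FROM hive_metastore.sales.orders DRY RUN;

-- Managed target: copy the data with a deep clone
CREATE TABLE finance_prod.transactions.orders
  DEEP CLONE hive_metastore.sales.orders;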