Amazon Redshift Architecture
Amazon Redshift is a fully managed, massively parallel processing (MPP)
cloud data warehouse. It is designed for analytical workloads such as
reporting, dashboards, and large-scale SQL queries over structured data.
This tutorial explains the main components of Redshift architecture and
how they work together.
High-Level Architecture Overview
The following components make up the high-level Amazon Redshift architecture.
- Client Applications – BI tools, SQL editors, and notebooks connect to Redshift over JDBC, ODBC, or the Data API.
- Redshift Cluster – A collection of nodes (one leader node and one or more compute nodes) that execute queries and store data.
- Managed Storage and Data Lake – RA3 nodes use managed storage that can transparently extend into Amazon S3, and Redshift Spectrum lets you query data directly in S3.
- Surrounding AWS Services – Services such as S3, Glue, Lambda, and Kinesis integrate for ingestion, cataloging, ETL, and analytics.
Cluster Components
1. Leader Node
- Entry point – All client connections terminate at the leader node.
- SQL parser and optimizer – Parses incoming SQL, builds execution plans, and decides how to distribute work across compute nodes.
- Coordinator – Sends query fragments to compute nodes and aggregates the intermediate results.
- Catalog store – Holds metadata about databases, schemas, tables, and permissions.
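Because the catalog lives on the leader node, table metadata can be inspected with ordinary SQL. The sketch below reads SVV_TABLE_INFO, a standard Redshift system view; the sales schema name is a hypothetical placeholder.

    -- Inspect table metadata held in the leader node's catalog.
    -- "sales" is a hypothetical schema name used for illustration.
    SELECT "schema",
           "table",
           diststyle,   -- distribution style chosen for the table
           sortkey1,    -- first sort key column, if any
           size         -- table size in 1 MB blocks
    FROM   svv_table_info
    WHERE  "schema" = 'sales'
    ORDER  BY size DESC;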
2. Compute Nodes
- Parallel workers – Each node runs multiple slices (worker processes). Slices operate on different portions of the data in parallel.
- Local storage / cache – Data blocks are stored in columnar format. RA3 nodes cache frequently used blocks on fast SSD while automatically managing older blocks in S3.
- Distribution styles – Data is distributed across nodes using one of three styles (see the DDL sketch after this list):
  - KEY – rows with the same key go to the same node
  - EVEN – rows distributed round-robin
  - ALL – small dimension tables fully replicated
- Columnar storage and compression – Only relevant columns are read during queries, and compression reduces I/O and cost.
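As a minimal DDL sketch of these choices, the hypothetical tables below pick one distribution style each and add a compression encoding; all names and column types are illustrative only.

    -- Fact table: co-locate rows sharing a customer_id on the same node.
    CREATE TABLE sales_fact (
        sale_id     BIGINT,
        customer_id INT ENCODE az64,   -- numeric compression encoding
        sale_date   DATE,
        amount      DECIMAL(10,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (sale_date);

    -- Staging table with no natural key: spread rows round-robin.
    CREATE TABLE sales_staging (
        sale_id BIGINT,
        payload VARCHAR(512)
    )
    DISTSTYLE EVEN;

    -- Small dimension table: replicate a full copy to every node.
    CREATE TABLE date_dim (
        date_key       DATE,
        fiscal_quarter SMALLINT
    )
    DISTSTYLE ALL;

Choosing KEY on a frequently joined column avoids network redistribution at query time, while ALL trades a little storage for join locality on small dimensions.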
Storage and Data Lake Integration
1. Managed Storage (RA3)
- Compute and storage decoupled – You can scale compute capacity (number of nodes) independently of how much data you store.
- Automatic tiering – Hot data stays in the SSD cache; cold data is transparently moved to Amazon S3.
- No application changes – From the user's perspective, the data behaves like local disk.
2. Redshift Spectrum: Querying Data in S3
Redshift Spectrum allows you to run queries against structured data stored
directly in Amazon S3, without loading it into local Redshift tables.
- External tables – Defined in an external data catalog such as AWS Glue. They reference data files (Parquet, ORC, CSV, JSON) in S3.
- Hybrid queries – A single SQL query can join local Redshift tables with external tables in S3 (see the example after this list).
- Pushdown of filters – Projection and filtering are pushed down to the Spectrum layer to minimize data scanned from S3.
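A minimal Spectrum sketch, assuming a hypothetical Glue database named sales_lake, a placeholder IAM role ARN, and Parquet files under a made-up S3 prefix; the sales_fact table reuses the earlier DDL sketch.

    -- Register an external schema backed by the AWS Glue Data Catalog.
    CREATE EXTERNAL SCHEMA spectrum_sales
    FROM DATA CATALOG
    DATABASE 'sales_lake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- External table over Parquet files in S3; nothing is loaded into Redshift.
    CREATE EXTERNAL TABLE spectrum_sales.clickstream (
        event_time TIMESTAMP,
        user_id    INT,
        url        VARCHAR(2048)
    )
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/clickstream/';

    -- Hybrid query: join a local Redshift table with external S3 data.
    SELECT f.customer_id, COUNT(*) AS clicks
    FROM   sales_fact f
    JOIN   spectrum_sales.clickstream c
           ON c.user_id = f.customer_id
    WHERE  c.event_time >= '2024-01-01'  -- predicate pushed down to Spectrum
    GROUP  BY f.customer_id;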
Data Sharing Between Redshift Clusters
Amazon Redshift enables secure data sharing across clusters within the
same AWS account or across accounts. This is useful for sharing curated
data with different business units, teams, or workloads without copying
data.
- Producer cluster – Owns the physical data and defines datashares that expose specific schemas, tables, and views (see the sketch after this list).
- Consumer clusters – Attach the datashares as standard database objects and query them as if they were local.
- No data duplication – Data stays in the producer cluster's storage; consumers access it in place.
- Fine-grained permissions – Producers control which objects are shared and who can access them.
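A minimal data-sharing sketch; the schema, table, and namespace GUIDs are hypothetical placeholders.

    -- On the producer cluster: create a datashare and expose objects.
    CREATE DATASHARE sales_share;
    ALTER DATASHARE sales_share ADD SCHEMA sales;
    ALTER DATASHARE sales_share ADD TABLE sales.sales_fact;
    GRANT USAGE ON DATASHARE sales_share
        TO NAMESPACE 'consumer-namespace-guid';

    -- On a consumer cluster: surface the share as a local database.
    CREATE DATABASE sales_shared
        FROM DATASHARE sales_share
        OF NAMESPACE 'producer-namespace-guid';

    -- Consumers query the shared objects in place; no copy is made.
    SELECT COUNT(*) FROM sales_shared.sales.sales_fact;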
Workload Management and Scaling
1. Workload Management (WLM)
- Queues – Queries are routed into WLM queues, which define memory allocation and concurrency for different workloads (see the query after this list).
- Automatic WLM – Redshift can automatically manage memory and concurrency, reducing the need for manual tuning.
- Query monitoring rules – Rules can log, hop, or abort queries based on thresholds (runtime, rows scanned, etc.).
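To see queues in action, the sketch below reads STV_WLM_QUERY_STATE, a Redshift system view that reports queries currently tracked by WLM; times in this view are in microseconds.

    -- Observe queries currently queued or running under WLM.
    SELECT query,
           service_class,   -- the WLM queue the query was routed to
           state,           -- e.g. Queued or Running
           queue_time / 1000000.0 AS queue_seconds,
           exec_time  / 1000000.0 AS exec_seconds
    FROM   stv_wlm_query_state
    ORDER  BY queue_time DESC;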
2. Concurrency Scaling and Elastic Resize
- Concurrency Scaling – Redshift automatically adds transient clusters to absorb short-lived spikes in query load (see the query after this list).
- Elastic resize – Allows you to change the number or type of nodes for a cluster to handle larger or smaller workloads over time.
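Whether a query was absorbed by a concurrency scaling cluster can be checked after the fact. The sketch below uses the concurrency_scaling_status column of the STL_QUERY system table; per the AWS documentation, a value of 1 indicates the query ran on a concurrency scaling cluster.

    -- Which queries from the last hour ran on concurrency scaling clusters?
    SELECT query,
           starttime,
           concurrency_scaling_status  -- 1 = ran on a concurrency scaling cluster
    FROM   stl_query
    WHERE  starttime > DATEADD(hour, -1, GETDATE())
    ORDER  BY starttime DESC;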
Security and Networking
1. Network Isolation
- Amazon VPC – Clusters run inside a virtual private cloud; security groups and network ACLs control inbound and outbound access.
- Private connectivity – You can use VPN, Direct Connect, or VPC peering to securely connect on-premises systems or other VPCs.
2. Encryption and Access Control
- Encryption at rest – Data can be encrypted using AWS KMS keys or hardware security modules.
- Encryption in transit – Connections from clients can be protected with SSL/TLS.
- IAM integration – IAM roles grant Redshift permission to read and write data in S3 for COPY and UNLOAD operations.
- Database privileges – Users and groups are granted privileges on schemas, tables, and views, with support for row-level and column-level security (see the sketch after this list).
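A brief access-control sketch; the user, role, and region column are hypothetical, and the row-level portion follows Redshift's CREATE RLS POLICY / ATTACH RLS POLICY statements.

    -- Column-level security: expose only non-sensitive columns.
    GRANT SELECT (sale_id, sale_date, amount)
        ON sales_fact TO reporting_user;

    -- Row-level security: limit a role to one region.
    -- Assumes sales_fact has a region column and the role already exists.
    CREATE RLS POLICY emea_only
    WITH (region VARCHAR(16))
    USING (region = 'EMEA');

    ATTACH RLS POLICY emea_only ON sales_fact TO ROLE emea_analyst;
    ALTER TABLE sales_fact ROW LEVEL SECURITY ON;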
Typical Query Lifecycle
1. A client application sends a SQL query to the Redshift endpoint.
2. The leader node parses, optimizes, and generates a distributed execution plan (see the EXPLAIN example after these steps).
3. The leader node dispatches work to slices on the compute nodes.
4. Compute nodes scan columnar data from local or managed storage and optionally read external data from S3 via Spectrum.
5. Compute nodes aggregate intermediate results and send them back to the leader node.
6. The leader node performs the final aggregation or sorting and returns the result set to the client.
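EXPLAIN shows the distributed plan the leader node produces, as in the hypothetical query below (reusing the earlier sketch tables). Join operators such as DS_DIST_ALL_NONE in the output indicate that no redistribution is needed, here because date_dim uses DISTSTYLE ALL.

    -- Ask the leader node for the distributed execution plan.
    EXPLAIN
    SELECT d.fiscal_quarter,
           SUM(f.amount) AS revenue
    FROM   sales_fact f
    JOIN   date_dim d ON d.date_key = f.sale_date
    GROUP  BY d.fiscal_quarter;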
Putting It All Together
In practice, a Redshift-based analytics platform typically includes:
- Data sources – Operational databases, event streams, and flat files.
- Ingestion and ETL/ELT – Data is moved into S3 and then into Redshift using COPY, Glue jobs, Lambda, or third-party ETL tools.
- Curated warehouse – Dimensional models and fact tables stored in Redshift and shared with downstream teams using data sharing.
- Consumption layer – Dashboards, ad-hoc analytics, and machine-learning workloads running on top of Redshift and S3.
Use the sections above together as a tutorial for understanding how Amazon Redshift is structured and how data flows through the system.
Data Ingestion Pipeline
The Redshift ingestion pipeline represents the flow of data from operational
systems and external data sources into Amazon Redshift. This process typically
involves extraction from sources, transformation into optimized formats, and
loading into Amazon S3 before final ingestion into Redshift.
- Data Sources – Operational databases, application logs,
third-party APIs, event streams, and files.
- ETL / ELT Layer – AWS Glue, Lambda, EMR, or third-party
tools perform transformations and cleaning.
- Amazon S3 – The primary ingestion landing zone used for
staging raw and transformed data.
- Amazon Redshift – Final destination for structured
analytics workloads. Data is typically loaded using the COPY command for
maximum throughput.
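A minimal COPY sketch, with a hypothetical bucket, prefix, and IAM role ARN:

    -- Bulk-load staged Parquet files from S3 into Redshift.
    COPY sales_fact
    FROM 's3://my-ingest-bucket/curated/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
    FORMAT AS PARQUET;

COPY splits the file set across slices, which is why many moderately sized files generally load faster than a single large file.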
End-to-End Architecture
The end-to-end Redshift architecture covers the complete lifecycle of data – from initial collection, through ingestion and transformation, into Amazon Redshift, and finally to consumption by analytics and BI tools.
- Data Sources – Databases, applications, IoT, streaming
services, and file-based inputs.
- Ingestion Pipeline – Data flows into S3 or streams
through Kinesis before entering Redshift.
- AWS Glue – Optional transformation stage to clean,
normalize, or enrich data before SQL analytics.
- Amazon Redshift – Central data warehouse storing curated
datasets, fact tables, and dimensional models.
- Analytics & Business Intelligence – Tools such as
QuickSight, Tableau, Looker, and Jupyter notebooks read from Redshift
for dashboards, reporting, and ML workloads.
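For consumers that read files rather than SQL connections, curated results can also be exported back to the lake. The UNLOAD sketch below uses hypothetical paths and a placeholder IAM role.

    -- Export a curated result set to S3 in Parquet for downstream ML jobs.
    UNLOAD ('SELECT customer_id, SUM(amount) AS lifetime_value
             FROM sales_fact
             GROUP BY customer_id')
    TO 's3://my-analytics-bucket/exports/lifetime_value_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
    FORMAT AS PARQUET;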