Retrieval-Augmented Generation (RAG) Using Snowflake Cortex

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Instead of relying solely on a standalone large language model (LLM), RAG first retrieves relevant data from a knowledge base and provides that context to the LLM. This approach makes responses more accurate and grounded in up-to-date information without requiring model retraining.

Why Use Snowflake Cortex for RAG?

Cortex keeps the entire RAG pipeline inside Snowflake: retrieval, generation, and governance run where the data already lives, so there is no external vector database or embedding infrastructure to stand up and secure.

Key Components of Snowflake Cortex for RAG

1. Cortex Search Service

A fully managed hybrid (vector plus keyword) search service that powers the retrieval component of RAG applications, handling embedding, indexing, and incremental refresh automatically.

2. Cortex LLM Functions

SQL and Python functions that provide access to industry-leading LLMs for the generation component.

RAG Architecture with Snowflake Cortex

  1. Data Ingestion: Store your documents in Snowflake tables
  2. Chunking: Split documents into smaller, contextually rich blocks (recommended: 512 tokens or fewer)
  3. Indexing: Create a Cortex Search Service that automatically embeds and indexes your text
  4. Retrieval: Query the search service to find relevant document chunks
  5. Augmentation: Combine retrieved context with the user query
  6. Generation: Pass the augmented prompt to an LLM via the COMPLETE function
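The six steps above can be sketched end to end in plain Python. A naive keyword retriever stands in for Cortex Search and the prompt assembly stands in for the COMPLETE call; every function name here is illustrative, not a Snowflake API.

```python
# Toy sketch of the six-step RAG flow; all functions are illustrative.

def chunk(doc: str, max_words: int = 50) -> list[str]:
    # Step 2: split a document into small blocks
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Step 4: rank chunks by naive keyword overlap with the query
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    # Step 5: combine retrieved context with the user query
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

docs = ["The router supports dual-band WiFi.",
        "Billing disputes are handled within 5 days."]
chunks = [c for d in docs for c in chunk(d)]          # Steps 1-2
prompt = augment("how long do billing disputes take?",
                 retrieve("billing disputes", chunks))
# Step 6 would pass `prompt` to an LLM, e.g. SNOWFLAKE.CORTEX.COMPLETE
```

In the Snowflake version below, steps 3-4 collapse into one managed service and step 6 is a single SQL function call.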

Implementation Steps

Step 1: Set Up Database and Warehouse

-- Create database and warehouse
CREATE DATABASE IF NOT EXISTS cortex_search_db;

CREATE OR REPLACE WAREHOUSE cortex_search_wh WITH
   WAREHOUSE_SIZE='X-SMALL';

CREATE OR REPLACE SCHEMA cortex_search_db.services;

Step 2: Create and Populate Your Knowledge Base Table

-- Create table for documents
CREATE OR REPLACE TABLE support_transcripts (
    transcript_text VARCHAR,
    region VARCHAR,
    agent_id VARCHAR
);

-- Insert sample data
INSERT INTO support_transcripts VALUES
    ('My internet has been down since yesterday, can you help?', 'North America', 'AG1001'),
    ('I was overcharged for my last bill, need an explanation.', 'Europe', 'AG1002'),
    ('How do I reset my password? The email link is not working.', 'Asia', 'AG1003'),
    ('I received a faulty router, can I get it replaced?', 'North America', 'AG1004');

Step 3: Create the Cortex Search Service

-- Create Cortex Search Service for RAG retrieval
CREATE OR REPLACE CORTEX SEARCH SERVICE transcript_search_service
  ON transcript_text
  ATTRIBUTES region
  WAREHOUSE = cortex_search_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
  AS (
    SELECT
        transcript_text,
        region,
        agent_id
    FROM support_transcripts
);

Step 4: Query the Search Service (Retrieval)

-- Preview search results
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      'cortex_search_db.services.transcript_search_service',
      '{
        "query": "internet issues",
        "columns":["transcript_text", "region"],
        "filter": {"@eq": {"region": "North America"} },
        "limit": 5
      }'
  )
)['results'] as results;
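SEARCH_PREVIEW returns its results as a JSON string, so downstream code typically parses it and pulls out the matched rows. A minimal sketch of that parsing in Python, using a hand-written sample payload shaped like the response above (not a live call):

```python
import json

# Sample payload shaped like a SEARCH_PREVIEW response: the requested
# columns come back as keys on each object under "results".
raw = """{
  "results": [
    {"transcript_text": "My internet has been down since yesterday, can you help?",
     "region": "North America"}
  ]
}"""

payload = json.loads(raw)
chunks = [row["transcript_text"] for row in payload["results"]]
regions = {row["region"] for row in payload["results"]}
```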

Step 5: Generate Response Using LLM (Generation)

-- Use COMPLETE function for RAG response generation.
-- Note: retrieved_context and user_question are placeholders; in practice
-- they come from the search service results and the application.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'Based on the following context, answer the user question.\n\n',
        'Context: ', retrieved_context, '\n\n',
        'Question: ', user_question, '\n\n',
        'Answer:'
    )
) AS response;
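When the prompt is assembled in application code rather than SQL, the same template can live in a small helper. This mirrors the CONCAT above; the function name is ours, not part of any API:

```python
def build_rag_prompt(retrieved_context: str, user_question: str) -> str:
    # Same layout as the SQL CONCAT: context first, then the question,
    # ending with an explicit "Answer:" cue for the model.
    return (
        "Based on the following context, answer the user question.\n\n"
        f"Context: {retrieved_context}\n\n"
        f"Question: {user_question}\n\n"
        "Answer:"
    )

prompt = build_rag_prompt("Customer was overcharged in March.", "Why is my bill high?")
```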

Complete RAG Query Example

-- Full RAG pipeline combining retrieval and generation
WITH retrieved_docs AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'cortex_search_db.services.transcript_search_service',
            '{
                "query": "billing overcharge",
                "columns": ["transcript_text", "region"],
                "limit": 3
            }'
        )
    )['results'] AS results
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'You are a helpful customer support assistant. ',
        'Based on the following support transcripts, provide a helpful response.\n\n',
        'Relevant transcripts: ', results::VARCHAR, '\n\n',
        'Customer question: How do I dispute a billing charge?\n\n',
        'Response:'
    )
) AS rag_response
FROM retrieved_docs;

Python Implementation with Snowpark

from snowflake.core import Root
from snowflake.snowpark import Session
from snowflake.cortex import Complete

# Create session
CONNECTION_PARAMETERS = {
    "account": "your_account",
    "user": "your_user",
    "password": "your_password",
    "warehouse": "cortex_search_wh",
    "database": "cortex_search_db",
    "schema": "services"
}

session = Session.builder.configs(CONNECTION_PARAMETERS).create()
root = Root(session)

# Get the search service
search_service = (root
    .databases["cortex_search_db"]
    .schemas["services"]
    .cortex_search_services["transcript_search_service"]
)

# Retrieval: Search for relevant documents
user_query = "internet connection problems"
search_response = search_service.search(
    query=user_query,
    columns=["transcript_text", "region"],
    filter={"@eq": {"region": "North America"}},
    limit=3
)

# Get retrieved context
retrieved_context = search_response.to_json()

# Generation: Use LLM with retrieved context
prompt = f"""Based on the following customer support transcripts, 
provide a helpful response to the user's question.

Retrieved transcripts: {retrieved_context}

User question: {user_query}

Response:"""

response = Complete(
    model="mistral-large",
    prompt=prompt,
    session=session
)

print(response)
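The retrieval and generation calls above compose naturally into one reusable function. The sketch below injects the two calls as plain callables so the flow can be exercised without a live session; in real use search_fn would wrap search_service.search and complete_fn would wrap Complete.

```python
def rag_answer(question, search_fn, complete_fn, limit=3):
    hits = search_fn(question, limit)                         # retrieval
    context = "\n".join(h["transcript_text"] for h in hits)   # augmentation
    prompt = (f"Retrieved transcripts:\n{context}\n\n"
              f"User question: {question}\n\nResponse:")
    return complete_fn(prompt)                                # generation

# Stub callables standing in for Cortex Search and COMPLETE:
fake_search = lambda q, k: [{"transcript_text": "I was overcharged for my last bill."}]
fake_complete = lambda p: "STUB:" + p
answer = rag_answer("Why is my bill high?", fake_search, fake_complete)
```

Injecting the callables also makes the prompt-assembly logic unit-testable, which matters once prompts carry citations or per-tenant instructions.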

Available Embedding Models for Cortex Search

Cortex Search supports several embedding models; the service above uses snowflake-arctic-embed-l-v2.0, and snowflake-arctic-embed-m, e5-base-v2, and nv-embed-qa-4 are other common choices. Availability varies by region, so check the current Snowflake documentation.

Available LLMs for Generation

The COMPLETE function exposes a range of models; the examples here use mistral-large. The supported list changes over time and varies by region, so consult the current Cortex documentation before committing to one.

Best Practices

For Optimal Retrieval Quality

  1. Keep chunks contextually rich but small (512 tokens or fewer, per the chunking step above)
  2. Index the metadata columns you need for filtering (e.g., region) and use filters to narrow retrieval
  3. Retrieve a handful of high-quality chunks (limit 3-5) rather than flooding the prompt

For Better Generation

  1. Put the retrieved context before the question and end the prompt with an explicit cue such as "Answer:"
  2. Give the model a role ("You are a helpful customer support assistant") and instruct it to answer only from the provided context
  3. Number the context passages and ask for bracketed citations when traceability matters

Cost Considerations

  1. Size the warehouse conservatively (X-SMALL suffices for the examples here) and tune TARGET_LAG to the freshest data you actually need
  2. Track spend via the ACCOUNT_USAGE views for Cortex Search serving and Cortex function usage

Key Benefits Summary

  1. No External Dependencies: Everything runs within Snowflake
  2. Enterprise Security: Data never leaves Snowflake's governance boundary
  3. Automatic Updates: Search index refreshes automatically when data changes
  4. Hybrid Search: Combines semantic and keyword search for best results
  5. Scalability: Supports up to 100M rows per search service
  6. SQL-First Approach: Build RAG applications using familiar SQL syntax


Common Interview Questions

What is Cortex Search and how does it differ from rolling your own embeddings pipeline?

Cortex Search is a managed hybrid (vector + keyword) retrieval service inside Snowflake. You point it at a column in a table, pick an embedding model, and Snowflake handles chunking, embedding, indexing, and incremental refresh as the underlying table changes. Querying is a single SQL function call. A DIY pipeline using EMBED_TEXT_768() plus a manual VECTOR column gives you more control (custom chunkers, multiple embedding models, custom retrieval logic) but you own the refresh, the indexing strategy, and the BM25 fusion. Cortex Search wins on operational simplicity; DIY wins on flexibility.

How do you handle chunking inside Snowflake?

Two patterns: chunk at ingest in SQL using SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER (which takes chunk size and overlap arguments), or pre-chunk in SQL/Snowpark yourself and store one row per chunk. Pre-chunking is preferable when you have layout-aware splitting requirements, such as never splitting inside a code block, keeping table rows together, or using semantic chunking from a Python UDF. Store doc_id, chunk_index, chunk_text, and any metadata needed for filtering in one table; Cortex Search indexes the text column and returns matching rows with metadata for citation.
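A minimal sketch of the pre-chunking pattern: fixed-size word windows with overlap, emitting one row per chunk with the doc_id/chunk_index/chunk_text columns. Window sizes are illustrative; a production splitter would be token-aware and layout-aware.

```python
def chunk_document(doc_id: str, text: str, size: int = 100, overlap: int = 20):
    # Slide a window of `size` words forward by `size - overlap` each step,
    # so consecutive chunks share `overlap` words of context.
    words = text.split()
    step = size - overlap
    rows = []
    for i, start in enumerate(range(0, max(len(words) - overlap, 1), step)):
        rows.append({
            "doc_id": doc_id,
            "chunk_index": i,
            "chunk_text": " ".join(words[start:start + size]),
        })
    return rows

rows = chunk_document("doc-1", "word " * 250, size=100, overlap=20)
```

Each dict maps directly onto a row of the chunk table, ready to be written with Snowpark and indexed by the search service.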

How do you produce citations from a Cortex RAG pipeline?

Retrieval returns the chunk text plus the metadata columns you indexed (doc_id, source URL, page number, last_modified). Pass these to CORTEX.COMPLETE() in the prompt as a numbered list, and instruct the model to emit answers with bracketed citations like [1] referring to the list. In the application, render each citation as a link back to the source. For higher fidelity, return the chunk offsets so the UI can highlight the exact span. The whole pipeline stays auditable — you can replay any answer back to the exact rows that produced it.
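The numbered-list pattern can be sketched as two small helpers: one formats retrieved rows as a cited context block, the other maps bracketed citations in the model's answer back to source metadata. Field names (doc_id, page, chunk_text) are illustrative.

```python
import re

def format_context(rows):
    # "[1] (kb-17, p.4) chunk text ..." per retrieved row.
    return "\n".join(
        f"[{i}] ({r['doc_id']}, p.{r['page']}) {r['chunk_text']}"
        for i, r in enumerate(rows, start=1)
    )

def cited_sources(answer, rows):
    # Resolve "[2]"-style citations in the answer back to the rows.
    ids = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    return [rows[i - 1]["doc_id"] for i in ids if 0 < i <= len(rows)]

rows = [{"doc_id": "kb-17", "page": 4, "chunk_text": "Disputes are resolved in 5 days."},
        {"doc_id": "kb-30", "page": 1, "chunk_text": "Refunds post within 2 cycles."}]
context = format_context(rows)
sources = cited_sources("Disputes take 5 days [1]; refunds follow [2].", rows)
```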

How do you enforce per-user access in a Snowflake RAG app?

Use Snowflake's native row access policies on the source table — the same policy that gates raw row access also gates which chunks a user can retrieve, because Cortex Search honors the caller's role. Document-level ACLs become a column (allowed_roles ARRAY) plus a policy that filters rows where the current role is not in the array. This is a major advantage over external vector stores: governance lives with the data, not in app code, so a misconfigured app cannot leak documents the user shouldn't see.

What embedding model should you pick in Cortex?

Snowflake exposes several through EMBED_TEXT_768 and EMBED_TEXT_1024 — common choices include snowflake-arctic-embed-m, e5-base-v2, and nv-embed-qa-4. For most English RAG, start with snowflake-arctic-embed-m — strong MTEB performance and tightly integrated. For multilingual corpora pick a multilingual variant. Run a small evaluation on your domain (recall@5 on a labeled query/relevant-chunk set) before committing — the right model is workload-dependent and the differences are not subtle.
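The suggested evaluation can be as small as a recall@k loop over labeled queries. The retriever below is a toy stand-in for a Cortex Search call; the harness shape is the point.

```python
def recall_at_k(retriever, labeled_queries, k=5):
    # Fraction of queries where at least one known-relevant chunk id
    # appears in the retriever's top-k results.
    hits = 0
    for query, relevant_ids in labeled_queries:
        top = retriever(query)[:k]
        hits += any(cid in top for cid in relevant_ids)
    return hits / len(labeled_queries)

# Toy keyword index standing in for a search service:
index = {"billing": ["c2", "c9"], "internet": ["c1"]}
retriever = lambda q: [cid for word in q.split() for cid in index.get(word, [])]

labeled = [("billing overcharge", {"c2"}),
           ("internet down", {"c1"}),
           ("router swap", {"c7"})]
score = recall_at_k(retriever, labeled, k=5)
```

Running the same harness with each candidate embedding model plugged into the retriever gives a like-for-like comparison on your own corpus.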

How would you scale and monitor Cortex RAG to a million-document corpus?

Partition the corpus into multiple Cortex Search services along a natural axis (region, business unit, document type) so each service stays under the per-service row limit and refreshes independently. Use a thin Streamlit-in-Snowflake or external app that fans out queries to the relevant services and merges results. Monitor via SNOWFLAKE.ACCOUNT_USAGE.CORTEX_SEARCH_SERVING_USAGE_HISTORY and CORTEX_FUNCTIONS_USAGE_HISTORY for cost; track recall offline with a labeled eval set; alarm on refresh-job lag (last refreshed_on timestamp older than your SLA). For freshness-critical content, drive refresh from a stream + task on the source table rather than the default schedule.
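The fan-out-and-merge step can be sketched as below: each per-partition service is a callable returning scored rows, and results merge by score. The service names and the row shape with "text" and "score" keys are illustrative; real relevance scores from separate services are only roughly comparable, so production code often re-ranks the merged set instead of trusting raw scores.

```python
def fan_out_search(query, services, limit=3):
    # Query every partitioned service, tag rows with their origin,
    # then keep the `limit` highest-scoring rows overall.
    merged = []
    for name, search_fn in services.items():
        for row in search_fn(query):
            merged.append({**row, "service": name})
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged[:limit]

services = {
    "emea_docs": lambda q: [{"text": "EMEA refund policy", "score": 0.91}],
    "amer_docs": lambda q: [{"text": "AMER refund policy", "score": 0.84},
                            {"text": "AMER router guide", "score": 0.30}],
}
top = fan_out_search("refund policy", services, limit=2)
```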