Retrieval-Augmented Generation (RAG) Using Snowflake Cortex

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation. Instead of relying solely on a standalone large language model (LLM), RAG first retrieves relevant data from a knowledge base and provides that context to the LLM. This approach makes responses more accurate and grounded in up-to-date information without requiring model retraining.

Why Use Snowflake Cortex for RAG?

Cortex keeps the entire RAG pipeline inside Snowflake: retrieval, generation, and governance run where the data already lives, so there is no external vector database or embedding infrastructure to stand up and secure.

Key Components of Snowflake Cortex for RAG

1. Cortex Search Service

A fully managed hybrid (vector plus keyword) search service that powers the retrieval component of RAG applications, handling embedding, indexing, and incremental refresh automatically.

2. Cortex LLM Functions

SQL and Python functions that provide access to industry-leading LLMs for the generation component.

RAG Architecture with Snowflake Cortex

  1. Data Ingestion: Store your documents in Snowflake tables
  2. Chunking: Split documents into smaller, contextually rich blocks (recommended: 512 tokens or fewer)
  3. Indexing: Create a Cortex Search Service that automatically embeds and indexes your text
  4. Retrieval: Query the search service to find relevant document chunks
  5. Augmentation: Combine retrieved context with the user query
  6. Generation: Pass the augmented prompt to an LLM via the COMPLETE function
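The six steps above can be sketched end to end in plain Python. A naive keyword retriever stands in for Cortex Search and the prompt assembly stands in for the COMPLETE call; every function name here is illustrative, not a Snowflake API.

```python
# Toy sketch of the six-step RAG flow; all functions are illustrative.

def chunk(doc: str, max_words: int = 50) -> list[str]:
    # Step 2: split a document into small blocks
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Step 4: rank chunks by naive keyword overlap with the query
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    # Step 5: combine retrieved context with the user query
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"

docs = ["The router supports dual-band WiFi.",
        "Billing disputes are handled within 5 days."]
chunks = [c for d in docs for c in chunk(d)]          # Steps 1-2
prompt = augment("how long do billing disputes take?",
                 retrieve("billing disputes", chunks))
# Step 6 would pass `prompt` to an LLM, e.g. SNOWFLAKE.CORTEX.COMPLETE
```

In the Snowflake version below, steps 3-4 collapse into one managed service and step 6 is a single SQL function call.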

Implementation Steps

Step 1: Set Up Database and Warehouse

-- Create database and warehouse
CREATE DATABASE IF NOT EXISTS cortex_search_db;

CREATE OR REPLACE WAREHOUSE cortex_search_wh WITH
   WAREHOUSE_SIZE='X-SMALL';

CREATE OR REPLACE SCHEMA cortex_search_db.services;

Step 2: Create and Populate Your Knowledge Base Table

-- Create table for documents
CREATE OR REPLACE TABLE support_transcripts (
    transcript_text VARCHAR,
    region VARCHAR,
    agent_id VARCHAR
);

-- Insert sample data
INSERT INTO support_transcripts VALUES
    ('My internet has been down since yesterday, can you help?', 'North America', 'AG1001'),
    ('I was overcharged for my last bill, need an explanation.', 'Europe', 'AG1002'),
    ('How do I reset my password? The email link is not working.', 'Asia', 'AG1003'),
    ('I received a faulty router, can I get it replaced?', 'North America', 'AG1004');

Step 3: Create the Cortex Search Service

-- Create Cortex Search Service for RAG retrieval
CREATE OR REPLACE CORTEX SEARCH SERVICE transcript_search_service
  ON transcript_text
  ATTRIBUTES region
  WAREHOUSE = cortex_search_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
  AS (
    SELECT
        transcript_text,
        region,
        agent_id
    FROM support_transcripts
);

Step 4: Query the Search Service (Retrieval)

-- Preview search results
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      'cortex_search_db.services.transcript_search_service',
      '{
        "query": "internet issues",
        "columns":["transcript_text", "region"],
        "filter": {"@eq": {"region": "North America"} },
        "limit": 5
      }'
  )
)['results'] as results;
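SEARCH_PREVIEW returns its results as a JSON string, so downstream code typically parses it and pulls out the matched rows. A minimal sketch of that parsing in Python, using a hand-written sample payload shaped like the response above (not a live call):

```python
import json

# Sample payload shaped like a SEARCH_PREVIEW response: the requested
# columns come back as keys on each object under "results".
raw = """{
  "results": [
    {"transcript_text": "My internet has been down since yesterday, can you help?",
     "region": "North America"}
  ]
}"""

payload = json.loads(raw)
chunks = [row["transcript_text"] for row in payload["results"]]
regions = {row["region"] for row in payload["results"]}
```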

Step 5: Generate Response Using LLM (Generation)

-- Use COMPLETE function for RAG response generation.
-- Note: retrieved_context and user_question are placeholders; in practice
-- they come from the search service results and the application.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'Based on the following context, answer the user question.\n\n',
        'Context: ', retrieved_context, '\n\n',
        'Question: ', user_question, '\n\n',
        'Answer:'
    )
) AS response;
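When the prompt is assembled in application code rather than SQL, the same template can live in a small helper. This mirrors the CONCAT above; the function name is ours, not part of any API:

```python
def build_rag_prompt(retrieved_context: str, user_question: str) -> str:
    # Same layout as the SQL CONCAT: context first, then the question,
    # ending with an explicit "Answer:" cue for the model.
    return (
        "Based on the following context, answer the user question.\n\n"
        f"Context: {retrieved_context}\n\n"
        f"Question: {user_question}\n\n"
        "Answer:"
    )

prompt = build_rag_prompt("Customer was overcharged in March.", "Why is my bill high?")
```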

Complete RAG Query Example

-- Full RAG pipeline combining retrieval and generation
WITH retrieved_docs AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'cortex_search_db.services.transcript_search_service',
            '{
                "query": "billing overcharge",
                "columns": ["transcript_text", "region"],
                "limit": 3
            }'
        )
    )['results'] AS results
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    CONCAT(
        'You are a helpful customer support assistant. ',
        'Based on the following support transcripts, provide a helpful response.\n\n',
        'Relevant transcripts: ', results::VARCHAR, '\n\n',
        'Customer question: How do I dispute a billing charge?\n\n',
        'Response:'
    )
) AS rag_response
FROM retrieved_docs;

Python Implementation with Snowpark

from snowflake.core import Root
from snowflake.snowpark import Session
from snowflake.cortex import Complete

# Create session
CONNECTION_PARAMETERS = {
    "account": "your_account",
    "user": "your_user",
    "password": "your_password",
    "warehouse": "cortex_search_wh",
    "database": "cortex_search_db",
    "schema": "services"
}

session = Session.builder.configs(CONNECTION_PARAMETERS).create()
root = Root(session)

# Get the search service
search_service = (root
    .databases["cortex_search_db"]
    .schemas["services"]
    .cortex_search_services["transcript_search_service"]
)

# Retrieval: Search for relevant documents
user_query = "internet connection problems"
search_response = search_service.search(
    query=user_query,
    columns=["transcript_text", "region"],
    filter={"@eq": {"region": "North America"}},
    limit=3
)

# Get retrieved context
retrieved_context = search_response.to_json()

# Generation: Use LLM with retrieved context
prompt = f"""Based on the following customer support transcripts, 
provide a helpful response to the user's question.

Retrieved transcripts: {retrieved_context}

User question: {user_query}

Response:"""

response = Complete(
    model="mistral-large",
    prompt=prompt,
    session=session
)

print(response)
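The retrieval and generation calls above compose naturally into one reusable function. The sketch below injects the two calls as plain callables so the flow can be exercised without a live session; in real use search_fn would wrap search_service.search and complete_fn would wrap Complete.

```python
def rag_answer(question, search_fn, complete_fn, limit=3):
    hits = search_fn(question, limit)                         # retrieval
    context = "\n".join(h["transcript_text"] for h in hits)   # augmentation
    prompt = (f"Retrieved transcripts:\n{context}\n\n"
              f"User question: {question}\n\nResponse:")
    return complete_fn(prompt)                                # generation

# Stub callables standing in for Cortex Search and COMPLETE:
fake_search = lambda q, k: [{"transcript_text": "I was overcharged for my last bill."}]
fake_complete = lambda p: "STUB:" + p
answer = rag_answer("Why is my bill high?", fake_search, fake_complete)
```

Injecting the callables also makes the prompt-assembly logic unit-testable, which matters once prompts carry citations or per-tenant instructions.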

Available Embedding Models for Cortex Search

Cortex Search supports several embedding models; the service above uses snowflake-arctic-embed-l-v2.0, and snowflake-arctic-embed-m, e5-base-v2, and nv-embed-qa-4 are other common choices. Availability varies by region, so check the current Snowflake documentation.

Available LLMs for Generation

The COMPLETE function exposes a range of models; the examples here use mistral-large. The supported list changes over time and varies by region, so consult the current Cortex documentation before committing to one.

Best Practices

For Optimal Retrieval Quality

  1. Keep chunks contextually rich but small (512 tokens or fewer, per the chunking step above)
  2. Index the metadata columns you need for filtering (e.g., region) and use filters to narrow retrieval
  3. Retrieve a handful of high-quality chunks (limit 3-5) rather than flooding the prompt

For Better Generation

  1. Put the retrieved context before the question and end the prompt with an explicit cue such as "Answer:"
  2. Give the model a role ("You are a helpful customer support assistant") and instruct it to answer only from the provided context
  3. Number the context passages and ask for bracketed citations when traceability matters

Cost Considerations

  1. Size the warehouse conservatively (X-SMALL suffices for the examples here) and tune TARGET_LAG to the freshest data you actually need
  2. Track spend via the ACCOUNT_USAGE views for Cortex Search serving and Cortex function usage

Key Benefits Summary

  1. No External Dependencies: Everything runs within Snowflake
  2. Enterprise Security: Data never leaves Snowflake's governance boundary
  3. Automatic Updates: Search index refreshes automatically when data changes
  4. Hybrid Search: Combines semantic and keyword search for best results
  5. Scalability: Supports up to 100M rows per search service
  6. SQL-First Approach: Build RAG applications using familiar SQL syntax


Common Interview Questions

What is Cortex Search and how does it differ from rolling your own embeddings pipeline?

Cortex Search is a managed hybrid (vector + keyword) retrieval service inside Snowflake. You point it at a column in a table, pick an embedding model, and Snowflake handles chunking, embedding, indexing, and incremental refresh as the underlying table changes. Querying is a single SQL function call. A DIY pipeline using EMBED_TEXT_768() plus a manual VECTOR column gives you more control (custom chunkers, multiple embedding models, custom retrieval logic) but you own the refresh, the indexing strategy, and the BM25 fusion. Cortex Search wins on operational simplicity; DIY wins on flexibility.

How do you handle chunking inside Snowflake?

Two patterns: chunk at ingest in SQL using SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER (which takes chunk size and overlap arguments), or pre-chunk in SQL/Snowpark yourself and store one row per chunk. Pre-chunking is preferable when you have layout-aware splitting requirements, such as never splitting inside a code block, keeping table rows together, or using semantic chunking from a Python UDF. Store doc_id, chunk_index, chunk_text, and any metadata needed for filtering in one table; Cortex Search indexes the text column and returns matching rows with metadata for citation.
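A minimal sketch of the pre-chunking pattern: fixed-size word windows with overlap, emitting one row per chunk with the doc_id/chunk_index/chunk_text columns. Window sizes are illustrative; a production splitter would be token-aware and layout-aware.

```python
def chunk_document(doc_id: str, text: str, size: int = 100, overlap: int = 20):
    # Slide a window of `size` words forward by `size - overlap` each step,
    # so consecutive chunks share `overlap` words of context.
    words = text.split()
    step = size - overlap
    rows = []
    for i, start in enumerate(range(0, max(len(words) - overlap, 1), step)):
        rows.append({
            "doc_id": doc_id,
            "chunk_index": i,
            "chunk_text": " ".join(words[start:start + size]),
        })
    return rows

rows = chunk_document("doc-1", "word " * 250, size=100, overlap=20)
```

Each dict maps directly onto a row of the chunk table, ready to be written with Snowpark and indexed by the search service.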

How do you produce citations from a Cortex RAG pipeline?

Retrieval returns the chunk text plus the metadata columns you indexed (doc_id, source URL, page number, last_modified). Pass these to CORTEX.COMPLETE() in the prompt as a numbered list, and instruct the model to emit answers with bracketed citations like [1] referring to the list. In the application, render each citation as a link back to the source. For higher fidelity, return the chunk offsets so the UI can highlight the exact span. The whole pipeline stays auditable — you can replay any answer back to the exact rows that produced it.
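The numbered-list pattern can be sketched as two small helpers: one formats retrieved rows as a cited context block, the other maps bracketed citations in the model's answer back to source metadata. Field names (doc_id, page, chunk_text) are illustrative.

```python
import re

def format_context(rows):
    # "[1] (kb-17, p.4) chunk text ..." per retrieved row.
    return "\n".join(
        f"[{i}] ({r['doc_id']}, p.{r['page']}) {r['chunk_text']}"
        for i, r in enumerate(rows, start=1)
    )

def cited_sources(answer, rows):
    # Resolve "[2]"-style citations in the answer back to the rows.
    ids = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    return [rows[i - 1]["doc_id"] for i in ids if 0 < i <= len(rows)]

rows = [{"doc_id": "kb-17", "page": 4, "chunk_text": "Disputes are resolved in 5 days."},
        {"doc_id": "kb-30", "page": 1, "chunk_text": "Refunds post within 2 cycles."}]
context = format_context(rows)
sources = cited_sources("Disputes take 5 days [1]; refunds follow [2].", rows)
```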

How do you enforce per-user access in a Snowflake RAG app?

Use Snowflake's native row access policies on the source table — the same policy that gates raw row access also gates which chunks a user can retrieve, because Cortex Search honors the caller's role. Document-level ACLs become a column (allowed_roles ARRAY) plus a policy that filters rows where the current role is not in the array. This is a major advantage over external vector stores: governance lives with the data, not in app code, so a misconfigured app cannot leak documents the user shouldn't see.

What embedding model should you pick in Cortex?

Snowflake exposes several through EMBED_TEXT_768 and EMBED_TEXT_1024 — common choices include snowflake-arctic-embed-m, e5-base-v2, and nv-embed-qa-4. For most English RAG, start with snowflake-arctic-embed-m — strong MTEB performance and tightly integrated. For multilingual corpora pick a multilingual variant. Run a small evaluation on your domain (recall@5 on a labeled query/relevant-chunk set) before committing — the right model is workload-dependent and the differences are not subtle.
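The suggested evaluation can be as small as a recall@k loop over labeled queries. The retriever below is a toy stand-in for a Cortex Search call; the harness shape is the point.

```python
def recall_at_k(retriever, labeled_queries, k=5):
    # Fraction of queries where at least one known-relevant chunk id
    # appears in the retriever's top-k results.
    hits = 0
    for query, relevant_ids in labeled_queries:
        top = retriever(query)[:k]
        hits += any(cid in top for cid in relevant_ids)
    return hits / len(labeled_queries)

# Toy keyword index standing in for a search service:
index = {"billing": ["c2", "c9"], "internet": ["c1"]}
retriever = lambda q: [cid for word in q.split() for cid in index.get(word, [])]

labeled = [("billing overcharge", {"c2"}),
           ("internet down", {"c1"}),
           ("router swap", {"c7"})]
score = recall_at_k(retriever, labeled, k=5)
```

Running the same harness with each candidate embedding model plugged into the retriever gives a like-for-like comparison on your own corpus.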

How would you scale and monitor Cortex RAG to a million-document corpus?

Partition the corpus into multiple Cortex Search services along a natural axis (region, business unit, document type) so each service stays under the per-service row limit and refreshes independently. Use a thin Streamlit-in-Snowflake or external app that fans out queries to the relevant services and merges results. Monitor via SNOWFLAKE.ACCOUNT_USAGE.CORTEX_SEARCH_SERVING_USAGE_HISTORY and CORTEX_FUNCTIONS_USAGE_HISTORY for cost; track recall offline with a labeled eval set; alarm on refresh-job lag (last refreshed_on timestamp older than your SLA). For freshness-critical content, drive refresh from a stream + task on the source table rather than the default schedule.
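The fan-out-and-merge step can be sketched as below: each per-partition service is a callable returning scored rows, and results merge by score. The service names and the row shape with "text" and "score" keys are illustrative; real relevance scores from separate services are only roughly comparable, so production code often re-ranks the merged set instead of trusting raw scores.

```python
def fan_out_search(query, services, limit=3):
    # Query every partitioned service, tag rows with their origin,
    # then keep the `limit` highest-scoring rows overall.
    merged = []
    for name, search_fn in services.items():
        for row in search_fn(query):
            merged.append({**row, "service": name})
    merged.sort(key=lambda r: r["score"], reverse=True)
    return merged[:limit]

services = {
    "emea_docs": lambda q: [{"text": "EMEA refund policy", "score": 0.91}],
    "amer_docs": lambda q: [{"text": "AMER refund policy", "score": 0.84},
                            {"text": "AMER router guide", "score": 0.30}],
}
top = fan_out_search("refund policy", services, limit=2)
```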