Amazon Textract is a document-AI service that extracts printed text, handwriting, forms (key-value pairs), tables, and signatures from scanned documents and PDFs. Unlike simple OCR, Textract preserves document structure, making it a natural upstream step for document-processing pipelines and generative-AI ingestion.
import boto3, time

textract = boto3.client("textract", region_name="us-west-2")

# Start an asynchronous analysis job on a multi-page PDF in S3
job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-docs", "Name": "invoices/inv-0421.pdf"}},
    FeatureTypes=["FORMS", "TABLES"],
)
job_id = job["JobId"]

# Poll until the job finishes
while True:
    status = textract.get_document_analysis(JobId=job_id)
    if status["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(2)

# Collect all blocks, following NextToken across result pages
blocks = status.get("Blocks", [])
next_token = status.get("NextToken")
while next_token:
    page = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
    blocks.extend(page["Blocks"])
    next_token = page.get("NextToken")

# Walk the response and emit key-value pairs
blocks_by_id = {b["Id"]: b for b in blocks}
for block in blocks:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        # A KEY block has a CHILD relationship to the WORD blocks of its label
        # and a VALUE relationship to its paired VALUE block
        key_text = " ".join(
            blocks_by_id[cid]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
            if blocks_by_id[cid]["BlockType"] == "WORD"
        )
        print(key_text, "->", block.get("Confidence"))
Textract is structure-aware: it returns key-value pairs, table cells, layout regions, and signatures alongside raw text — not just words and bounding boxes. That removes most of the post-processing logic you'd otherwise build on top of plain OCR.
Use Forms when the document has labeled fields and you want every key/value pair. Use Queries when you want specific facts ("What is the policy number?", "Who is the borrower?") regardless of where they appear in the document — fewer API calls than parsing the whole forms response.
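The Queries flow can be sketched like this: the request adds `FeatureTypes=["QUERIES"]` plus a `QueriesConfig`, and each answer comes back as a QUERY block linked to a QUERY_RESULT block through an ANSWER relationship. The helper below only parses a response's block list, so the sample blocks (and the alias name) are illustrative, not real API output:

```python
# Request shape (sketch):
#   FeatureTypes=["QUERIES"],
#   QueriesConfig={"Queries": [{"Text": "What is the policy number?",
#                               "Alias": "policy_number"}]}

def extract_query_answers(blocks):
    """Map each query's Alias (or question text) to its answer text."""
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        key = block["Query"].get("Alias") or block["Query"]["Text"]
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                answers[key] = " ".join(by_id[i]["Text"] for i in rel["Ids"])
    return answers

# Synthetic two-block response for illustration
sample = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the policy number?", "Alias": "policy_number"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "POL-88231",
     "Confidence": 98.2},
]
print(extract_query_answers(sample))  # {'policy_number': 'POL-88231'}
```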
Use the async API: StartDocumentAnalysis with the PDF in S3, then poll GetDocumentAnalysis (or subscribe to the SNS completion topic). Sync APIs only handle single-page documents.
Use Textract (or its built-in Bedrock Knowledge Base parser integration) to convert scanned PDFs into structured text with layout. Chunk by Layout blocks (paragraphs, headers), embed with Titan or Cohere embeddings, store in OpenSearch/Aurora pgvector, and serve via Bedrock Knowledge Bases.
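Chunking by Layout blocks can be sketched as follows: with `FeatureTypes=["LAYOUT"]`, each LAYOUT_* block carries a CHILD relationship to the LINE blocks it contains, so a chunk is just the joined text of those lines. The retained block types and the sample blocks are illustrative choices, not API requirements:

```python
# Layout block types worth keeping as retrieval chunks; page headers,
# footers, and page numbers are usually noise for RAG.
KEEP = {"LAYOUT_TITLE", "LAYOUT_SECTION_HEADER", "LAYOUT_TEXT", "LAYOUT_LIST"}

def layout_chunks(blocks):
    """Yield one text chunk per retained LAYOUT_* block."""
    by_id = {b["Id"]: b for b in blocks}
    for block in blocks:
        if block["BlockType"] not in KEEP:
            continue
        lines = [
            by_id[cid]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
            if by_id[cid]["BlockType"] == "LINE"
        ]
        if lines:
            yield "\n".join(lines)

# Synthetic blocks for illustration
sample = [
    {"Id": "p", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
    {"Id": "l1", "BlockType": "LINE", "Text": "Coverage begins on the"},
    {"Id": "l2", "BlockType": "LINE", "Text": "effective date."},
]
print(list(layout_chunks(sample)))
```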
AnalyzeExpense returns semantic fields the model already understands (vendor, total, due date, line items) without you mapping arbitrary keys. Forms+Tables returns generic key-value pairs you'd then have to normalize. AnalyzeExpense is the fast path for receipts and invoices.
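The AnalyzeExpense response nests those semantic fields under ExpenseDocuments → SummaryFields, each with a Type and a ValueDetection. A minimal flattener, tested here against a synthetic response fragment (the vendor and amount are made up):

```python
def expense_summary(response):
    """Flatten AnalyzeExpense SummaryFields into {field_type: value}."""
    fields = {}
    for doc in response.get("ExpenseDocuments", []):
        for f in doc.get("SummaryFields", []):
            ftype = f.get("Type", {}).get("Text")
            value = f.get("ValueDetection", {}).get("Text")
            if ftype and value:
                fields[ftype] = value
    return fields

# Synthetic response fragment for illustration
sample = {"ExpenseDocuments": [{"SummaryFields": [
    {"Type": {"Text": "VENDOR_NAME"},
     "ValueDetection": {"Text": "Acme Supply Co."}},
    {"Type": {"Text": "TOTAL"},
     "ValueDetection": {"Text": "$1,042.17"}},
]}]}
print(expense_summary(sample))
```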
Sample first to verify which features are actually needed (often DetectDocumentText alone suffices), enable only required FeatureTypes, batch-process in async, and consider preprocessing (downsizing scans, removing blank pages) before submission.
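For the sampling step, DetectDocumentText output reduces to joining LINE blocks. The parser below works on the response dict alone; the commented call shows the request shape (bucket and file names are illustrative):

```python
def plain_text(response):
    """Join LINE blocks from a DetectDocumentText response into text."""
    return "\n".join(
        b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"
    )

# The real call (requires boto3 and credentials); shown for shape only:
#   textract = boto3.client("textract")
#   resp = textract.detect_document_text(
#       Document={"S3Object": {"Bucket": "my-docs", "Name": "sample-page.png"}})
#   print(plain_text(resp))

# Synthetic response for illustration
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Invoice #0421"},
    {"BlockType": "WORD", "Text": "Invoice"},
    {"BlockType": "LINE", "Text": "Total due: $1,042.17"},
]}
print(plain_text(sample))
```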
Textract pairs naturally with Bedrock: use Textract to convert documents into structured text, then hand the output to a foundation model for reasoning, summarization, or question answering.
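A minimal handoff sketch: wrap the extracted text and a question into one prompt, then send it through the Bedrock Converse API. The prompt template is an illustrative choice, the model ID is an assumption, and the call itself is commented out since it needs credentials and model access:

```python
def build_prompt(document_text, question):
    """Wrap Textract output and a user question into a single prompt."""
    return (
        "You are given text extracted from a scanned document.\n\n"
        f"<document>\n{document_text}\n</document>\n\n"
        f"Question: {question}\n"
        "Answer using only the document."
    )

prompt = build_prompt("Total due: $1,042.17", "What is the total due?")
print(prompt)

# Handoff via Bedrock Converse (requires boto3, credentials, model access):
#   bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
#   reply = bedrock.converse(
#       modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative
#       messages=[{"role": "user", "content": [{"text": prompt}]}],
#   )
#   print(reply["output"]["message"]["content"][0]["text"])
```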