Amazon Textract is a document-AI service that extracts printed text, handwriting, forms (key-value pairs), tables, and signatures from scanned documents and PDFs. Unlike simple OCR, Textract preserves document structure, making it a natural upstream step for document-processing pipelines and generative-AI ingestion.
import boto3, time

textract = boto3.client("textract", region_name="us-west-2")

# Start an asynchronous analysis job on a multi-page PDF in S3
job = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-docs", "Name": "invoices/inv-0421.pdf"}},
    FeatureTypes=["FORMS", "TABLES"],
)
job_id = job["JobId"]

# Poll until the job finishes
while True:
    status = textract.get_document_analysis(JobId=job_id)
    if status["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(2)

# Collect all blocks, following NextToken across result pages
blocks = status.get("Blocks", [])
next_token = status.get("NextToken")
while next_token:
    page = textract.get_document_analysis(JobId=job_id, NextToken=next_token)
    blocks.extend(page["Blocks"])
    next_token = page.get("NextToken")

# Walk the response and emit key-value pairs
blocks_by_id = {b["Id"]: b for b in blocks}
for block in blocks:
    if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
        # A KEY block has a CHILD relationship to the WORD blocks of its label
        # and a VALUE relationship to its paired VALUE block
        key_text = " ".join(
            blocks_by_id[cid]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
            if blocks_by_id[cid]["BlockType"] == "WORD"
        )
        print(key_text, "->", block.get("Confidence"))
Textract is structure-aware: it returns key-value pairs, table cells, layout regions, and signatures alongside raw text — not just words and bounding boxes. That removes most of the post-processing logic you'd otherwise build on top of plain OCR.
Use Forms when the document has labeled fields and you want every key/value pair. Use Queries when you want specific facts ("What is the policy number?", "Who is the borrower?") regardless of where they appear in the document — fewer API calls than parsing the whole forms response.
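The Queries flow can be sketched like this: the request adds `FeatureTypes=["QUERIES"]` plus a `QueriesConfig`, and each answer comes back as a QUERY block linked to a QUERY_RESULT block through an ANSWER relationship. The helper below only parses a response's block list, so the sample blocks (and the alias name) are illustrative, not real API output:

```python
# Request shape (sketch):
#   FeatureTypes=["QUERIES"],
#   QueriesConfig={"Queries": [{"Text": "What is the policy number?",
#                               "Alias": "policy_number"}]}

def extract_query_answers(blocks):
    """Map each query's Alias (or question text) to its answer text."""
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        key = block["Query"].get("Alias") or block["Query"]["Text"]
        for rel in block.get("Relationships", []):
            if rel["Type"] == "ANSWER":
                answers[key] = " ".join(by_id[i]["Text"] for i in rel["Ids"])
    return answers

# Synthetic two-block response for illustration
sample = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the policy number?", "Alias": "policy_number"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["a1"]}]},
    {"Id": "a1", "BlockType": "QUERY_RESULT", "Text": "POL-88231",
     "Confidence": 98.2},
]
print(extract_query_answers(sample))  # {'policy_number': 'POL-88231'}
```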
Use the async API: StartDocumentAnalysis with the PDF in S3, then poll GetDocumentAnalysis (or subscribe to the SNS completion topic). Sync APIs only handle single-page documents.
Use Textract (or its built-in Bedrock Knowledge Base parser integration) to convert scanned PDFs into structured text with layout. Chunk by Layout blocks (paragraphs, headers), embed with Titan or Cohere embeddings, store in OpenSearch/Aurora pgvector, and serve via Bedrock Knowledge Bases.
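Chunking by Layout blocks can be sketched as follows: with `FeatureTypes=["LAYOUT"]`, each LAYOUT_* block carries a CHILD relationship to the LINE blocks it contains, so a chunk is just the joined text of those lines. The retained block types and the sample blocks are illustrative choices, not API requirements:

```python
# Layout block types worth keeping as retrieval chunks; page headers,
# footers, and page numbers are usually noise for RAG.
KEEP = {"LAYOUT_TITLE", "LAYOUT_SECTION_HEADER", "LAYOUT_TEXT", "LAYOUT_LIST"}

def layout_chunks(blocks):
    """Yield one text chunk per retained LAYOUT_* block."""
    by_id = {b["Id"]: b for b in blocks}
    for block in blocks:
        if block["BlockType"] not in KEEP:
            continue
        lines = [
            by_id[cid]["Text"]
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
            if by_id[cid]["BlockType"] == "LINE"
        ]
        if lines:
            yield "\n".join(lines)

# Synthetic blocks for illustration
sample = [
    {"Id": "p", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
    {"Id": "l1", "BlockType": "LINE", "Text": "Coverage begins on the"},
    {"Id": "l2", "BlockType": "LINE", "Text": "effective date."},
]
print(list(layout_chunks(sample)))
```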
AnalyzeExpense returns semantic fields the model already understands (vendor, total, due date, line items) without you mapping arbitrary keys. Forms+Tables returns generic key-value pairs you'd then have to normalize. AnalyzeExpense is the fast path for receipts and invoices.
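The AnalyzeExpense response nests those semantic fields under ExpenseDocuments → SummaryFields, each with a Type and a ValueDetection. A minimal flattener, tested here against a synthetic response fragment (the vendor and amount are made up):

```python
def expense_summary(response):
    """Flatten AnalyzeExpense SummaryFields into {field_type: value}."""
    fields = {}
    for doc in response.get("ExpenseDocuments", []):
        for f in doc.get("SummaryFields", []):
            ftype = f.get("Type", {}).get("Text")
            value = f.get("ValueDetection", {}).get("Text")
            if ftype and value:
                fields[ftype] = value
    return fields

# Synthetic response fragment for illustration
sample = {"ExpenseDocuments": [{"SummaryFields": [
    {"Type": {"Text": "VENDOR_NAME"},
     "ValueDetection": {"Text": "Acme Supply Co."}},
    {"Type": {"Text": "TOTAL"},
     "ValueDetection": {"Text": "$1,042.17"}},
]}]}
print(expense_summary(sample))
```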
Sample first to verify which features are actually needed (often DetectDocumentText alone suffices), enable only required FeatureTypes, batch-process in async, and consider preprocessing (downsizing scans, removing blank pages) before submission.
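For the sampling step, DetectDocumentText output reduces to joining LINE blocks. The parser below works on the response dict alone; the commented call shows the request shape (bucket and file names are illustrative):

```python
def plain_text(response):
    """Join LINE blocks from a DetectDocumentText response into text."""
    return "\n".join(
        b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"
    )

# The real call (requires boto3 and credentials); shown for shape only:
#   textract = boto3.client("textract")
#   resp = textract.detect_document_text(
#       Document={"S3Object": {"Bucket": "my-docs", "Name": "sample-page.png"}})
#   print(plain_text(resp))

# Synthetic response for illustration
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Invoice #0421"},
    {"BlockType": "WORD", "Text": "Invoice"},
    {"BlockType": "LINE", "Text": "Total due: $1,042.17"},
]}
print(plain_text(sample))
```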
Textract pairs naturally with Bedrock: use Textract to convert documents into structured text, then hand the output to a foundation model for reasoning, summarization, or question answering.
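A minimal handoff sketch: wrap the extracted text and a question into one prompt, then send it through the Bedrock Converse API. The prompt template is an illustrative choice, the model ID is an assumption, and the call itself is commented out since it needs credentials and model access:

```python
def build_prompt(document_text, question):
    """Wrap Textract output and a user question into a single prompt."""
    return (
        "You are given text extracted from a scanned document.\n\n"
        f"<document>\n{document_text}\n</document>\n\n"
        f"Question: {question}\n"
        "Answer using only the document."
    )

prompt = build_prompt("Total due: $1,042.17", "What is the total due?")
print(prompt)

# Handoff via Bedrock Converse (requires boto3, credentials, model access):
#   bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
#   reply = bedrock.converse(
#       modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative
#       messages=[{"role": "user", "content": [{"text": prompt}]}],
#   )
#   print(reply["output"]["message"]["content"][0]["text"])
```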