Amazon Macie is a managed sensitive-data discovery service for Amazon S3. It uses ML and pattern matching to find PII, PHI, financial data, and credentials inside S3 objects, then publishes severity-graded findings to Security Hub. Macie also continuously monitors S3 bucket-level configuration (public access, encryption, sharing) so the data-discovery findings come paired with their exposure context.
Macie works in two modes: continuous bucket inventory (always on once enabled, billed per bucket rather than per GB) and on-demand or scheduled object content scanning (charged per GB). The diagram below shows the content-scan path: S3 objects flow into the Macie scan engine, are matched against managed and custom identifiers, and emerge as findings routed downstream.
┌──────────────────────────────────────────────────────────────────────────────┐
│  S3 DATA SOURCES                                                             │
│   ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐   │
│   │ Data Lake Bkt    │      │ App Logs Bkt     │      │ Backups Bkt      │   │
│   │ (Parquet, CSV)   │      │ (JSON, gzip)     │      │ (DB dumps)       │   │
│   └──────────────────┘      └──────────────────┘      └──────────────────┘   │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│  MACIE SCAN ENGINE                                                           │
│   ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐   │
│   │ Managed Data IDs │      │ Custom Regex     │      │ Sample-Based     │   │
│   │ PII / PHI / Cred │      │ Identifiers      │      │ Discovery        │   │
│   └──────────────────┘      └──────────────────┘      └──────────────────┘   │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│  FINDINGS (Sensitive Data)                                                   │
│   ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐   │
│   │ SSN / DOB / CC   │      │ AWS Access Keys  │      │ Health Records   │   │
│   │ Severity: High   │      │ Severity: High   │      │ Severity: High   │   │
│   └──────────────────┘      └──────────────────┘      └──────────────────┘   │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│  DOWNSTREAM ROUTING & ACTION                                                 │
│   ┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐   │
│   │ Security Hub     │      │ EventBridge      │      │ Lambda / Jira    │   │
│   │ ASFF Aggregate   │      │ Severity Routes  │      │ Tag, Quarantine  │   │
│   └──────────────────┘      └──────────────────┘      └──────────────────┘   │
└──────────────────────────────────────────────────────────────────────────────┘
Macie ships with 100+ managed identifiers maintained by AWS. They cover the standard regulated-data categories (PII, PHI, financial data, credentials), and AWS updates them over time to reduce false positives.
For internal patterns that AWS doesn't ship (employee IDs, product license keys, customer numbers), define a custom identifier with a regex, optional keywords, an ignore-words list, and a maximum match distance.
aws macie2 create-custom-data-identifier \
    --name "EmployeeId" \
    --regex "EMP-[0-9]{6}" \
    --keywords "employee" "emp_id" "personnel" \
    --maximum-match-distance 50 \
    --ignore-words "EMP-000000" "EMP-999999"
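A job evaluates a custom identifier only when that identifier is selected for the job; a job can combine custom identifiers with managed ones or use custom identifiers alone. Findings from a custom identifier default to Medium severity unless you define occurrence-based severity levels for it.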
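Before creating an identifier, it can help to check how the regex, keywords, ignore words, and match distance interact on sample text. The following is a minimal local sketch, not Macie's actual matching engine: the function name `count_matches` is hypothetical, and the proximity rule (a keyword must end within `max_distance` characters before the end of the regex match, keywords case-insensitive, ignore words case-sensitive) is an approximation of the documented behavior.

```python
import re


def count_matches(text, regex, keywords=(), ignore_words=(), max_distance=50):
    """Approximate a custom data identifier locally (not Macie's exact algorithm)."""
    lowered = text.lower()
    # End offsets of every keyword occurrence, matched case-insensitively.
    keyword_ends = [
        m.end()
        for kw in keywords
        for m in re.finditer(re.escape(kw.lower()), lowered)
    ]
    count = 0
    for match in re.finditer(regex, text):
        value = match.group(0)
        # Skip matches that contain an ignore word (case-sensitive).
        if any(word in value for word in ignore_words):
            continue
        # When keywords are set, require one to end within max_distance
        # characters before the end of the regex match.
        if keywords and not any(
            0 <= match.end() - end <= max_distance for end in keyword_ends
        ):
            continue
        count += 1
    return count


if __name__ == "__main__":
    sample = "employee badge EMP-123456, placeholder EMP-000000"
    print(count_matches(sample, r"EMP-[0-9]{6}",
                        keywords=["employee", "emp_id", "personnel"],
                        ignore_words=["EMP-000000", "EMP-999999"]))
```

For an authoritative answer, the `macie2` API also offers a `test_custom_data_identifier` operation that runs the same parameters against sample text inside the service.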
Job scope can be filtered by tag, prefix, file type (JSON / CSV / Parquet / Avro / Excel / archives), object size, and last-modified date. Use these filters to skip uninteresting cold storage and concentrate spend on hot data.
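As an illustration of those filters, the sketch below builds the keyword arguments for a one-time `create_classification_job` call that includes only certain file extensions and, optionally, a key prefix. The helper name `build_job_request`, the job name, and the account ID and bucket names in the usage lines are placeholders; the scoping keys (`OBJECT_EXTENSION`, `OBJECT_KEY`) and comparators follow the `macie2` job-scope schema as I understand it, so check them against the API reference before use.

```python
import uuid


def build_job_request(account_id, bucket_names, custom_identifier_ids=(),
                      extensions=("json", "csv", "parquet"),
                      key_prefix=None, sampling_percentage=100):
    """Build keyword arguments for macie2.create_classification_job."""
    # Include only objects whose file name extension is in the list.
    include_terms = [{
        "simpleScopeTerm": {
            "key": "OBJECT_EXTENSION",
            "comparator": "EQ",
            "values": list(extensions),
        }
    }]
    # Optionally restrict the scan to one key prefix.
    if key_prefix:
        include_terms.append({
            "simpleScopeTerm": {
                "key": "OBJECT_KEY",
                "comparator": "STARTS_WITH",
                "values": [key_prefix],
            }
        })
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "jobType": "ONE_TIME",
        "name": f"scoped-scan-{bucket_names[0]}",  # hypothetical naming scheme
        "samplingPercentage": sampling_percentage,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ],
            "scoping": {"includes": {"and": include_terms}},
        },
    }
    if custom_identifier_ids:
        request["customDataIdentifierIds"] = list(custom_identifier_ids)
    return request


# Usage (requires AWS credentials and Macie enabled; values are placeholders):
#   import boto3
#   request = build_job_request("111122223333", ["data-lake"], key_prefix="exports/")
#   boto3.client("macie2").create_classification_job(**request)
```

Building the request as a plain dictionary keeps the scoping logic reviewable and testable without touching an AWS account.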
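Macie produces two finding categories: policy findings (API category POLICY), which flag bucket-level configuration risks such as public access, disabled encryption, or external sharing, and sensitive data findings (API category CLASSIFICATION), which report sensitive content detected inside objects.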
Severity levels: Low (score 1), Medium (score 2), High (score 3). Each finding includes the bucket and object path, the identifier(s) matched, the count of matches, and the location of sampled occurrences; the sensitive values themselves are not included in the finding.
Content scanning is charged per GB scanned; bucket inventory is charged per bucket, independent of data volume. The cost levers for content scanning are object selection and sampling.
Scope jobs by tag (for example, data-classification=customer) so you scan only buckets that should never contain sensitive data, or only buckets where finding sensitive data is the goal.
Approximate pricing as of 2026: ~$1.00 per GB scanned for sensitive-data discovery (first pricing tier) and ~$0.10 per bucket per month for the inventory; verify both against the current AWS pricing page. A 30-day free trial is available.
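The effect of the two cost levers can be estimated with simple arithmetic. The sketch below is a rough planning aid under stated assumptions: `estimate_scan_cost` is a hypothetical helper, the per-GB price is passed in rather than hard-coded (take it from the current pricing page), and sampling is treated as a fraction of bytes even though Macie samples a percentage of eligible objects.

```python
def estimate_scan_cost(total_gb, price_per_gb, sampling_percentage=100,
                       in_scope_fraction=1.0):
    """Rough content-scan cost: GB in scope x sampling share x price per GB."""
    if not 1 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 1 and 100")
    if not 0 <= in_scope_fraction <= 1:
        raise ValueError("in_scope_fraction must be between 0 and 1")
    # Object selection shrinks the data in scope; sampling shrinks it further.
    scanned_gb = total_gb * in_scope_fraction * sampling_percentage / 100
    return scanned_gb * price_per_gb


if __name__ == "__main__":
    # Placeholder numbers: 1,000 GB stored, half in scope, 25% sampling.
    full = estimate_scan_cost(1000, price_per_gb=1.0)
    scoped = estimate_scan_cost(1000, price_per_gb=1.0,
                                sampling_percentage=25, in_scope_fraction=0.5)
    print(f"full scan: {full:.2f}, scoped and sampled: {scoped:.2f}")
```

The estimate ignores tiered pricing and any free allowance, so treat the result as an upper-bound planning figure rather than a bill forecast.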
Macie publishes findings to EventBridge in near real time and to Security Hub in ASFF format. Security Hub aggregation lets a single dashboard show "buckets with public access AND containing PII" by joining Macie sensitive data findings with Macie policy findings (and Config rules for redundancy).
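On the EventBridge side, a rule needs an event pattern that selects the findings worth routing. The sketch below builds one for High-severity sensitive data findings; the helper name `macie_event_pattern` and the rule name in the usage comment are hypothetical, while the `aws.macie` source and `Macie Finding` detail type are the values Macie uses for finding events as far as I know.

```python
import json


def macie_event_pattern(severities=("High",), category="CLASSIFICATION"):
    """Build an EventBridge event pattern for Macie findings."""
    return {
        "source": ["aws.macie"],
        "detail-type": ["Macie Finding"],
        "detail": {
            # Match on the severity label and the finding category.
            "severity": {"description": list(severities)},
            "category": [category],
        },
    }


pattern_json = json.dumps(macie_event_pattern())

# Usage (requires AWS credentials; the rule name is a placeholder):
#   import boto3
#   boto3.client("events").put_rule(Name="macie-high-findings",
#                                   EventPattern=pattern_json)
```

Filtering in the rule rather than in the target keeps low-severity findings from invoking downstream automation at all.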
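from datetime import datetime, timedelta, timezone

import boto3

macie = boto3.client("macie2")

# Macie date criteria take epoch milliseconds; compute "7 days ago" at run time
since_ms = int((datetime.now(timezone.utc) - timedelta(days=7)).timestamp() * 1000)
sort = {"attributeName": "severity.score", "orderBy": "DESC"}

# List High-severity sensitive-data findings updated in the last 7 days
resp = macie.list_findings(
    findingCriteria={
        "criterion": {
            "severity.description": {"eq": ["High"]},
            "category": {"eq": ["CLASSIFICATION"]},
            "updatedAt": {"gte": since_ms},
        }
    },
    maxResults=50,
    sortCriteria=sort,
)

# get_findings accepts a batch of IDs, so fetch all details in one call
finding_ids = resp["findingIds"]
findings = (macie.get_findings(findingIds=finding_ids, sortCriteria=sort)["findings"]
            if finding_ids else [])
for detail in findings:
    resources = detail["resourcesAffected"]
    print(f"{detail['severity']['description']:<6} "
          f"{resources['s3Bucket']['name']}/{resources['s3Object']['key']}")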
Common runbook: a High Macie finding on a public-access bucket triggers an EventBridge rule that (1) flips the bucket's BlockPublicAccess settings to all-true, (2) creates a Jira ticket assigned to the bucket's owning team via tag lookup, (3) Slack-pages the data-protection on-call.
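The runbook above can be sketched as a handler invoked with the EventBridge event. This is an illustration under assumptions, not a production Lambda: `handle_finding`, `owner_team`, the `owner-team` tag key, and the `open_ticket` and `page_oncall` callables (stand-ins for Jira and Slack integrations) are all hypothetical. The S3 call is boto3's `put_public_access_block`, and the event fields follow the Macie finding schema as I understand it.

```python
def owner_team(bucket, tag_key="owner-team", default="unassigned"):
    """Look up the owning team from the bucket tags carried in the finding."""
    for tag in bucket.get("tags") or []:
        if tag.get("key") == tag_key:
            return tag.get("value", default)
    return default


def handle_finding(event, s3_client, open_ticket, page_oncall):
    """Remediate a High finding on a publicly accessible bucket."""
    detail = event["detail"]
    bucket = detail["resourcesAffected"]["s3Bucket"]
    is_public = (bucket.get("publicAccess") or {}).get("effectivePermission") == "PUBLIC"
    if detail["severity"]["description"] != "High" or not is_public:
        return None  # out of scope for this runbook

    name = bucket["name"]
    # (1) Turn on all four Block Public Access settings for the bucket.
    s3_client.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    # (2) Ticket for the owning team, (3) page the on-call.
    team = owner_team(bucket)
    open_ticket(team, name, detail.get("id"))
    page_oncall(f"High Macie finding on public bucket {name} (owner: {team})")
    return {"bucket": name, "team": team}


# Usage inside a Lambda function (integrations are placeholders):
#   import boto3
#   def lambda_handler(event, context):
#       return handle_finding(event, boto3.client("s3"), create_jira_ticket, page_slack)
```

Passing the S3 client and the notification callables in as arguments keeps the remediation logic testable with fakes and leaves the choice of ticketing and paging tools to the caller.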