Amazon Macie

Amazon Macie is a managed sensitive-data discovery service for Amazon S3. It uses ML and pattern matching to find PII, PHI, financial data, and credentials inside S3 objects, then publishes severity-graded findings to Security Hub. Macie also continuously monitors S3 bucket-level configuration (public access, encryption, sharing) so the data-discovery findings come paired with their exposure context.

1. Overview & Data Flow

Macie works in two modes: continuous bucket inventory (always on once enabled, billed per bucket rather than per GB) and on-demand or scheduled object content scanning (charged per GB). The diagram below shows the content-scan path: S3 objects flow into the Macie scan engine, are matched against managed and custom identifiers, and emerge as findings routed downstream.

┌──────────────────────────────────────────────────────────────────────────────┐
│                         S3 DATA SOURCES                                      │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │  Data Lake Bkt   │  │  App Logs Bkt    │  │  Backups Bkt     │            │
│  │  (Parquet, CSV)  │  │  (JSON, gzip)    │  │  (DB dumps)      │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                       MACIE SCAN ENGINE                                      │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │ Managed Data IDs │  │  Custom Regex    │  │  Sample-Based    │            │
│  │ PII / PHI / Cred │  │   Identifiers    │  │   Discovery      │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                  FINDINGS (Sensitive Data)                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │  SSN / DOB / CC  │  │  AWS Access Keys │  │  Health Records  │            │
│  │  Severity: High  │  │  Severity: High  │  │  Severity: High  │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                DOWNSTREAM ROUTING & ACTION                                   │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐            │
│  │  Security Hub    │  │   EventBridge    │  │  Lambda / Jira   │            │
│  │  ASFF Aggregate  │  │  Severity Routes │  │  Tag, Quarantine │            │
│  └──────────────────┘  └──────────────────┘  └──────────────────┘            │
└──────────────────────────────────────────────────────────────────────────────┘

2. Managed Data Identifiers

Macie ships with 100+ managed identifiers maintained by AWS. They cover the standard regulated-data categories, and AWS updates them over time to add coverage and reduce false positives.

PII (Personally Identifiable): national identifiers (US SSN, UK National Insurance Number, Canadian SIN), driver's license, passport, date of birth, addresses, phone, email.
PHI (Protected Health): ICD-9 / ICD-10 codes, NPI, HCPCS, drug names, medical record IDs.
Financial: credit card PANs (Visa, MC, Amex), CVV, US/UK/EU bank accounts, SWIFT codes, IBAN.
Credentials: AWS access keys / secret keys, Google Cloud service-account keys, Stripe API keys, and private keys (OpenSSH, PGP, PKCS).
Country-specific: Brazilian CPF/CNPJ, German Personalausweis, French INSEE, etc.

3. Custom Data Identifiers

For internal patterns that AWS doesn't ship (employee IDs, product license keys, customer numbers), define a custom identifier with a regex, optional keywords, an ignore-words list, and a maximum match distance.


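# List parameters take space-separated values; a single comma-joined string
# would be treated as one keyword / one ignore word.
aws macie2 create-custom-data-identifier \
  --name "EmployeeId" \
  --regex "EMP-[0-9]{6}" \
  --keywords "employee" "emp_id" "personnel" \
  --maximum-match-distance 50 \
  --ignore-words "EMP-000000" "EMP-999999"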

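A job evaluates a custom identifier only if the job explicitly selects it, whether the job also uses managed identifiers or runs custom identifiers alone. Matches produce sensitive-data findings just as managed identifiers do; severity defaults to Medium unless the identifier defines its own occurrence-based severity thresholds.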
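
Before creating an identifier, it can help to dry-run the regex, keyword, and ignore-word interplay locally. The sketch below is a rough approximation of that matching logic, not Macie's engine: `count_matches` is a hypothetical helper, and it assumes a match counts only when some keyword ends within the maximum match distance before the end of the matched text, and that any match containing an ignore word is dropped. Macie's own TestCustomDataIdentifier operation remains the authoritative check.

```python
import re


def count_matches(text, regex, keywords=(), ignore_words=(), max_distance=50):
    """Rough local approximation of custom-identifier matching (not Macie's engine)."""
    lowered = text.lower()
    # End offsets of every keyword occurrence; keywords are compared case-insensitively.
    keyword_ends = [
        m.end()
        for kw in keywords
        for m in re.finditer(re.escape(kw.lower()), lowered)
    ]
    count = 0
    for m in re.finditer(regex, text):
        # Drop matches that contain any ignore word (placeholder values, test IDs).
        if any(word in m.group(0) for word in ignore_words):
            continue
        # When keywords are given, require one to end within max_distance
        # characters before the end of the matched text.
        if keywords and not any(
            0 <= m.end() - k_end <= max_distance for k_end in keyword_ends
        ):
            continue
        count += 1
    return count


sample = (
    "employee: EMP-123456\n"           # keyword nearby: counted
    "placeholder emp_id EMP-000000\n"  # ignore word: dropped
    + "x" * 80 + " EMP-654321"         # no keyword within 50 characters: dropped
)
matches = count_matches(
    sample,
    r"EMP-[0-9]{6}",
    keywords=["employee", "emp_id", "personnel"],
    ignore_words=["EMP-000000", "EMP-999999"],
    max_distance=50,
)
print(matches)
```

Tightening the maximum match distance is usually the cheapest way to cut false positives from short, generic patterns.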

4. Job Types

One-Time Job: scans a defined set of buckets / prefixes once. Useful for pre-migration audits or ad-hoc compliance checks.
Scheduled Job: runs daily, weekly, or monthly against the same scope. Macie scans only new or modified objects on subsequent runs (incremental).
Automated Sensitive Data Discovery: account-wide continuous discovery using sampling — Macie picks a representative subset of objects from each bucket and scans them, refining over time. Cheaper and lower-touch than explicit jobs.

Job scope can be filtered by tag, prefix, file type (JSON / CSV / Parquet / Avro / Excel / archives), object size, and last-modified date. Use these filters to skip uninteresting cold storage and concentrate spend on hot data.
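
As an illustration of those filters, the sketch below builds the keyword arguments for a weekly scheduled job that includes only selected file extensions and excludes very large objects. The function name `build_job_request`, the job name, the account ID, and the bucket name are hypothetical; the request shape follows the boto3 `macie2` `create_classification_job` call as I understand it, so verify it against the API reference before use.

```python
import uuid


def build_job_request(account_id, buckets, extensions, max_size_bytes):
    """Return keyword arguments for macie2 create_classification_job (sketch)."""
    return {
        "clientToken": str(uuid.uuid4()),     # idempotency token
        "name": "weekly-customer-data-scan",  # hypothetical job name
        "jobType": "SCHEDULED",
        "scheduleFrequency": {"weeklySchedule": {"dayOfWeek": "MONDAY"}},
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ],
            "scoping": {
                # Scan only objects whose extension is in the allow-list.
                "includes": {"and": [{"simpleScopeTerm": {
                    "key": "OBJECT_EXTENSION",
                    "comparator": "EQ",
                    "values": list(extensions),
                }}]},
                # Skip objects larger than max_size_bytes.
                "excludes": {"and": [{"simpleScopeTerm": {
                    "key": "OBJECT_SIZE",
                    "comparator": "GT",
                    "values": [str(max_size_bytes)],
                }}]},
            },
        },
    }


request = build_job_request(
    account_id="111122223333",           # placeholder account ID
    buckets=["example-data-lake"],       # placeholder bucket name
    extensions=["csv", "json", "parquet"],
    max_size_bytes=5 * 1024 ** 3,        # 5 GiB
)
# With credentials configured, the request could be submitted as:
#   boto3.client("macie2").create_classification_job(**request)
```

Keeping the request as a plain dictionary makes the scope reviewable (and diffable) before any billable scan starts.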

5. Findings & Severity

Macie produces two finding categories:

Sensitive Data findings: an object contains matches for one or more identifiers. Severity is computed from identifier sensitivity (CC# beats name+address) and match count.
Policy findings: bucket-level configuration drift, such as public access enabled, default encryption disabled, replication to an external account, or sharing with an external account.

Severity levels: Low (score 1), Medium (score 2), High (score 3). Each finding includes the bucket and object path, the identifier(s) matched, the occurrence count, and the location of occurrences within the object; the sensitive values themselves are not stored in the finding.

6. Cost-Optimization Patterns

Content scanning is charged per GB; bucket inventory is charged per bucket per month, independent of data volume. The main cost levers are therefore object selection and sampling.

Use Automated Sensitive Data Discovery as the always-on baseline — sampling means you pay for a fraction of total volume while still getting per-bucket sensitivity scores.
Scope explicit jobs by tag (for example, data-classification=customer) so you scan only the buckets you have a reason to inspect: either buckets that should never contain sensitive data, where any match is a violation, or buckets where locating sensitive data is the goal.
Skip archives & large binaries with file-type filters when the data lake also stores model weights, images, or video that won't contain the identifiers you care about.
Filter by last-modified date so a weekly job re-scans only the past 7 days, not the entire bucket.
Lifecycle-tier old findings: archive resolved findings to S3 to keep the active findings list manageable.

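Approximate pricing as of 2026 (confirm on the Amazon Macie pricing page for your Region): ~$1.00 per GB scanned for sensitive-data discovery in the first volume tier, and ~$0.10 per bucket per month for the inventory. New accounts get a 30-day free trial.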
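
To compare a full scan against sampling, the arithmetic can be sketched as below. `estimate_monthly_cost` is a hypothetical helper, and the rates passed to it are placeholders rather than quoted prices; it also ignores volume tiers and any per-object monitoring charges.

```python
def estimate_monthly_cost(total_gb, bucket_count, per_gb_rate, per_bucket_rate,
                          scanned_fraction=1.0):
    """Estimate a month's Macie bill from data volume and bucket count (sketch)."""
    if not 0.0 <= scanned_fraction <= 1.0:
        raise ValueError("scanned_fraction must be between 0 and 1")
    scan_cost = total_gb * scanned_fraction * per_gb_rate  # content scanning
    inventory_cost = bucket_count * per_bucket_rate        # bucket inventory
    return scan_cost + inventory_cost


# Placeholder rates: substitute the current figures for your Region.
PER_GB = 1.00
PER_BUCKET = 0.10

full_scan = estimate_monthly_cost(500, 40, PER_GB, PER_BUCKET)
sampled = estimate_monthly_cost(500, 40, PER_GB, PER_BUCKET, scanned_fraction=0.1)
print(f"full scan: ${full_scan:.2f}  10% sample: ${sampled:.2f}")
```

The inventory term stays constant while the scan term scales with the fraction of data inspected, which is why sampling and scoping dominate the bill.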

7. Integration with Security Hub

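Macie publishes findings to EventBridge in near real time and to Security Hub in ASFF format. By default only policy findings are sent to Security Hub, so enable publication of sensitive-data findings in Macie's publication settings if you want both there. With both in place, a single Security Hub dashboard can show "buckets with public access AND containing PII" by joining Macie sensitive-data findings with Macie policy findings (and Config rules for redundancy).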


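from datetime import datetime, timedelta, timezone

import boto3

macie = boto3.client("macie2")

# Lower bound for updatedAt: 7 days ago, expressed as epoch milliseconds
since_ms = int((datetime.now(timezone.utc) - timedelta(days=7)).timestamp() * 1000)

# Sort by severity score, highest first (used for both calls below)
sort = {"attributeName": "severity.score", "orderBy": "DESC"}

# List High-severity sensitive-data findings from the last 7 days
resp = macie.list_findings(
    findingCriteria={
        "criterion": {
            "severity.description": {"eq": ["High"]},
            "category": {"eq": ["CLASSIFICATION"]},
            "updatedAt": {"gte": since_ms},
        }
    },
    maxResults=50,
    sortCriteria=sort,
)

# get_findings accepts up to 50 IDs per call, so fetch the page in one request
# (it rejects an empty ID list, hence the guard)
if resp["findingIds"]:
    details = macie.get_findings(findingIds=resp["findingIds"], sortCriteria=sort)
    for detail in details["findings"]:
        print(f"{detail['severity']['description']:<6} "
              f"{detail['resourcesAffected']['s3Bucket']['name']}/"
              f"{detail['resourcesAffected']['s3Object']['key']}")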
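
The "public access AND containing PII" view described above amounts to a join on bucket name. A minimal sketch, assuming the findings have already been fetched as dictionaries in Macie's native finding shape (`category`, `type`, `resourcesAffected`); the helper name and the sample records are hypothetical.

```python
def public_buckets_with_sensitive_data(findings):
    """Return bucket names that have both a public-bucket policy finding
    and at least one sensitive-data (classification) finding."""
    public, sensitive = set(), set()
    for finding in findings:
        bucket = finding["resourcesAffected"]["s3Bucket"]["name"]
        if (finding["category"] == "POLICY"
                and finding["type"] == "Policy:IAMUser/S3BucketPublic"):
            public.add(bucket)
        elif finding["category"] == "CLASSIFICATION":
            sensitive.add(bucket)
    return sorted(public & sensitive)


# Hypothetical sample records, trimmed to the fields the join needs.
sample_findings = [
    {"category": "POLICY", "type": "Policy:IAMUser/S3BucketPublic",
     "resourcesAffected": {"s3Bucket": {"name": "app-logs"}}},
    {"category": "CLASSIFICATION", "type": "SensitiveData:S3Object/Personal",
     "resourcesAffected": {"s3Bucket": {"name": "app-logs"}}},
    {"category": "CLASSIFICATION", "type": "SensitiveData:S3Object/Financial",
     "resourcesAffected": {"s3Bucket": {"name": "data-lake"}}},
]
exposed = public_buckets_with_sensitive_data(sample_findings)
print(exposed)
```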

Common runbook: a High Macie finding on a public-access bucket triggers an EventBridge rule that (1) flips the bucket's BlockPublicAccess settings to all-true, (2) creates a Jira ticket assigned to the bucket's owning team via tag lookup, (3) Slack-pages the data-protection on-call.
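
Step (1) of that runbook can be sketched as an EventBridge pattern plus a small handler. This is an illustration under assumptions, not a production remediation: `EVENT_PATTERN` and `remediate` are hypothetical names, the event is assumed to carry the Macie finding under `detail`, and the S3 client is injected so the logic can be exercised without AWS access (a Lambda would pass `boto3.client("s3")`). The Jira and Slack steps are omitted.

```python
# Match High-severity Macie findings delivered through EventBridge.
EVENT_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {"severity": {"description": ["High"]}},
}


def remediate(event, s3_client):
    """Block public access on the affected bucket if it is currently public.

    Returns the bucket name when remediation was applied, otherwise None.
    """
    bucket = event["detail"]["resourcesAffected"]["s3Bucket"]
    permission = bucket.get("publicAccess", {}).get("effectivePermission")
    if permission != "PUBLIC":
        return None  # nothing to lock down
    s3_client.put_public_access_block(
        Bucket=bucket["name"],
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return bucket["name"]


class RecordingS3Client:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_public_access_block(self, **kwargs):
        self.calls.append(kwargs)


fake_s3 = RecordingS3Client()
sample_event = {"detail": {"resourcesAffected": {"s3Bucket": {
    "name": "app-logs",
    "publicAccess": {"effectivePermission": "PUBLIC"},
}}}}
fixed = remediate(sample_event, fake_s3)
print(fixed, len(fake_s3.calls))
```

Checking the effective permission before acting keeps the handler idempotent when the same finding is re-delivered after the bucket has already been locked down.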