PII Detection: Finding Personal Data in Your Codebase and Databases
A practical guide to detecting personally identifiable information across your codebase, databases, S3 buckets, and log pipelines using Amazon Macie, Google Cloud DLP, open-source regex patterns, and structured scanning strategies.
Before you can protect personal data, you need to know where it is. Most organizations are surprised to discover the breadth of places PII ends up: not just in the obvious user tables, but in application logs, S3 bucket contents, database backups, CI/CD artifacts, third-party integrations, and developer laptops. This guide covers the tooling and methodology for systematically finding personal data across your infrastructure.
Why PII Discovery Matters
GDPR Article 30 requires a Record of Processing Activities (ROPA) — an accurate inventory of what personal data you process, where it lives, and for what purpose. You cannot maintain an accurate ROPA without scanning your systems to find data you may not know exists.
Common sources of unexpected PII:
- Application logs: Email addresses in authentication logs, user names in error messages, IP addresses in access logs, support ticket content in debug output.
- S3 / object storage: Export files, backup archives, user-uploaded content that was supposed to be processed and deleted but persisted.
- Database backups: Snapshots that predate your current data minimization policies may contain fields you have since removed from the live schema.
- Analytics pipelines: Raw event streams sent to a data warehouse before PII scrubbing was implemented.
- Git repositories: Test fixtures with real user data, config files with email addresses, hardcoded credentials that include user information.
- Third-party integrations: Data that was synced to a CRM, support tool, or marketing platform and now exists in their systems independently.
- CI/CD pipeline artifacts: Test run outputs, coverage reports, or deployment logs that captured environment variables or database contents during testing.
Amazon Macie
Amazon Macie is a managed data security service that uses machine learning to discover and protect sensitive data in Amazon S3. It is the right tool for organizations with significant data in S3 buckets.
What Macie detects: Macie has built-in managed data identifiers for 100+ types of sensitive data, organized into categories:
- Credentials (AWS keys, GitHub tokens, private keys, passwords)
- Financial information (credit card numbers, bank account numbers, SWIFT codes)
- Personal health information (National Drug Codes, medical record identifiers)
- PII (names, dates of birth, passport numbers, driver's license numbers, national IDs)
- Contact information (email addresses, phone numbers, physical addresses)
Enabling Macie:
# Enable Macie
aws macie2 enable-macie
# Create a classification job targeting specific S3 buckets
aws macie2 create-classification-job \
  --job-type ONE_TIME \
  --name "Initial PII Discovery" \
  --s3-job-definition '{
    "bucketDefinitions": [
      {
        "accountId": "123456789012",
        "buckets": ["my-app-uploads", "my-app-exports", "my-app-backups"]
      }
    ]
  }' \
  --managed-data-identifier-selector ALL
Interpreting results: Macie produces findings with a severity level (low/medium/high/critical), the specific data identifier that triggered the finding, and the S3 object path. High-severity findings (credit card numbers, social security numbers) should be remediated immediately. Medium findings (email addresses in logs) require assessment — are they expected and controlled, or unexpected leakage?
Cost consideration: Macie charges per GB scanned. Run an initial one-time job to establish a baseline, then configure ongoing monitoring only for buckets with high data sensitivity or frequent writes.
Custom data identifiers: If Macie's managed identifiers do not cover your specific PII types (e.g., your internal customer ID format, which can be used to link to personal data), you can create custom identifiers using regular expressions:
aws macie2 create-custom-data-identifier \
  --name "Internal Customer ID" \
  --regex "CUST-[0-9]{8}" \
  --description "Internal customer identifier that links to PII"
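Before creating the identifier, it is worth validating the expression locally against known-good and known-bad samples (the sample strings below are invented). Note that adding \b word boundaries, as in this sketch, prevents the bare pattern from matching inside longer digit runs:

```python
import re

# \b boundaries keep the pattern from matching inside longer digit runs,
# which the unanchored regex would otherwise do when embedded in text.
CUSTOMER_ID = re.compile(r"\bCUST-[0-9]{8}\b")

samples = {
    "order placed by CUST-00421337 yesterday": True,
    "legacy id CUST-1234 (too few digits)": False,
    "no identifier here": False,
}

for text, expected in samples.items():
    assert bool(CUSTOMER_ID.search(text)) == expected, text
```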
Google Cloud DLP
Google Cloud Data Loss Prevention (Cloud DLP) is Google's equivalent — a fully managed service for inspecting, classifying, and de-identifying sensitive data across GCS, BigQuery, Datastore, and arbitrary text inputs.
InfoType detectors: Cloud DLP uses "infoType" detectors analogous to Macie's managed identifiers. Available infoTypes include EMAIL_ADDRESS, PHONE_NUMBER, PERSON_NAME, DATE_OF_BIRTH, CREDIT_CARD_NUMBER, US_SOCIAL_SECURITY_NUMBER, IBAN_CODE, and 150+ others.
Scanning a GCS bucket:
from google.cloud import dlp_v2

def scan_bucket_for_pii(project_id: str, bucket_name: str) -> None:
    dlp = dlp_v2.DlpServiceClient()

    storage_config = dlp_v2.StorageConfig(
        cloud_storage_options=dlp_v2.CloudStorageOptions(
            file_set=dlp_v2.CloudStorageOptions.FileSet(
                url=f"gs://{bucket_name}/**"
            )
        )
    )

    inspect_config = dlp_v2.InspectConfig(
        info_types=[
            dlp_v2.InfoType(name="EMAIL_ADDRESS"),
            dlp_v2.InfoType(name="PHONE_NUMBER"),
            dlp_v2.InfoType(name="PERSON_NAME"),
            dlp_v2.InfoType(name="CREDIT_CARD_NUMBER"),
        ],
        min_likelihood=dlp_v2.Likelihood.POSSIBLE,
        include_quote=False,  # don't include the actual PII value in findings
    )

    # Trigger a DLP inspection job
    job = dlp.create_dlp_job(
        parent=f"projects/{project_id}/locations/global",
        inspect_job=dlp_v2.InspectJobConfig(
            storage_config=storage_config,
            inspect_config=inspect_config,
        ),
    )
    print(f"DLP job created: {job.name}")
Scanning BigQuery: Cloud DLP can inspect entire BigQuery datasets or specific tables. This is valuable for data warehouse scans where PII may have accumulated in analytical tables.
De-identification: A powerful DLP feature is de-identification — Cloud DLP can rewrite sensitive data in place using techniques like tokenization, masking, bucketing, or date shifting. This is useful for creating sanitized copies of production data for development environments.
Open-Source PII Scanning in Code Repositories
Commercial cloud tools do not scan your source code for embedded PII. For codebase scanning, a combination of open-source tools and custom regex patterns is more practical.
detect-secrets (by Yelp): Primarily designed for secrets, but configurable for PII patterns. Integrates as a pre-commit hook.
Custom regex scanning with ripgrep: For CI integration, a ripgrep scan with a set of PII patterns provides fast detection:
# Scan for common PII patterns in source code
rg --type-add 'source:*.{ts,js,tsx,jsx,py,go,java,rb}' \
  -t source \
  -e "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" \
  -e "\b\d{3}[-.]?\d{2}[-.]?\d{4}\b" \
  -e "\b4[0-9]{12}(?:[0-9]{3})?\b" \
  --no-messages \
  ./src
This scans for email addresses, US SSN patterns, and Visa card numbers. Extend the patterns for your specific context.
PII in test fixtures: A common finding is real customer data in test fixtures or seed files. Establish a policy that test data must be synthetically generated, and scan your test/fixtures and seed/ directories specifically.
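A minimal scanner for that policy might look like the following sketch. The allowlist of synthetic domains is an assumption; adjust it to whatever your seed data actually uses:

```python
import re
import tempfile
from pathlib import Path

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})")

# Hypothetical allowlist: domains reserved for documentation and testing,
# which are acceptable in fixtures.
SYNTHETIC_DOMAINS = {"example.com", "example.org", "test.invalid"}

def scan_fixture_dir(root: str) -> list[tuple[str, str]]:
    """Return (file, email) pairs whose domain is not on the allowlist."""
    findings = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            text = path.read_text(errors="ignore")
            for m in EMAIL.finditer(text):
                if m.group(1).lower() not in SYNTHETIC_DOMAINS:
                    findings.append((str(path), m.group(0)))
    return findings

# Demo on a throwaway directory: one real-looking address, one synthetic
with tempfile.TemporaryDirectory() as d:
    Path(d, "seed.json").write_text('{"a": "jane.doe@gmail.com", "b": "bob@example.com"}')
    hits = scan_fixture_dir(d)
```

Run this against your fixtures and seed directories in CI and fail the build on any finding.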
Detecting PII in Application Logs
Log PII is one of the hardest problems because logging is pervasive and often ad-hoc. Engineers add console.log(user) during debugging and never remove it. The solution is a combination of preventive controls (log sanitization at ingestion) and detective controls (scanning existing logs).
Preventive: structured logging with PII fields filtered at write time:
import pino from 'pino';

// List of keys that should never appear in logs
const PII_KEYS = ['email', 'password', 'name', 'phone', 'address', 'ssn', 'creditCard', 'ipAddress'];

const logger = pino({
  serializers: {
    // Redact PII fields from any logged user object
    user: (user) => ({
      userId: user.userId,
      role: user.role
      // email, name, etc. are intentionally omitted
    })
  },
  redact: {
    // Censor these keys at the top level and one level deep
    paths: PII_KEYS.flatMap((key) => [key, `*.${key}`]),
    censor: '[REDACTED]'
  }
});
Pino's redact option censors the listed key paths before the entry reaches your log aggregator. Note that the * wildcard matches a single level of nesting, so keys buried deeper in logged objects need explicit paths or a serializer like the one above.
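The same preventive idea carries over to services written in Python via a standard-library logging filter. This is a sketch, not a drop-in library, and it only censors email patterns; extend the regex set for your context:

```python
import io
import logging
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

class RedactEmails(logging.Filter):
    """Rewrite email addresses in the formatted message before it is emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Render %-style args first, then censor the final string.
        record.msg = EMAIL.sub("[REDACTED]", record.getMessage())
        record.args = None
        return True

# Wire it up against an in-memory stream to demonstrate the effect
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactEmails())
log = logging.getLogger("pii-demo")
log.propagate = False
log.addHandler(handler)
log.warning("login failed for %s", "jane@gmail.com")
```

Attaching the filter to the handler means every record passing through it is censored, regardless of which module emitted the log call.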
Detective: scanning logs in your log management system: Most log management platforms (Datadog, Splunk, CloudWatch Logs Insights) support regex-based queries. Run periodic queries against your log data to identify PII leakage:
In Datadog:
@message:/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/
In CloudWatch Logs Insights:
fields @timestamp, @message
| filter @message like /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/
| limit 100
When you find PII in logs, the remediation steps are:
- Fix the logging call that introduced it
- Delete the affected log entries (most platforms support log deletion or expiry)
- Assess whether the log data was exported to a data warehouse or SIEM that also needs remediation
Database PII Scanning
For PostgreSQL and MySQL databases, scanning for PII requires querying data in text columns and applying pattern matching. This should be done on a read replica to avoid impacting production performance.
-- PostgreSQL: count email-pattern matches in selected text columns of users
-- (for MySQL, use REGEXP in place of the ~ operator)
SELECT
  column_name,
  COUNT(*) AS match_count
FROM (
  SELECT 'email_field' AS column_name, email AS value
  FROM users
  WHERE email ~ '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
  UNION ALL
  SELECT 'bio_field' AS column_name, bio AS value
  FROM users
  WHERE bio ~ '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
) subq
GROUP BY column_name;
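Writing one UNION ALL branch per column by hand does not scale past a few tables. A small generator helps; this is a sketch, and because table and column names are interpolated directly into SQL, they must come from a trusted source such as information_schema, never from user input:

```python
EMAIL_PATTERN = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

def build_pii_scan(table: str, columns: list[str], pattern: str = EMAIL_PATTERN) -> str:
    """Generate a PostgreSQL query counting pattern matches per text column."""
    branches = [
        f"SELECT '{col}' AS column_name FROM {table} WHERE {col} ~ '{pattern}'"
        for col in columns
    ]
    body = "\n  UNION ALL\n".join(f"  {b}" for b in branches)
    return (
        "SELECT column_name, COUNT(*) AS match_count\nFROM (\n"
        f"{body}\n) subq\nGROUP BY column_name;"
    )

print(build_pii_scan("users", ["email", "bio"]))
```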
For MongoDB, use $regex conditions in a $match stage of the aggregation pipeline:
// Node.js driver: count documents matching email- or SSN-like patterns
db.collection('activityLogs').aggregate([
  {
    $match: {
      $or: [
        { message: { $regex: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/ } },
        { 'metadata.userInput': { $regex: /\d{3}[-.]?\d{2}[-.]?\d{4}/ } }
      ]
    }
  },
  { $count: 'pii_findings' }
]);
Building a PII Discovery Workflow
Combine these tools into a repeatable process:
1. Baseline scan: Run Macie (AWS) or Cloud DLP (GCP) against all object storage. Run custom regex scans against your database and log stores. Run repository scanning against your codebase. Document all findings.
2. Triage findings: Categorize each finding as expected (controlled, documented), unexpected-acceptable (low risk, document and monitor), or unexpected-unacceptable (remediate).
3. Remediate unexpected findings: Delete or pseudonymize data that should not be where it is. Fix the process that produced it.
4. Update the ROPA: Add any newly discovered data stores to your Record of Processing Activities.
5. Continuous scanning: Schedule regular scans (weekly for high-sensitivity stores, monthly for others). Add PII pattern checks to your CI pipeline for new code.
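Keeping triage decisions consistent across repeated scans is easier with a structured record per finding. A minimal sketch (the field names are illustrative, not a Macie or Cloud DLP schema):

```python
from dataclasses import dataclass
from enum import Enum

class Triage(Enum):
    EXPECTED = "expected"                  # controlled and documented
    UNEXPECTED_ACCEPTABLE = "monitor"      # low risk: document and monitor
    UNEXPECTED_UNACCEPTABLE = "remediate"  # delete/pseudonymize, fix the source

@dataclass
class Finding:
    store: str       # e.g. an S3 bucket, table, or log index
    data_type: str   # e.g. "EMAIL_ADDRESS"
    triage: Triage
    in_ropa: bool = False  # whether the store is already recorded in the ROPA

def needs_action(finding: Finding) -> bool:
    """Anything not both expected and ROPA-recorded still needs work."""
    return finding.triage is not Triage.EXPECTED or not finding.in_ropa
```

Persisting these records between scans also lets you suppress findings you have already triaged as expected, so each run surfaces only what is new.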
PII discovery is not a one-time event — new data accumulates constantly, and new features introduce new data flows. Treat it as an ongoing operational process.