Ask any developer whether their local database has real customer data in it, and most will say no.
Ask them to check, and most will find that it does.
Real emails in users. Real names in profiles. Real billing addresses in payments. Real IP addresses in audit_logs. Data that landed in production, got copied somewhere for debugging, and has been sitting in local databases and CI pipelines ever since.
This is not a hypothetical compliance problem. It is a real one, and it gets messier the longer it goes unaddressed.
What counts as PII in a PostgreSQL database
PII is broader than most developers expect. The obvious fields are easy to spot:
email, email_address
first_name, last_name, full_name
phone, phone_number, mobile
address, street_address, city, postal_code
date_of_birth, dob
ssn, national_id, tax_id
But in real production schemas, PII hides in less obvious places:
free-text fields like notes, description, bio that users fill in
ip_address columns in event logs and audit tables
stripe_customer_id, paypal_email — identifiers that link back to real people
JSONB columns that store user-submitted form data
metadata fields that accumulate whatever the app was logging at the time
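A first pass at finding the obvious cases can be automated by matching column names against known PII patterns. The sketch below is a minimal, illustrative version in Python: the pattern list and the `flag_pii_columns` helper are my own naming, and in practice you would feed it column names queried from information_schema.columns rather than a hardcoded sample. It only catches name-based hits; free-text and JSONB columns still need manual review.

```python
import re

# Patterns that commonly indicate PII columns; extend for your schema.
# Illustrative and deliberately incomplete -- a name match is a hint,
# not a verdict.
PII_PATTERNS = [
    r"email", r"(first|last|full)_name", r"phone|mobile",
    r"address|city|postal_code", r"date_of_birth|\bdob\b",
    r"ssn|national_id|tax_id", r"ip_address",
]

def flag_pii_columns(columns):
    """Return the columns whose names match a known PII pattern."""
    return [c for c in columns
            if any(re.search(p, c) for p in PII_PATTERNS)]

# Stand-in for a query against information_schema.columns.
sample = ["id", "email", "first_name", "notes", "ip_address", "created_at"]
print(flag_pii_columns(sample))  # ['email', 'first_name', 'ip_address']
```

Note what it misses in the sample: notes slips through, which is exactly the free-text problem described above.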
If you have been copying your production database to dev environments without systematically anonymizing those fields, that data is on developer laptops, in CI logs, and probably in Slack at some point.
Why this matters beyond just being careful
GDPR, CCPA, HIPAA, and most other data protection frameworks have something in common: they do not distinguish between production and non-production environments. If you are processing personal data in a development environment without appropriate controls, you are in scope.
In practice, the requirements that put dev environments in scope are usually:
GDPR Article 25: "data protection by design and by default" — development tooling is explicitly in scope
SOC 2 Type II: data handling controls are audited across environments, not just production
HIPAA minimum necessary rule: PHI should only be available to the systems that need it for the purpose it was collected
Beyond compliance, there is a simpler reason: real customer data in dev environments is one of the most common sources of accidental exposure. A developer shares a failing test case on Slack. A CI artifact gets retained with real names in it. A staging database backup ends up in a public S3 bucket.
The fix is not more policies. It is removing real data from the environments where it does not belong.
The naive approach: UPDATE statements after restore
The most common first attempt at anonymization looks like this:
-- Run after restoring a pg_dump to dev
UPDATE users SET
email = 'user' || id || '@example.com',
first_name = 'Test',
last_name = 'User';
UPDATE orders SET
shipping_address = '123 Test St';
This works well enough until it does not.
The problems start to accumulate:
Someone forgets to run the script, and real data ends up in a dev environment anyway.
The script is not versioned with the schema, so it breaks when new PII columns are added.
It replaces data inconsistently — the same customer gets a different fake email in users than in audit_logs, breaking join-based queries.
It runs after the fact, which means real data has already traveled through the restore pipeline.
It has no automated detection — every new PII column has to be added manually.
This is the pattern that eventually lands teams in trouble. It feels like a solution because it works most of the time. It fails when someone does not run it, or when a new field gets added and nobody updates the script. It shares the same fundamental problem as seed scripts — manual upkeep that silently falls behind.
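The consistency problem in particular has a well-known fix: derive the fake value deterministically from the real one, so the same customer gets the same fake email in every table and join-based queries keep working. A minimal sketch, assuming a hash-based approach (the `mask_email` name and the `@example.com` domain are illustrative choices, not a standard):

```python
import hashlib

def mask_email(real_email: str) -> str:
    """Map a real email to a stable fake one.

    Because the output is derived from a hash of the input, the same
    real address always maps to the same fake address -- across tables,
    across runs -- without storing a lookup table of real values.
    """
    digest = hashlib.sha256(real_email.lower().encode()).hexdigest()[:10]
    return f"user-{digest}@example.com"

# The same customer gets the same fake address in users and audit_logs.
print(mask_email("jane@corp.com") == mask_email("JANE@corp.com"))  # True
```

Note that a plain hash is pseudonymization, not irreversible anonymization: if an attacker can guess candidate emails, they can hash them and match. For stronger guarantees, use a keyed hash (e.g. HMAC with a secret that never leaves production).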
The better approach: anonymize at extraction time
The more reliable pattern is to anonymize the data before it ever leaves the production environment, not after it arrives in dev. This applies whether you are setting up a local development environment or populating CI databases.
That means the anonymization step is baked into the snapshot process:
Connect to production (or a read replica).
Extract the rows you need.
Anonymize sensitive fields inline, during extraction.
Write the already-anonymized snapshot to wherever it will be stored.
The result is that no real PII ever travels to dev environments. What gets restored is already masked.
This matters because it removes the "forget to run the script" failure mode entirely. There is no post-restore step to forget.
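The steps above can be sketched as a streaming transform: rows pass through a masking function on their way from the production cursor to the snapshot file, and nothing un-masked is ever buffered or written. The code below is a stand-in, not a real pipeline -- `anonymize_rows` and the sample rows are hypothetical, and a production version would read from a psycopg cursor or a COPY stream rather than a list:

```python
import csv
import hashlib
import io

def mask_email(email: str) -> str:
    # Deterministic fake email (same consistency property as above).
    h = hashlib.sha256(email.encode()).hexdigest()[:10]
    return f"user-{h}@example.com"

def anonymize_rows(rows, masks):
    """Apply per-column mask functions as rows stream through.

    Each row is transformed before it reaches the output, so the
    written snapshot never contains the real values.
    """
    for row in rows:
        yield {col: masks.get(col, lambda v: v)(val)
               for col, val in row.items()}

# Stand-in for rows streamed from production (or a read replica).
source = [{"id": "1", "email": "jane@corp.com", "plan": "pro"}]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "email", "plan"])
writer.writeheader()
writer.writerows(anonymize_rows(source, {"email": mask_email}))

print("jane@corp.com" not in out.getvalue())  # True: no real email in the snapshot
```

The design point is that the mask configuration lives with the extraction code, versioned alongside the schema, rather than in a post-restore script someone has to remember to run.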
What good anonymization actually requires
Replacing real values with fake ones is straightforward. Making the fake values behave like real data is harder.
A few requirements come up in practice:
Realistic fake values
email = 'test@example.com' is easy to write and easy to spot. It does not behave like real email data in filtering, search, or display.
Better: generate realistic-looking fake emails that follow the
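One way to get there, sketched under the same deterministic-hash assumption as above: pick the fake name and domain from small pools indexed by the hash, so the output looks plausible while staying stable per input. The pool contents and the `fake_email` helper are illustrative; a real setup would use a much larger word list (a Faker-style library, for instance) to avoid obvious repetition.

```python
import hashlib

# Small illustrative pools; expand these substantially in practice.
FIRST_NAMES = ["alice", "brian", "carmen", "diego", "elena", "farid"]
DOMAINS = ["example.com", "example.org", "example.net"]

def fake_email(real_email: str) -> str:
    """Deterministically generate a realistic-looking fake email."""
    h = int(hashlib.sha256(real_email.encode()).hexdigest(), 16)
    name = FIRST_NAMES[h % len(FIRST_NAMES)]
    # A numeric suffix keeps distinct inputs from colliding too often.
    return f"{name}.{h % 10000}@{DOMAINS[h % len(DOMAINS)]}"

print(fake_email("jane@corp.com"))
```

The result still lives on reserved example domains (so no real inbox can ever receive mail from a dev environment) but looks and sorts like real email data.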