A developer needs to reproduce a bug that only shows up on real data. The first impulse is to hand over a dump of the production database. The problem: that violates personal data law (GDPR in Europe, 152-FZ in Russia). Here we look at how to give a developer usable data without breaking the law.
Why you can't just hand over the dump as is
Every copy of production data is a new surface for a leak. If the dump ends up in the wrong place (a laptop, an unencrypted drive, a messenger), you take on legal liability and users get their privacy violated. They never consented to their data being stored on an intern's work laptop.
The good news: almost all development tasks are solved with synthetic (generated) data. A real dump is rarely needed — only when reproducing a specific problem with a specific set of rows.
What PII is
PII (Personally Identifiable Information) is data that can be used to identify a person.
Direct identifiers — point unambiguously to a specific person:
- first name, last name, middle name;
- email and phone;
- address, passport, SNILS, INN;
- date of birth;
- IP address;
- bank card or account number.
Indirect identifiers — safe on their own, but together they let you pinpoint a person:
- precise geolocation;
- the combination "occupation + city + year of birth" (in a small town such a triple is already unique);
- the browser's User-Agent and fingerprint fields.
Business data — not PII, but also sensitive:
- B2B prices and discounts for specific clients;
- internal moderator comments;
- logs of employee actions.
Six anonymization strategies
Deletion — set NULL. Use this when the field isn't needed for debugging at all.
UPDATE customer SET address = NULL, apartment = NULL;
Replacing with a constant — every row gets the same value. Fits when you need the structure but not the content.
UPDATE customer SET email = 'masked@example.com';
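Pseudonymization via hashing — each unique email turns into a pseudonym, and relationships between rows are preserved (the same email → the same pseudonym). Plain md5 uses no key, though: anyone with a list of candidate emails can hash them and match the results, so the original is only protected if a secret salt is mixed into the hash and kept out of the dump.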
UPDATE customer
SET email = 'user' || substring(md5(email), 1, 8) || '@example.test'
WHERE email IS NOT NULL;
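A bare md5 of an email involves no secret, so hashing a list of known addresses is enough to match the pseudonyms back. A minimal sketch of a keyed variant, run on the application side with Python's standard hmac module, might look like this. SECRET_KEY and pseudonymize_email are hypothetical names introduced for the example, not part of the original script.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; in practice it would come from a
# secret store and never be committed next to the anonymization script.
SECRET_KEY = b"replace-with-a-random-secret"


def pseudonymize_email(email: str, key: bytes = SECRET_KEY) -> str:
    """Map an email to a stable pseudonym that needs the key to recompute."""
    # Normalize so that case and stray whitespace do not split one person in two.
    normalized = email.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # 16 hex characters = 64 bits, which keeps accidental collisions unlikely.
    return f"user{digest[:16]}@example.test"


if __name__ == "__main__":
    print(pseudonymize_email("Alice@Example.com"))
    # The same address after normalization gives the same pseudonym.
    print(pseudonymize_email("alice@example.com ") == pseudonymize_email("Alice@Example.com"))
```

The same input always gives the same pseudonym for a given key, so joins keep working, while a different key produces an unrelated mapping.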
Replacing with random data (faker) — instead of a real name, a random but realistic one is substituted. The dataset looks natural.
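To illustrate the faker approach without any third-party dependency, the sketch below draws names from two tiny hard-coded pools. The pools, the seed, and the fake_name function are assumptions made for the example; a real faker library ships far larger dictionaries and locales.

```python
import random

# Tiny illustrative name pools; these are made up for the example.
FIRST_NAMES = ["Anna", "Boris", "Elena", "Ivan", "Maria", "Pavel"]
LAST_NAMES = ["Ivanov", "Kuznetsov", "Petrov", "Smirnov", "Sokolov"]


def fake_name(customer_id: int, seed: int = 42):
    """Return a (first_name, last_name) pair that is stable for a given id and seed."""
    # A per-row generator makes the result reproducible between runs,
    # which keeps repeated dumps comparable.
    rng = random.Random(seed * 1_000_003 + customer_id)
    return rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)


if __name__ == "__main__":
    for customer_id in range(1, 4):
        print(customer_id, *fake_name(customer_id))
```

Unlike the ID-suffix approach, names generated this way are not unique, so they should not be used where a uniqueness constraint applies.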
Shuffling — values are permuted between rows. The distribution is preserved, but the link to a specific person is lost.
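A rough sketch of shuffling on rows already loaded into memory, assuming each row is a plain dict; shuffle_column is a hypothetical helper, not a library function.

```python
import random
from typing import Optional


def shuffle_column(rows, column, seed: Optional[int] = None):
    """Return copies of the rows with the values of one column permuted across rows."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    # Build new dicts so the input rows stay untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


if __name__ == "__main__":
    customers = [
        {"id": 1, "city": "Kostroma"},
        {"id": 2, "city": "Kazan"},
        {"id": 3, "city": "Omsk"},
    ]
    for row in shuffle_column(customers, "city", seed=1):
        print(row)
```

A permutation can leave some values where they were, and a rare value still reveals that someone in the table has it, so shuffling works best on large tables with common values.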
Generalization — a precise value is replaced with a less precise one: date of birth → year, city → region. Used for analytical data, where the distribution matters.
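Generalization can be sketched as a pair of lossy mappings. CITY_TO_REGION below is a hypothetical lookup built for this example; in a real setup it would more likely come from a reference table.

```python
from datetime import date

# Hypothetical city-to-region lookup, for illustration only.
CITY_TO_REGION = {
    "Kostroma": "Kostroma Oblast",
    "Kazan": "Republic of Tatarstan",
    "Omsk": "Omsk Oblast",
}


def generalize(born_on: date, city: str):
    """Reduce a date of birth to its year and a city to its region."""
    # Unknown cities fall into a catch-all bucket instead of leaking through.
    return born_on.year, CITY_TO_REGION.get(city, "Other")


if __name__ == "__main__":
    print(generalize(date(1985, 3, 12), "Kostroma"))  # (1985, 'Kostroma Oblast')
```

The catch-all bucket is a design choice: a city missing from the lookup is hidden, not passed through unchanged.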
How to mask specific fields
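Email
Pseudonymization: the same email always yields the same pseudonym, so foreign keys and joins on email keep working. Keeping only 8 hex characters (32 bits) means two different emails can collide on tables with tens of thousands of rows or more; if the column has a unique constraint, take a longer substring.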
UPDATE customer
SET email = 'user' || substring(md5(email), 1, 8) || '@example.test'
WHERE email IS NOT NULL;
Phone
A random number in Russian mobile format. Random values can repeat, so this does not suit a column with a unique constraint:
UPDATE customer
SET phone = '+7000' || lpad(floor(random() * 10000000)::int::text, 7, '0')
WHERE phone IS NOT NULL;
First and last name
The simplest option is to append a suffix from the ID (uniqueness guaranteed):
UPDATE customer SET
first_name = 'Name' || id,
last_name = 'Surname' || id;
If you need realistic names, use postgresql_anonymizer (more on that below) or test-data generation libraries on the application side: JavaFaker (Java), faker-js (Node/TypeScript), Faker (Python), gofakeit (Go).
Date of birth
Drop the specific day, keep the year — the age is preserved approximately:
UPDATE customer SET born_on = date_trunc('year', born_on)::date;
Card data
Zero out everything except the technical fields:
UPDATE payment SET
card_last4 = '0000',
card_holder_name = 'TEST';
Text fields (comments, descriptions)
Free text is the tricky case. A user might have written their phone number or address in a comment. The most reliable approach is to nullify it or replace it with a placeholder:
UPDATE order_comment SET body = '[REDACTED]' WHERE body IS NOT NULL;
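If the content matters for debugging, run NER (named entity recognition) through an external tool, or mask part of each value with the anon.partial() function from postgresql_anonymizer, which keeps only a chosen prefix and suffix and replaces the rest with padding.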
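When the comment text has to survive, a pattern-based scrub is a cheap first pass before heavier tools. The sketch below is an assumption-laden example: the two regular expressions are deliberately loose, and scrub is a hypothetical helper. It catches email addresses and phone-like digit runs, but not names or street addresses.

```python
import re

# Loose email pattern: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Phone-like run: optional "+", then 10 to 15 digits that may be
# separated by spaces, dashes, or parentheses.
PHONE_RE = re.compile(r"\+?\d(?:[\s()-]*\d){9,14}")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(scrub("Call me at +7 (912) 345-67-89 or mail ivan.petrov@mail.ru"))
    # Call me at [PHONE] or mail [EMAIL]
```

Short digit runs such as order numbers are left alone, while any run of ten or more digits is redacted, which errs on the side of removing too much.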
Full scenario: from production database to dump
Never anonymize data directly in the production database. The procedure:
# 1. Copy the data into a separate anonymizer database
createdb anonymizer
pg_dump prod | psql anonymizer
-- 2. Anonymize everything in a single transaction
BEGIN;
UPDATE customer SET
email = 'user' || substring(md5(email), 1, 8) || '@example.test',
phone = '+7000' || lpad(floor(random() * 10000000)::int::text, 7, '0'),
first_name = 'Name' || id,
last_name = 'Surname' || id;
UPDATE customer_address SET
street = NULL,
building = NULL,
apartment = NULL;
UPDATE customer_document SET
number = '0000' || lpad(id::text, 6, '0'),
issued_by = 'TEST';
UPDATE payment SET
card_last4 = '0000',
card_holder_name = 'TEST';
DELETE FROM audit_log WHERE created_at < now() - interval '7 days';
COMMIT;
# 3. Make a dump of the anonymized database
pg_dump -Fc anonymizer > prod-anon-$(date +%F).dump
# 4. Hand it to the developer
# 5. Drop the anonymizer database — don't leave half-processed data around
dropdb anonymizer
Important: the anonymization script lives in the repository and goes through code review. Don't rewrite it from memory every time — that's a source of errors.
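One way to catch a forgotten column before the dump leaves the building is a small leak check over the anonymized rows. This is only a sketch: find_leaks and ALLOWED_DOMAINS are hypothetical names, the rows are assumed to be already loaded as dicts, and the check covers email-like strings only.

```python
import re

# Captures the domain part of anything that looks like an email address.
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
# Domains the anonymization script is expected to produce (assumption).
ALLOWED_DOMAINS = {"example.test", "example.com"}


def find_leaks(rows):
    """Yield (row_index, column, value) for strings holding a non-placeholder email."""
    for index, row in enumerate(rows):
        for column, value in row.items():
            if not isinstance(value, str):
                continue
            for match in EMAIL_RE.finditer(value):
                if match.group(1).lower() not in ALLOWED_DOMAINS:
                    yield index, column, value
                    break  # one report per field is enough


if __name__ == "__main__":
    sample = [
        {"id": 1, "email": "user1a2b3c4d@example.test", "note": "ok"},
        {"id": 2, "email": "masked@example.com", "note": "write to ivan@mail.ru"},
    ]
    for leak in find_leaks(sample):
        print(leak)
```

An empty result does not prove the data is clean; it only shows that this one pattern was not found.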
postgresql_anonymizer
postgresql_anonymizer is a PostgreSQL extension that adds built-in functions for generating realistic data and declarative masking rules.
CREATE EXTENSION anon CASCADE;
SELECT anon.init();
-- Declare masking rules
SECURITY LABEL FOR anon ON COLUMN customer.email
IS 'MASKED WITH FUNCTION anon.fake_email()';
SECURITY LABEL FOR anon ON COLUMN customer.last_name
IS 'MASKED WITH FUNCTION anon.fake_last_name()';
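-- Apply the declared rules in place (static masking). Run this only in the
-- anonymizer copy, never in production; then take a regular pg_dump.
-- The exact API depends on the extension version, so check its documentation.
SELECT anon.anonymize_table('customer');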
The extension also supports dynamic masking: certain roles see only masked data even in the live database. This is useful if you need to give a developer direct database access without the hassle of dumps.
Alternatives: Greenmask, ARX, custom scripts.
Re-identification: when masking the name isn't enough
A classic mistake: replace the name and email but leave the exact date of birth, occupation, and city intact. A combination like "doctor, born 12 March 1985, Kostroma" may well be unique, so the person can be found without a name.
This risk is called re-identification — identifying someone again through a combination of indirect attributes.
How to protect against it:
- generalize identifier fields (date of birth → year, city → region);
- remove or merge rare categories;
- for analytical data, apply k-anonymity: each row must be indistinguishable from at least k − 1 other rows across all identifying fields, so every combination of values occurs at least k times.
In practice, for most development tasks generalizing the fields is enough — k-anonymity is needed when handing data over for analytics.
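The k in k-anonymity can be measured directly: group the rows by their identifying fields and take the size of the smallest group. A minimal sketch, assuming the rows are dicts and using made-up sample data; k_anonymity is a hypothetical helper.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share the same
    values in all quasi-identifier columns (0 when there are no rows)."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


if __name__ == "__main__":
    # Made-up rows for illustration only.
    raw = [
        {"occupation": "doctor", "born": 1985, "city": "Kostroma"},
        {"occupation": "doctor", "born": 1987, "city": "Kostroma"},
        {"occupation": "teacher", "born": 1985, "city": "Kostroma"},
        {"occupation": "teacher", "born": 1989, "city": "Kostroma"},
    ]
    fields = ["occupation", "born", "city"]
    print(k_anonymity(raw, fields))  # 1: every row is unique

    # Generalize the year of birth to a decade and measure again.
    coarse = [{**row, "born": row["born"] // 10 * 10} for row in raw]
    print(k_anonymity(coarse, fields))  # 2: each row now has a twin
```

A result of 1 means at least one person is uniquely identifiable by the chosen fields; generalizing further or merging rare categories raises k.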
Common mistakes
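Handing over a dump without anonymization "just to a trustworthy person" — the law makes no exceptions for trustworthiness. Every copy of the data requires a legal basis (for example, the user's consent), and passing it to an outside party usually also calls for a data processing agreement (DPA).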
Masking only the name and forgetting about email or phone — anonymization only works if it covers all of the PII, not part of it.
Leaving the anonymizer database around after the dump — half-processed data remains a surface for a leak. Drop it right away.
Leaving free text (comments) unprocessed — it may contain phone numbers and names written by users.
Anonymizing directly in the production database — any error in the script will irreversibly change the data.
In short
- Handing over a production dump without anonymization breaks the law.
- For most development tasks, synthetic (generated) data is enough.
- PII: email, phone, full name, address, documents, date of birth, IP, card; indirect: geolocation, unique combinations.
- Six strategies: deletion, replacing with a constant, pseudonymization (md5), faker, shuffling, generalization.
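- Email is pseudonymized via md5: the same address always maps to the same pseudonym, but a plain unsalted hash can be matched against a list of known addresses, so mix in a secret salt.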
- Full scenario: copy of prod → anonymization script → dump → drop the anonymizer database.
- The anonymization script lives in the repository and goes through review.
- Re-identification is a real threat: masking the name isn't enough if the exact date of birth and city remain.
Further reading
- PostgreSQL Backup — how to make a dump and restore a database.
- PostgreSQL Extensions — installing and managing extensions, including postgresql_anonymizer.