Anonymizing a PostgreSQL Dump for a Developer

Java · Go · Node · Python

How to hand a developer a dump of a production database without breaking the law: what PII is, six anonymization strategies, a full script example, postgresql_anonymizer, and protection against re-identification.

Published June 27, 2026 ~7 min read Author: Вадим Викулин


A developer needs to reproduce a bug that only shows up on real data. The first impulse is to hand over a dump of the production database. The problem: that violates personal data law (GDPR in Europe, 152-FZ in Russia). Here we look at how to give a developer usable data without breaking the law.

Why you can't just hand over the dump as is

Every copy of production data is a new surface for a leak. If the dump ends up in the wrong place (a laptop, an unencrypted drive, a messenger), you take on legal liability and your users' privacy is violated. They never consented to their data being stored on an intern's work laptop.

The good news: almost all development tasks are solved with synthetic (generated) data. A real dump is rarely needed — only when reproducing a specific problem with a specific set of rows.

What PII is

PII (Personally Identifiable Information) is data that can be used to identify a person.

Direct identifiers — point unambiguously to a specific person:

first name, last name, middle name;
email and phone;
address, passport, SNILS, INN;
date of birth;
IP address;
bank card or account number.

Indirect identifiers — safe on their own, but together they let you pinpoint a person:

precise geolocation;
the combination "occupation + city + year of birth" (in a small town such a triple is already unique);
the browser's User-Agent and fingerprint fields.

Business data — not PII, but also sensitive:

B2B prices and discounts for specific clients;
internal moderator comments;
logs of employee actions.

Six anonymization strategies

Deletion — set NULL. Use this when the field isn't needed for debugging at all.

UPDATE customer SET address = NULL, apartment = NULL;

Replacing with a constant: every row gets the same value. This fits when you need the structure but not the content, and only if the column has no unique constraint, because identical values would violate it.

UPDATE customer SET email = 'masked@example.com';

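Pseudonymization via hashing: each unique email turns into a stable pseudonym. Relationships between rows are preserved (the same email → the same pseudonym). A bare md5 of the email has no key, so anyone holding a list of candidate addresses can hash them and match the results. Mixing in a secret salt that never ships with the dump closes that gap: without the salt the original can't be restored.

-- 'CHANGE_ME' stands for a secret salt; pass the real value in at run time
-- 16 hex characters keep accidental collisions unlikely
UPDATE customer
SET email = 'user' || substring(md5('CHANGE_ME' || email), 1, 16) || '@example.test'
WHERE email IS NOT NULL;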
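The same idea can be expressed on the application side with a keyed hash. The sketch below is an illustration, not the article's script: it swaps md5 for HMAC-SHA256 from the Python standard library, and the function name and the key value are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_email(email: str, key: bytes) -> str:
    """Map an email to a stable pseudonym using HMAC-SHA256 with a secret key."""
    # Normalize so that "A@x.com" and " a@x.com " map to the same pseudonym.
    normalized = email.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    # 16 hex characters = 64 bits, which keeps accidental collisions unlikely.
    return f"user{digest[:16]}@example.test"


# Assumption: in practice the key comes from a secret store, not from source code.
key = b"hypothetical-secret-key"
first = pseudonymize_email("alice@example.com", key)
second = pseudonymize_email("Alice@Example.com", key)
print(first == second)  # True: the same address always gets the same pseudonym
```

Whoever holds the key can recompute the mapping, so the key should be treated as sensitive as the original data and should not travel with the dump.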

Replacing with random data (faker) — instead of a real name, a random but realistic one is substituted. The dataset looks natural.
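A minimal sketch of this strategy, using only the Python standard library as a stand-in for a real faker library. The name pools, the function name, and the seeding scheme are assumptions made for illustration.

```python
import random

# Tiny illustrative pools; a real faker library ships far larger ones.
FIRST_NAMES = ["Anna", "Boris", "Clara", "Dmitri", "Elena", "Felix"]
LAST_NAMES = ["Ivanov", "Keller", "Moreau", "Novak", "Petrov", "Silva"]


def fake_name(customer_id: int, seed: int = 0) -> tuple[str, str]:
    """Return a realistic-looking (first, last) name derived only from the row id."""
    # Seeding per row makes the output repeatable across runs.
    # The real name is never read, so nothing of it can leak into the result.
    rng = random.Random(seed * 1_000_003 + customer_id)
    return rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)


print(fake_name(42) == fake_name(42))  # True: the same id gives the same fake name
```

Fake names can repeat between rows, so a column that must stay unique still needs something like the row id appended.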

Shuffling — values are permuted between rows. The distribution is preserved, but the link to a specific person is lost.
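Shuffling can be sketched as follows, assuming the rows have been loaded as a list of dictionaries; the function name and the sample data are hypothetical.

```python
import random
from typing import Optional


def shuffle_column(rows: list[dict], column: str, seed: Optional[int] = None) -> list[dict]:
    """Return copies of the rows with the values of one column permuted between rows."""
    values = [row[column] for row in rows]
    random.Random(seed).shuffle(values)
    # Build new dictionaries so the input rows stay untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


rows = [{"id": 1, "city": "Kostroma"}, {"id": 2, "city": "Moscow"}, {"id": 3, "city": "Kazan"}]
shuffled = shuffle_column(rows, "city", seed=7)
print(sorted(r["city"] for r in shuffled))  # ['Kazan', 'Kostroma', 'Moscow']
```

A permutation can leave some rows with their own value, and a rare value still reveals that someone in the table has it, so shuffling works best on large tables with common values.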

Generalization — a precise value is replaced with a less precise one: date of birth → year, city → region. Used for analytical data, where the distribution matters.
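A sketch of generalization on a single row. The column names born_on and city and the city-to-region table are assumptions for illustration; cities missing from the table collapse into one bucket.

```python
from datetime import date

# Hypothetical lookup table; a real one would cover every city in the data.
CITY_TO_REGION = {
    "Kostroma": "Kostroma Oblast",
    "Yaroslavl": "Yaroslavl Oblast",
}


def generalize(row: dict) -> dict:
    """Return a copy of the row with a coarser date of birth and location."""
    out = dict(row)
    # Keep only the year: every date becomes January 1 of that year.
    out["born_on"] = date(row["born_on"].year, 1, 1)
    # Unknown cities are merged into a single "Other" bucket.
    out["city"] = CITY_TO_REGION.get(row["city"], "Other")
    return out


print(generalize({"id": 1, "born_on": date(1985, 3, 15), "city": "Kostroma"}))
```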

How to mask specific fields

Email

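Pseudonymization: the same email always yields the same pseudonym, so rows that referred to one address still refer to one address. Uniqueness depends on the length of the hash fragment: 8 hex characters are only 32 bits, and collisions become likely as a table grows into the tens of thousands of rows, so keep 16 or more if the column has a unique constraint.

-- 'CHANGE_ME' stands for a secret salt kept out of the dump
UPDATE customer
SET email = 'user' || substring(md5('CHANGE_ME' || email), 1, 16) || '@example.test'
WHERE email IS NOT NULL;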

Phone

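A random number in Russian mobile format. Random values can repeat, and the same source number maps to a different result on every run, so this suits columns without a unique constraint: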

UPDATE customer
SET phone = '+7000' || lpad((random() * 10000000)::int::text, 7, '0')
WHERE phone IS NOT NULL;

First and last name

The simplest option is to append a suffix from the ID (uniqueness guaranteed):

UPDATE customer SET
    first_name = 'Name' || id,
    last_name  = 'Surname' || id;

If you need realistic names, use postgresql_anonymizer (more on that below) or test-data generation libraries on the application side: JavaFaker (Java), faker-js (Node/TypeScript), Faker (Python), gofakeit (Go).

Date of birth

Drop the day and month, keep the year; the age stays approximately right:

UPDATE customer SET born_on = date_trunc('year', born_on)::date;

Card data

Overwrite the cardholder-related fields and leave the technical ones untouched:

UPDATE payment SET
    card_last4 = '0000',
    card_holder_name = 'TEST';

Text fields (comments, descriptions)

Free text is the tricky case. A user might have written their phone number or address in a comment. The most reliable approach is to nullify it or replace it with a placeholder:

UPDATE order_comment SET body = '[REDACTED]' WHERE body IS NOT NULL;

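If the content matters for debugging, run NER (named entity recognition) through an external tool. The anon.partial() function from postgresql_anonymizer keeps only a chosen prefix and suffix of a value, which helps with structured strings but cannot find a name or phone number in the middle of free text.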
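A middle ground between wiping the text and full NER is pattern-based redaction. The sketch below is an assumption-laden illustration: the two regular expressions are deliberately simple, the function name is hypothetical, and names or street addresses written in free form will pass through untouched.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A phone-like run: optional "+", then 10 to 15 digits with optional separators.
PHONE_RE = re.compile(r"\+?\d(?:[\s().-]*\d){9,14}")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(redact("Call me at +7 (912) 345-67-89 or mail ivan@example.com"))
# Call me at [PHONE] or mail [EMAIL]
```

Because patterns only catch what they were written for, this is a complement to review, not a replacement for the '[REDACTED]' approach when the text is not needed.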

Full scenario: from production database to dump

Never anonymize data directly in the production database. The procedure:

# 1. Copy the data into a separate anonymizer database
createdb anonymizer
pg_dump prod | psql anonymizer

-- 2. Anonymize everything in a single transaction
BEGIN;

-- 'CHANGE_ME' stands for a secret salt; pass the real value in at run time
UPDATE customer SET
    email      = 'user' || substring(md5('CHANGE_ME' || email), 1, 16) || '@example.test',
    phone      = '+7000' || lpad((random()*10000000)::int::text, 7, '0'),
    first_name = 'Name' || id,
    last_name  = 'Surname' || id;

UPDATE customer_address SET
    street    = NULL,
    building  = NULL,
    apartment = NULL;

UPDATE customer_document SET
    number    = '0000' || lpad(id::text, 6, '0'),
    issued_by = 'TEST';

UPDATE payment SET
    card_last4        = '0000',
    card_holder_name  = 'TEST';

DELETE FROM audit_log WHERE created_at < now() - interval '7 days';

COMMIT;

# 3. Make a dump of the anonymized database
pg_dump -Fc anonymizer > prod-anon-$(date +%F).dump

# 4. Hand it to the developer

# 5. Drop the anonymizer database — don't leave half-processed data around
dropdb anonymizer

Important: the anonymization script lives in the repository and goes through code review. Don't rewrite it from memory every time — that's a source of errors.
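A script that lives in the repository can be paired with a check that runs before the dump leaves the machine. A minimal sketch, assuming the rows have already been fetched as dictionaries and that masked emails follow the user<hex>@example.test pattern used in the script; the function name is hypothetical.

```python
import re

MASKED_EMAIL_RE = re.compile(r"user[0-9a-f]+@example\.test")


def unmasked_emails(rows: list[dict], column: str = "email") -> list[str]:
    """Return the non-null values in `column` that do not look like masked emails."""
    return [
        row[column]
        for row in rows
        if row[column] is not None and not MASKED_EMAIL_RE.fullmatch(row[column])
    ]


sample = [
    {"email": "user1a2b3c4d@example.test"},
    {"email": None},
    {"email": "ivan@corp.example"},
]
print(unmasked_emails(sample))  # ['ivan@corp.example']
```

An empty result does not prove the dump is clean, it only shows that this one column matches the expected pattern.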

postgresql_anonymizer

postgresql_anonymizer is a PostgreSQL extension that adds built-in functions for generating realistic data and declarative masking rules.

CREATE EXTENSION anon CASCADE;
SELECT anon.init();

-- Declare masking rules
SECURITY LABEL FOR anon ON COLUMN customer.email
    IS 'MASKED WITH FUNCTION anon.fake_email()';

SECURITY LABEL FOR anon ON COLUMN customer.last_name
    IS 'MASKED WITH FUNCTION anon.fake_last_name()';

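-- Apply the rules in place on the copy (static masking), then dump it with pg_dump.
-- anon.dump() comes from early releases of the extension and may be missing from the installed version.
SELECT anon.anonymize_table('customer');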

The extension also supports dynamic masking: certain roles see only masked data even in the live database. This is useful if you need to give a developer direct database access without the hassle of dumps.

Alternatives: Greenmask, ARX, custom scripts.

Re-identification: when masking the name isn't enough

A classic mistake: replace the name and email but leave the exact date of birth, occupation, and city intact. In a small town the combination "doctor, born 1985, Kostroma" may be unique — the person can be found without a name.

This risk is called re-identification — identifying someone again through a combination of indirect attributes.

How to protect against it:

generalize identifier fields (date of birth → year, city → region);
remove or merge rare categories;
for analytical data, apply k-anonymity: each row must be indistinguishable from at least k − 1 others across the identifying fields, so every combination of their values occurs at least k times.

In practice, for most development tasks generalizing the fields is enough — k-anonymity is needed when handing data over for analytics.
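The k of a dataset can be measured directly: group the rows by the identifying fields and take the size of the smallest group. A sketch with hypothetical column names and sample rows:

```python
from collections import Counter


def k_anonymity(rows: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    # An empty dataset has no groups; report 0 in that case.
    return min(groups.values()) if groups else 0


rows = [
    {"occupation": "doctor", "birth_year": 1985, "city": "Kostroma"},
    {"occupation": "teacher", "birth_year": 1990, "city": "Moscow"},
    {"occupation": "teacher", "birth_year": 1990, "city": "Moscow"},
]
print(k_anonymity(rows, ["occupation", "birth_year", "city"]))  # 1
```

A result of 1 means at least one row is unique on those fields and needs further generalization or removal before the data is shared.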

Common mistakes

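Handing over a dump without anonymization "just to a trustworthy person": the law makes no exception for trustworthiness. Every copy of the data needs a legal basis for processing, such as consent, and a transfer to an outside party typically also calls for a data processing agreement (DPA).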

Masking only the name and forgetting about email or phone — anonymization only works if it covers all of the PII, not part of it.

Leaving the anonymizer database around after the dump — half-processed data remains a surface for a leak. Drop it right away.

Leaving free text (comments) unprocessed — it may contain phone numbers and names written by users.

Anonymizing directly in the production database — any error in the script will irreversibly change the data.

In short

Handing over a production dump without anonymization breaks the law.
For most development tasks, synthetic (generated) data is enough.
PII: email, phone, full name, address, documents, date of birth, IP, card; indirect: geolocation, unique combinations.
Six strategies: deletion, replacing with a constant, pseudonymization (md5), faker, shuffling, generalization.
Email is pseudonymized via md5: linkage between rows is preserved, but add a secret salt, because a bare hash can be reversed by hashing candidate addresses.
Full scenario: copy of prod → anonymization script → dump → drop the anonymizer database.
The anonymization script lives in the repository and goes through review.
Re-identification is a real threat: masking the name isn't enough if the exact date of birth and city remain.