← Back to the section

Changing a database schema without taking production down is a discipline of its own. Let's look at why plain ALTER TABLE statements are dangerous under live traffic, and how to roll out changes correctly.

Why ALTER TABLE blocks everything

Imagine you want to add a column to a table with tens of millions of rows. You write ALTER TABLE orders ADD COLUMN priority integer NOT NULL DEFAULT 0 — and production freezes for several minutes.

The reason: for most schema changes PostgreSQL takes an ACCESS EXCLUSIVE lock. This lock is the strictest one: it lets no one in — not SELECT, not INSERT, not UPDATE. Worse still, it queues up behind all current queries. If a long-running report is in flight at the moment of the migration, the ALTER TABLE waits for it, and behind it a queue of all new queries piles up.

Hence the first rule: every migration starts with SET LOCAL lock_timeout = '3s'. If the lock can't be acquired within 3 seconds, the migration fails with an error instead of hanging production.

BEGIN;
SET LOCAL lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN priority integer;
COMMIT;

Breaking change and N-1 compatibility

Not all schema changes are equally dangerous.

Safe changes (can be applied without special measures):

  • ADD COLUMN ... NULL — a column without the NOT NULL constraint
  • ADD COLUMN ... NOT NULL DEFAULT 'x' (PostgreSQL 11+, a constant value)
  • CREATE TABLE
  • ADD CONSTRAINT ... NOT VALID

Dangerous changes (require a special approach):

  • DROP COLUMN, RENAME COLUMN
  • Changing a column's type (ALTER TYPE)
  • ADD COLUMN NOT NULL without a default value on a large table
  • Removing a value from an enum type

Dangerous changes are also called a breaking change — they can break code that is already running. And the problem is not only the lock, but also the timing of the deploy. A typical deploy: first the migration is applied, then the new version of the application starts. Between them there is a window where the old version of the code runs against the new schema. If the schema changed incompatibly, the old code will crash.

This rule is called N-1 compatibility: a migration must be compatible with the previous version of the code.

Expand-Contract: three releases instead of one

To make a dangerous change safely, you split it across several releases. This pattern is called Expand-Contract (expand — migrate — contract).

Release 1 — expand:      add the new structure (column, table)
                         code writes to the old place, optionally to the new one too

Release 2 — migrate:     move data from old to new (backfill)
                         code writes to both places, reads from the old one

Release 3 — read new:    code reads from the new one, writes to both

Release 4 — contract:    remove the old one, drop it from the code

Each release can be rolled back independently. If something goes wrong in release 2, you can return to release 1 without losing data.

How to add a NOT NULL column

PostgreSQL 11 and newer can add a column with a constant default value instantly — the value is stored "virtually" in the metadata rather than written into every row:

-- Instant even on 100M rows
ALTER TABLE orders ADD COLUMN priority integer NOT NULL DEFAULT 0;

But if the default value is dynamic (for example, NOW() or gen_random_uuid()), PostgreSQL can't store it virtually and will go rewrite every row. Here you need expand-contract:

  1. Add the column without NOT NULL
  2. Fill existing rows in batches (see the section on backfill)
  3. Set the constraint the safe way:
-- Step 3a: add a CHECK constraint without validating old rows
ALTER TABLE orders ADD CONSTRAINT ck_orders_priority_not_null
    CHECK (priority IS NOT NULL) NOT VALID;

-- Step 3b: validate existing rows (does not block writes)
ALTER TABLE orders VALIDATE CONSTRAINT ck_orders_priority_not_null;

-- Step 3c: now SET NOT NULL is cheap — PG knows there are no NULLs
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;

-- Step 3d: drop the temporary CHECK
ALTER TABLE orders DROP CONSTRAINT ck_orders_priority_not_null;

CHECK NOT VALID executes instantly. VALIDATE takes a gentle SHARE UPDATE EXCLUSIVE lock, which does not interfere with normal queries.

How to rename a column

RENAME COLUMN can't be done in a single release. The old code will break immediately, because it will reference a column that no longer exists.

The safe sequence:

  1. Add the new column (ADD COLUMN new_name <type>)
  2. Code starts writing to both columns
  3. Move existing data in batches
  4. Code starts reading from the new column
  5. Code stops writing to the old one
  6. Drop the old column (DROP COLUMN old_name)

For read-only scenarios there is a simpler option — a view wrapper:

ALTER TABLE orders RENAME COLUMN old_name TO new_name;
CREATE VIEW orders_v AS SELECT *, new_name AS old_name FROM orders;

The old code reads from the view and sees the old name, while the new code works directly against the table.

How to change a column's type

ALTER COLUMN ... SET DATA TYPE with a type cast rewrites the whole table — the result is the same as ALTER TABLE on millions of rows. An ACCESS EXCLUSIVE lock held for a long time.

Exceptions that work instantly (without a rewrite):

  • varchartext (widening without a cast)
  • varchar(50)varchar(100) (widening only)

For everything else (for example, integerbigint) you need expand-contract: add a shadow column of the new type, fill it with data, switch the code over.

How to add a foreign key

A plain ADD FOREIGN KEY locks both tables and validates all existing rows. On large tables, that's slow.

The safe way is two-step:

-- Step 1: create the FK without validating existing rows (instant)
ALTER TABLE order_item
  ADD CONSTRAINT fk_order_item_order_id
  FOREIGN KEY (order_id) REFERENCES orders(id)
  NOT VALID;

-- Step 2: validate existing rows (does not block writes)
ALTER TABLE order_item VALIDATE CONSTRAINT fk_order_item_order_id;

NOT VALID means: new rows will be validated immediately, old ones at VALIDATE time.

How to add an index

CREATE INDEX without extra keywords takes a SHARE lock, which blocks INSERT/UPDATE/DELETE. On a large table, that's several minutes with no writes.

The solution is CREATE INDEX CONCURRENTLY. It builds the index in several passes without locking:

CREATE INDEX CONCURRENTLY ix_orders_status ON orders (status);

DROP INDEX also takes a hard lock — use DROP INDEX CONCURRENTLY.

If CREATE INDEX CONCURRENTLY was interrupted, it leaves an INVALID index behind. You need to drop it and create it again:

DROP INDEX CONCURRENTLY IF EXISTS ix_orders_status;
CREATE INDEX CONCURRENTLY ix_orders_status ON orders (status);

An important detail for migration tools (Liquibase, Flyway): CONCURRENTLY can't run inside a transaction. Changesets with indexes must have runInTransaction="false":

<changeSet id="20260507-add-status-index" runInTransaction="false">
    <sql>CREATE INDEX CONCURRENTLY ix_orders_status ON orders (status);</sql>
    <rollback>DROP INDEX CONCURRENTLY IF EXISTS ix_orders_status;</rollback>
</changeSet>

How to remove a value from an enum

PostgreSQL has no REMOVE VALUE FROM ENUM command. You can't remove a value natively.

The only way is to create a new type without the unwanted value and switch the column over to it:

  1. CREATE TYPE order_status_v2 AS ENUM ('NEW', 'PAID', 'SHIPPED') (without the value being removed)
  2. ADD COLUMN status_v2 order_status_v2 NULL
  3. Move data in batches: UPDATE orders SET status_v2 = status::text::order_status_v2
  4. Switch the code over to reading from and writing to status_v2
  5. DROP COLUMN status, RENAME COLUMN status_v2 TO status, DROP TYPE order_status

Adding a value (ADD VALUE) is instant (PostgreSQL 12+), but you can't use the new value in the same transaction. It must be a separate changeset.

Moving data in batches

A large UPDATE in a single transaction is a bad idea: the transaction holds locks for its entire duration, accumulates WAL, and interferes with autovacuum.

The right approach is to update in small chunks with a commit between them:

DO $$
DECLARE rows_updated integer := 1;
BEGIN
    WHILE rows_updated > 0 LOOP
        UPDATE orders SET priority = 0
        WHERE id IN (
            SELECT id FROM orders
            WHERE priority IS NULL
            LIMIT 10000
        );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        COMMIT;
        PERFORM pg_sleep(0.1);
    END LOOP;
END $$;

For very large tables an SQL batch inside the migration still occupies a single connection and holds up the deploy. A more reliable option is a background job in the application code: it runs independently of the deploy, and it can be stopped and restarted. Use FOR UPDATE SKIP LOCKED to avoid conflicting with user queries:

@Component
public class BackfillPriorityJob {

    private final JdbcTemplate jdbc;

    public BackfillPriorityJob(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Scheduled(cron = "0 * * * * *")
    public void backfillPriority() {
        int updated;
        do {
            updated = jdbc.update("""
                UPDATE orders SET priority = 0
                WHERE id IN (
                    SELECT id FROM orders
                    WHERE priority IS NULL
                    LIMIT 10000
                    FOR UPDATE SKIP LOCKED
                )
                """);
        } while (updated == 10000);
    }
}

Rolling back migrations barely works

The <rollback> section in Liquibase or Flyway creates an illusion of safety. In practice, rolling back a migration on a production database is almost impossible:

  • If a column was dropped, the data is lost
  • If data integrity was violated, a rollback won't restore the state
  • A single migration in an expand-contract chain can't be rolled back without breaking the whole chain

The real options when there's a problem:

  • Forward-fix — write a new migration that corrects the situation
  • Restore from a backup — if data was lost

This is exactly why expand-contract matters so much: each step is safe on its own and doesn't break the previous state.

squawk — a linter for migrations

squawk is a static analysis tool for SQL migrations. It finds dangerous patterns before they reach production:

  • ADD COLUMN with DEFAULT on PostgreSQL below 11
  • CREATE INDEX without CONCURRENTLY
  • ADD FOREIGN KEY without NOT VALID
  • ALTER TYPE with a table rewrite
  • column and table renames
squawk db/changelog/v0042.sql

It's worth adding squawk to a pre-commit hook and to CI — that way dangerous migrations won't pass review unnoticed.

In short

  • ALTER TABLE takes ACCESS EXCLUSIVE — it blocks everything. On a large table that's minutes with no service.
  • Start every migration with SET LOCAL lock_timeout = '3s' — better to fail fast than to hang for a long time.
  • N-1 compatibility: a migration must also work with the old code.
  • Expand-Contract — a way to make a dangerous change over 3–4 releases without stopping the service.
  • NOT NULL on an existing column — via CHECK NOT VALID + VALIDATE + SET NOT NULL.
  • ADD FOREIGN KEY — via NOT VALID + VALIDATE separately.
  • Indexes — always CONCURRENTLY with runInTransaction="false".
  • You can't remove an enum value natively — you need a new type and a shadow column.
  • A large UPDATE — in batches of 10K rows with SKIP LOCKED, ideally as a background job.
  • Rolling back a migration in production doesn't work — design for a forward-fix.
  • PostgreSQL indexes — index types, GIN, BRIN, partial indexes.
  • PostgreSQL locks — how ACCESS EXCLUSIVE and other lock levels work.
  • PostgreSQL data types — enum, jsonb, uuid, and when to choose what.