WAL in PostgreSQL: what it is and how to control the journal size

← Back to the section

When an application runs an INSERT or UPDATE, the data does not fly straight to disk. PostgreSQL first records the change in a special file — the WAL. This is what guarantees durability: if the server crashes, the database recovers from this journal. But the WAL is not just an "insurance policy": its size affects write speed, disk usage, and how fast replication runs.

Let's look at what happens under the hood and which developer decisions directly affect the journal size.

How PostgreSQL writes data

Imagine a notebook in which a chef writes down every recipe change before applying it in the kitchen. If the kitchen burns down, the notebook helps restore everything from the last entry. The WAL is that notebook for the database.

When you run COMMIT, the following happens:

INSERT INTO orders ...; COMMIT;

1. PostgreSQL changes a page in memory (shared buffers).
2. Writes the change to the WAL buffer (also memory).
3. On COMMIT — flushes the WAL buffer to disk (fsync).
4. Only after fsync does it answer the client "OK".
5. Later, at a checkpoint, the changed page
   is flushed to disk separately.

The key point: fsync happens on every COMMIT. This is a synchronous operation — until the disk confirms the write, the database does not respond. The speed of this fsync is the upper bound on write speed.

How much WAL each operation generates

Here is where many people are surprised by the behavior of UPDATE. In most databases an update writes only the changed field. PostgreSQL writes the entire new row.

If a row has 30 columns and you update one, all 30 go into the WAL. This is due to how MVCC works: every update is a new physical instance of the row, not a patch on top of the old one.

Approximate WAL record sizes:

Operation	What is written to WAL
`INSERT` of one row	~50–200 bytes + column values
`UPDATE` (regular)	~50 bytes + the entire new row
`UPDATE` with HOT	~50 bytes + only the changed fields
`DELETE`	~50 bytes + row identifier
`CREATE INDEX` without CONCURRENTLY	the entire index
`VACUUM`	list of freed row versions

Note the UPDATE with HOT — this is a special mode that we will cover below. It dramatically changes the picture for frequently updated tables.

HOT and fillfactor: the main lever for update-heavy load

A regular UPDATE creates a new row version — and if the indexed columns did not change, PostgreSQL still updates the indexes. That is double the work: more WAL and slower indexes.

HOT (Heap-Only Tuple) is an optimization in which the new row version references the old one right inside the page, without involving indexes. Conditions:

The new row version fits on the same memory page.
The change did not touch any indexed column.

If both conditions are met, PostgreSQL does not touch the indexes and writes only the delta to the WAL. This can cut WAL by several times.

The problem: with fillfactor = 100 (the default), the page is packed to the brim, and there is no room for the new row version next to the old one. HOT does not kick in.

The fix is fillfactor 80–90: leave 10–20% of the page free specifically for updates.

-- when creating the table
CREATE TABLE orders (...) WITH (fillfactor = 90);

-- for an existing table
ALTER TABLE orders SET (fillfactor = 80);
VACUUM FULL orders;  -- rewrites the data with the new fillfactor

Check how often HOT kicks in:

SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 1000
ORDER BY n_tup_upd DESC
LIMIT 10;

If hot_pct is below 50% on an actively updated table, it is worth lowering fillfactor. The cost is 10–20% of extra disk space; the payoff is a multiple reduction in WAL and index load.

An important limitation: HOT does not work if the UPDATE changes a column that has an index. If such a column changes often and the index is weakly selective, consider dropping that index.

COPY instead of INSERT: 10–100x faster for bulk loading

Suppose you need to load a million rows. Intuitively people write a loop of INSERTs:

INSERT INTO orders (id, customer_id, total) VALUES (...);
-- repeated a million times

Each INSERT is a separate transaction, a separate fsync, a separate network round-trip. A million rows = a million fsyncs. This is very slow.

Much more efficient: batch the rows within a single transaction. Even more efficient — use COPY, purpose-built for bulk loading:

COPY orders (id, customer_id, total) FROM STDIN;

COPY streams the data in one flow, generates minimal WAL, and gets by with a single fsync for the whole load. On a million rows the difference is 10–100x in time and 5–10x in WAL volume.

How to use COPY from code:

import java.io.Reader;
import java.io.StringReader;
import java.sql.Connection;
import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

void copyInsert(Connection conn, String csvData) throws Exception {
    CopyManager cm = new CopyManager((BaseConnection) conn);
    try (Reader reader = new StringReader(csvData)) {
        cm.copyIn("COPY orders (id, customer_id, total) FROM STDIN WITH CSV", reader);
    }
}

import (
    "context"
    "github.com/jackc/pgx/v5"
    "github.com/jackc/pgx/v5/pgxpool"
)

// pgx.CopyFrom is translated into COPY STDIN
func bulkInsert(ctx context.Context, pool *pgxpool.Pool, orders []Order) (int64, error) {
    rows := pgx.CopyFromSlice(len(orders), func(i int) ([]any, error) {
        return []any{orders[i].ID, orders[i].CustomerID, orders[i].Total}, nil
    })
    return pool.CopyFrom(ctx,
        pgx.Identifier{"orders"},
        []string{"id", "customer_id", "total"},
        rows,
    )
}

import { Pool } from 'pg';
import { from as copyFrom } from 'pg-copy-streams';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function copyInsert(csvData: string): Promise<void> {
    const client = await pool.connect();
    try {
        const stream = client.query(
            copyFrom('COPY orders (id, customer_id, total) FROM STDIN WITH CSV'),
        );
        await pipeline(Readable.from([csvData]), stream);
    } finally {
        client.release();
    }
}

import asyncpg

async def bulk_insert(pool: asyncpg.Pool, orders: list[dict]) -> None:
    records = [(o["id"], o["customer_id"], o["total"]) for o in orders]
    async with pool.acquire() as conn:
        await conn.copy_records_to_table(
            "orders",
            records=records,
            columns=["id", "customer_id", "total"],
        )

If COPY is not suitable for some reason, at least batch the rows into groups of 1–10 thousand within a single transaction. One large COMMIT with 10,000 inserts is cheaper than 10,000 small ones.

Another trick for bulk loading: drop the indexes before loading, load, then recreate the indexes. Each insert with an active index writes to the WAL not only the table row but also an update to every index.

DROP INDEX ix_orders_customer;
COPY orders FROM '/tmp/data.csv';
CREATE INDEX CONCURRENTLY ix_orders_customer ON orders (customer_id);

UNLOGGED tables: zero WAL for temporary data

Some data does not require durability. Session caches, intermediate ETL results, task queues with a short lifetime — if the server crashes, they are easy to rebuild.

For such data PostgreSQL offers UNLOGGED tables: changes in them are not written to the WAL at all.

CREATE UNLOGGED TABLE session_cache (
    session_id uuid PRIMARY KEY,
    payload    jsonb NOT NULL,
    expires_at timestamptz NOT NULL
);

The flip side: such tables are not replicated, and on an abnormal PostgreSQL shutdown their contents are truncated. For business data this is unacceptable. For caches and temporary tables it is an excellent choice.

TOAST: why large fields are not always expensive

PostgreSQL automatically stores large values (long text, JSON) in a separate TOAST table. An important detail: if a large value did not change during an UPDATE, it is not rewritten to the WAL.

-- update only the view counter
UPDATE article SET view_count = view_count + 1 WHERE id = ?;
-- the body field (50 KB of text) is stored in TOAST and is NOT written to WAL

This is an argument against stuffing everything into one large jsonb field: if you update one small key inside a JSONB, PostgreSQL does not know that the rest is unchanged and writes the whole object to the WAL.

synchronous_commit: speed vs durability

By default PostgreSQL waits for confirmation from the disk before each COMMIT. This guarantees that data will not be lost even in a crash.

The parameter synchronous_commit = off removes this wait: the database answers "OK" immediately, and the data reaches the disk with a delay of about 200 ms. This noticeably increases write throughput.

The cost: if the database crashes within those 200 ms, the last committed transactions may be lost. Database consistency is not violated by this — only the data from that window is lost.

-- enable only for a specific operation
SET LOCAL synchronous_commit = off;
INSERT INTO metrics_event ...;
COMMIT;

Applicable for metrics, analytics, logs. Never — for financial operations and orders.

Long transactions: the silent killer

PostgreSQL cannot delete WAL files while they are needed by at least one open transaction. If a transaction hangs for an hour, the WAL for that hour goes nowhere and the disk fills up.

That same open transaction blocks VACUUM: it cannot remove old row versions that this transaction can still see. Tables bloat.

Typical causes of long transactions:

the transaction spans a network call (an HTTP request to another service, a write to S3, sending to a queue);
a developer opened a transaction in the console and forgot to close it;
a batch job processes a million rows without intermediate commits.

View the current long transactions:

SELECT pid,
       age(now(), xact_start) AS xact_age,
       state,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC
LIMIT 10;

Rule: a transaction should take seconds, not minutes. Set up an alert on xact_age > 5 min.

Replication slot: WAL can accumulate indefinitely

A replication slot is a mechanism that guarantees PostgreSQL will not delete WAL until a subscriber (a replica or a logical consumer) has read it. This is convenient but dangerous.

If a replica falls off or a logical consumer stops, WAL will pile up indefinitely until the disk runs out of space.

Check the state of the slots:

SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
ORDER BY lag_bytes DESC;

Set up an alert: lag_bytes > 10 GB or active = false for longer than an hour.

In PostgreSQL 13+ there is a parameter max_slot_wal_keep_size: if a slot falls behind by more than the given limit, PostgreSQL deletes the WAL (the slot breaks, but the cluster does not go down).

WAL monitoring

A few queries worth adding to monitoring:

-- size of the pg_wal/ directory
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- WAL write statistics
SELECT * FROM pg_stat_wal;
-- wal_records, wal_bytes, wal_buffers_full, wal_write_time, wal_sync_time

What to watch for:

the size of pg_wal/ — alert when a threshold is exceeded;
growth of wal_bytes — a sign of new load;
growth of wal_buffers_full — a signal to increase the wal_buffers parameter;
HOT ratio below 50% on hot tables;
transactions longer than 5 minutes;
replication slot lag.

In short

WAL is a change journal written before the answer to COMMIT. The fsync speed determines the maximum write speed.
UPDATE writes the entire new row to the WAL, not just the changed field — unless HOT kicks in.
HOT cuts WAL down to the delta: you need free space on the page and no changes to indexed columns. Enabled via fillfactor 80–90.
COPY is 10–100x faster than a loop of INSERTs in time and 5–10x in WAL.
UNLOGGED tables write no WAL at all — for caches and temporary data this is fine.
Large TOAST fields are not rewritten to the WAL if they did not change.
synchronous_commit = off increases throughput, but data from a 200 ms window may be lost in a crash. Only for non-critical data.
Long transactions hold WAL and block VACUUM. A transaction should take seconds, not minutes.
A hanging replication slot accumulates WAL indefinitely. You need an alert on the lag.