PostgreSQL replication: read-replica and query routing

← Back to the section

The application grows, and the primary database starts choking on heavy SELECTs — reports, analytics, search. While an analytical query scans millions of rows, ordinary OLTP operations wait on locks and slow down. The fix is to put a replica next to it and route reads there.

How streaming replication works

PostgreSQL writes every change to the WAL (Write-Ahead Log). The replica continuously receives this log from the primary and replays it on its side — this way the data on the replica mirrors the data on the primary.

A few concepts worth knowing:

Master (primary) — the single node that accepts writes: INSERT, UPDATE, DELETE.
Replica (standby, hot standby) — replays WAL from the primary, answers reads only.
Replication lag — the delay between a write on the primary and the data appearing on the replica. Under normal conditions — 50–500 milliseconds; under load or with large transactions — up to several seconds.

By default, replication is asynchronous: the primary does not wait for confirmation from the replica before responding to the client. This is fast, but it means the replica lags slightly behind.

Why you need a read-replica

Three main scenarios:

Offloading the primary. Heavy SELECTs move to the replica and don't interfere with OLTP operations. Analogy: opening a second checkout lane for slow shoppers so the fast queue doesn't stall.

High availability (HA). If the primary goes down, the replica can be promoted to a new primary (failover). No data is lost, and the application keeps working.

Geo-distribution. A replica is brought up in another data center or region, close to users — read latency drops.

What you shouldn't do with a replica: read data right after a write expecting a fresh result — the replica lags behind and may not know about the row you just inserted. More on this below.

Query routing

The application keeps two connection pools — one to the primary, the other to the replica. Queries within a read-only transaction go to the replica, the rest go to the primary.

An important subtlety: the choice of source must be deferred until the first query, not until the connection is opened. Otherwise the "read-only" flag isn't known yet and routing won't work correctly.

// HikariCP + AbstractRoutingDataSource + LazyConnectionDataSourceProxy
public enum DataSourceType { MASTER, REPLICA }

@Component
public class TransactionRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
            ? DataSourceType.REPLICA
            : DataSourceType.MASTER;
    }
}

@Configuration
public class DataSourceConfig {

    @Bean @Primary
    public DataSource routingDataSource(DataSource master, DataSource replica) {
        var routing = new TransactionRoutingDataSource();
        routing.setTargetDataSources(Map.of(
            DataSourceType.MASTER, master,
            DataSourceType.REPLICA, replica
        ));
        routing.setDefaultTargetDataSource(master);
        return new LazyConnectionDataSourceProxy(routing);
    }
}

// Usage:
@Transactional(readOnly = true)
public List<OrderView> findOrders(long customerId) {
    // goes to the replica
}

@Transactional
public OrderId createOrder(CreateOrderCommand cmd) {
    // goes to the primary
}

// pgxpool: two pools, selected via context
type DB struct {
    Master  *pgxpool.Pool
    Replica *pgxpool.Pool
}

type ctxKey string
const readOnlyKey ctxKey = "readOnly"

func WithReadOnly(ctx context.Context) context.Context {
    return context.WithValue(ctx, readOnlyKey, true)
}

func (db *DB) Pool(ctx context.Context) *pgxpool.Pool {
    if v, ok := ctx.Value(readOnlyKey).(bool); ok && v {
        return db.Replica
    }
    return db.Master
}

// Usage:
func (r *OrderRepo) FindOrders(ctx context.Context, customerID int64) ([]Order, error) {
    rows, err := r.db.Pool(WithReadOnly(ctx)).Query(ctx,
        "SELECT id, status FROM orders WHERE customer_id = $1", customerID)
    // ...
}

func (r *OrderRepo) CreateOrder(ctx context.Context, cmd CreateOrderCmd) (int64, error) {
    var id int64
    err := r.db.Pool(ctx).QueryRow(ctx,
        "INSERT INTO orders (customer_id) VALUES ($1) RETURNING id", cmd.CustomerID,
    ).Scan(&id)
    return id, err
}

// node-postgres (pg): two Pools, selected by a function
const pools = {
    master:  new Pool({ connectionString: process.env.DB_MASTER_URL }),
    replica: new Pool({ connectionString: process.env.DB_REPLICA_URL }),
};

function getPool(readOnly: boolean): Pool {
    return readOnly ? pools.replica : pools.master;
}

// Usage:
export async function findOrders(customerId: bigint): Promise<Order[]> {
    const { rows } = await getPool(true).query<Order>(
        'SELECT id, status FROM orders WHERE customer_id = $1',
        [customerId],
    );
    return rows;
}

export async function createOrder(cmd: CreateOrderCmd): Promise<bigint> {
    const { rows } = await getPool(false).query<{ id: bigint }>(
        'INSERT INTO orders (customer_id) VALUES ($1) RETURNING id',
        [cmd.customerId],
    );
    return rows[0].id;
}

# psycopg (v3): two pools via AsyncConnectionPool
master_pool  = AsyncConnectionPool(conninfo=MASTER_DSN, open=False)
replica_pool = AsyncConnectionPool(conninfo=REPLICA_DSN, open=False)

def get_pool(read_only: bool) -> AsyncConnectionPool:
    return replica_pool if read_only else master_pool

# Usage:
async def find_orders(customer_id: int) -> list[dict]:
    async with get_pool(read_only=True).connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                "SELECT id, status FROM orders WHERE customer_id = %s",
                (customer_id,),
            )
            return await cur.fetchall()

async def create_order(customer_id: int) -> int:
    async with get_pool(read_only=False).connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                "INSERT INTO orders (customer_id) VALUES (%s) RETURNING id",
                (customer_id,),
            )
            row = await cur.fetchone()
            return row[0]

Read-after-write: a common trap

A typical scenario: a user creates an order, and the application immediately shows them the list of orders. The write goes to the primary, while the read goes to the replica, which hasn't received the fresh WAL yet. The order that was just created won't appear in the response.

createOrder(req)   → primary  ✓
listOrders(userId) → replica  ✗  (the order may be missing)

Three ways to work around this:

Read from the primary after a write

The simplest option for pages where the user expects fresh data immediately after their action.

@Transactional   // no readOnly=true — will go to the primary
public List<Order> myOrdersFromMaster(long customerId) {
    return orderRepo.findByCustomerId(customerId);
}

// ctx without WithReadOnly — the primary pool is selected
func (r *OrderRepo) MyOrdersFromMaster(ctx context.Context, customerID int64) ([]Order, error) {
    rows, err := r.db.Pool(ctx).Query(ctx,
        "SELECT id, status FROM orders WHERE customer_id = $1", customerID)
    // ...
}

// getPool(false) — explicitly the primary
export async function myOrdersFromMaster(customerId: bigint): Promise<Order[]> {
    const { rows } = await getPool(false).query<Order>(
        'SELECT id, status FROM orders WHERE customer_id = $1',
        [customerId],
    );
    return rows;
}

# read_only=False — explicitly the primary
async def my_orders_from_master(customer_id: int) -> list[dict]:
    async with get_pool(read_only=False).connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                "SELECT id, status FROM orders WHERE customer_id = %s",
                (customer_id,),
            )
            return await cur.fetchall()

Return the data directly from the write operation

The data is already in memory after the INSERT — there's no need for a separate SELECT. RETURNING in PostgreSQL returns the inserted row right within the same transaction on the primary.

// jOOQ: INSERT ... RETURNING returns the record from the primary
public OrderResponse createOrder(CreateOrderCommand cmd) {
    OrdersRecord saved = dsl
        .insertInto(ORDERS)
        .set(ORDERS.CUSTOMER_ID, cmd.customerId())
        .returning()
        .fetchOne();
    return OrderResponse.from(saved);
}

func (r *OrderRepo) CreateOrder(ctx context.Context, cmd CreateOrderCmd) (*Order, error) {
    var o Order
    err := r.db.Pool(ctx).QueryRow(ctx,
        `INSERT INTO orders (customer_id) VALUES ($1)
         RETURNING id, customer_id, created_at`,
        cmd.CustomerID,
    ).Scan(&o.ID, &o.CustomerID, &o.CreatedAt)
    return &o, err
}

export async function createOrder(cmd: CreateOrderCmd): Promise<Order> {
    const { rows } = await getPool(false).query<Order>(
        `INSERT INTO orders (customer_id) VALUES ($1)
         RETURNING id, customer_id, created_at`,
        [cmd.customerId],
    );
    return rows[0];
}

async def create_order(customer_id: int) -> dict:
    async with get_pool(read_only=False).connection() as conn:
        async with conn.cursor(row_factory=dict_row) as cur:
            await cur.execute(
                """INSERT INTO orders (customer_id) VALUES (%s)
                   RETURNING id, customer_id, created_at""",
                (customer_id,),
            )
            return await cur.fetchone()

Wait until the replica catches up to the primary

A more complex approach: after a write, get the current WAL position on the primary (LSN) and poll the replica until it has replayed up to that position. Suitable for rare, specific cases where neither the first nor the second option applies.

String lsn = masterJdbc.queryForObject(
    "SELECT pg_current_wal_lsn()", String.class);

do {
    String replayLsn = replicaJdbc.queryForObject(
        "SELECT pg_last_wal_replay_lsn()", String.class);
    if (lsnGte(replayLsn, lsn)) break;
    Thread.sleep(50);
} while (true);

func waitForReplica(ctx context.Context, db *DB, lsn string) error {
    for {
        var replayLSN string
        err := db.Replica.QueryRow(ctx,
            "SELECT pg_last_wal_replay_lsn()").Scan(&replayLSN)
        if err != nil {
            return err
        }
        if lsnGte(replayLSN, lsn) {
            return nil
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(50 * time.Millisecond):
        }
    }
}

async function waitForReplica(lsn: string, timeoutMs = 5000): Promise<void> {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        const { rows } = await getPool(true).query<{ replay: string }>(
            'SELECT pg_last_wal_replay_lsn() AS replay',
        );
        if (lsnGte(rows[0].replay, lsn)) return;
        await new Promise(r => setTimeout(r, 50));
    }
    throw new Error('replica catch-up timeout');
}

async def wait_for_replica(lsn: str, timeout: float = 5.0) -> None:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        async with get_pool(read_only=True).connection() as conn:
            async with conn.cursor() as cur:
                await cur.execute("SELECT pg_last_wal_replay_lsn()")
                (replay_lsn,) = await cur.fetchone()
        if lsn_gte(str(replay_lsn), lsn):
            return
        await asyncio.sleep(0.05)
    raise TimeoutError("replica catch-up timeout")

Synchronous replication

By default, the primary responds to the client right after writing to the WAL, without waiting for the replica. You can enable synchronous mode:

# postgresql.conf on the primary
synchronous_commit = on
synchronous_standby_names = 'replica1'

In this mode, the primary waits for confirmation from the replica before responding to COMMIT. The guarantee is stronger, but the price is that the latency of every transaction increases by a network round-trip plus the replica's fsync (1–5 ms locally, tens of milliseconds when geo-distributed).

For most tasks, synchronous replication isn't needed: the asynchronous scheme with proper routing covers 99% of cases. Sync makes sense only where the data is critically important and losing even a millisecond's worth of writes is unacceptable.

Failover

If the primary goes down, tools like Patroni or repmgr detect it and promote the replica to a new primary. DNS or a load balancer switches to the new address, and the connection pools reconnect.

During the switchover (usually 10–60 seconds) writes fail with errors. For critical operations, it's worth adding a retry with exponential backoff:

@Retryable(
    retryFor = SQLException.class,
    maxAttempts = 5,
    backoff = @Backoff(delay = 1000, multiplier = 2)
)
public OrderId createOrder(CreateOrderCommand cmd) {
    // ...
}

func withRetry(ctx context.Context, maxAttempts int, fn func() error) error {
    delay := time.Second
    for attempt := range maxAttempts {
        err := fn()
        if err == nil {
            return nil
        }
        if attempt == maxAttempts-1 {
            return err
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
            delay *= 2
        }
    }
    return nil
}

async function withRetry<T>(
    fn: () => Promise<T>,
    maxAttempts = 5,
    delayMs = 1000,
): Promise<T> {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (err) {
            if (attempt === maxAttempts - 1) throw err;
            await new Promise(r => setTimeout(r, delayMs * 2 ** attempt));
        }
    }
    throw new Error('unreachable');
}

from tenacity import retry, stop_after_attempt, wait_exponential
from psycopg import OperationalError

@retry(
    retry=retry_if_exception_type(OperationalError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=1, min=1, max=16),
)
async def create_order(customer_id: int) -> int:
    async with get_pool(read_only=False).connection() as conn:
        async with conn.cursor() as cur:
            await cur.execute(
                "INSERT INTO orders (customer_id) VALUES (%s) RETURNING id",
                (customer_id,),
            )
            row = await cur.fetchone()
            return row[0]

Logical replication

Besides streaming replication, PostgreSQL also has logical replication. It copies not the entire WAL stream but changes for specific tables — you can replicate a subset of tables, change the schema, and route data into another system.

Typical uses:

Piping data from PostgreSQL into an analytical store or Kafka.
Online migration between two PostgreSQL instances.
Multi-master with conflict resolution.

For the "offload the primary via a read-replica" task, ordinary streaming replication is a better fit — it's simpler and faster. Logical has higher overhead.

Monitoring replica lag

You can inspect replica lag directly in PostgreSQL.

On the primary — the state of all replicas:

SELECT application_name, state, replay_lag
FROM pg_stat_replication;

On the replica itself — how much time has passed since the last replayed transaction:

SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;

It's worth setting up an alert if the lag exceeds 30 seconds or if more than 1 GB of WAL has accumulated in the queue — that's a sign of a performance or network problem.

In short

Streaming replication: the primary writes WAL, the replica replays it. Normal lag — 50–500 ms.
The replica takes the load of heavy SELECTs off the primary and serves as a backup in case of failure.
The application keeps two connection pools; the primary/replica choice is made by the read-only flag.
The choice of source must be deferred until the first query, not until the connection is opened.
Read-after-write through the replica doesn't work: the replica lags behind. The fixes are — read from the primary, return data via RETURNING, or wait for catch-up by LSN.
Synchronous replication slows down every COMMIT — it's needed only in rare, critical cases.
Monitoring: pg_stat_replication on the primary, pg_last_xact_replay_timestamp() on the replica, alert when lag > 30 sec.