← Back to the section

When an application starts to slow down, the first question is "what's going on in the database?" Without tools, the answer is built on guesswork. PostgreSQL accumulates statistics about every query, every table, every connection — you just need to know where to look.

Top slow queries — pg_stat_statements

By default, PostgreSQL doesn't remember which queries were executed. The pg_stat_statements extension changes that: it accumulates statistics for all queries — how many times each was called, how much time it took in total and on average, how much data it read from disk.

To enable it, you need to add the extension to the configuration and restart the server:

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
CREATE EXTENSION pg_stat_statements;

After that, you can look at the top queries by total time:

SELECT
    substring(query, 1, 80) AS query,
    calls,
    round(total_exec_time::numeric, 0) AS total_ms,
    round(mean_exec_time::numeric, 1) AS mean_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;

Queries are normalized: specific values are replaced with $1, $2 — so the same query with different parameters is grouped into a single row.

If you need to measure the change over a specific period, reset the statistics at the start:

SELECT pg_stat_statements_reset();

Automatic plan for a slow query — auto_explain

pg_stat_statements shows which query is slow. But to understand why, you need the execution plan (EXPLAIN ANALYZE). The auto_explain extension does this automatically: as soon as a query exceeds the time threshold, its plan lands in the PostgreSQL log.

shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_format = json

The log_analyze = on option adds real execution numbers (not just the planner's estimates), but it creates extra overhead — around 5–20% on slow queries. On high-load systems, enable it only during a diagnostic window, not permanently.

Slow query log — log_min_duration_statement

The simplest way to record slow queries in the log is the log_min_duration_statement parameter. Every query that took longer than the threshold ends up in the log file with its text and execution time.

log_min_duration_statement = '500ms'
log_line_prefix = '%m [%p] %q%u@%d '

In production, a reasonable threshold is 500 ms to 1 second. A value of 100 ms creates a flood of logs that's hard to analyze.

The difference from auto_explain: this parameter records only the query text and time. auto_explain writes the full plan — that's what you need when analyzing a specific problem.

What's happening right now — pg_stat_activity

pg_stat_activity shows all active sessions: what's running, how long a transaction has been going, whether there's a lock wait.

SELECT
    pid,
    now() - xact_start AS xact_age,
    now() - query_start AS query_age,
    state,
    wait_event_type, wait_event,
    substring(query, 1, 80) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_age DESC LIMIT 20;

What to watch for:

  • xact_age > 5 minutes — the transaction has been hanging too long and holds resources.
  • wait_event_type = 'Lock' consistently across several processes — a lock queue.
  • state = 'idle in transaction' and xact_age > 1 minute — a connection opened a transaction and forgot to close it. This isn't just sloppiness: such a transaction blocks autovacuum and accumulates bloat.

Who is blocking whom — pg_locks

If you see wait_event_type = 'Lock', you can find the specific culprit:

SELECT
    blocked.pid AS blocked_pid,
    blocking.pid AS blocking_pid,
    blocked.query AS blocked_query,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks bl_lock ON bl_lock.pid = blocked.pid AND NOT bl_lock.granted
JOIN pg_locks gr_lock ON gr_lock.locktype = bl_lock.locktype
    AND gr_lock.granted
JOIN pg_stat_activity blocking ON blocking.pid = gr_lock.pid;

Table and index health — pg_stat_user_tables

PostgreSQL updates rows not in place but by leaving old versions behind (this is MVCC). Over time, "dead" rows accumulate — bloat. pg_stat_user_tables shows how cluttered a table is and when autovacuum last ran:

SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze,
       n_tup_upd, n_tup_hot_upd,
       seq_scan, idx_scan
FROM pg_stat_user_tables;

What matters:

  • n_dead_tup / n_live_tup — the share of dead rows. If it's growing, autovacuum isn't keeping up.
  • last_autovacuum — was it long ago? On active tables it should run every few minutes or hours.
  • n_tup_hot_upd / n_tup_upd — the higher the share of HOT updates, the more efficiently the table works (an update without changing indexes).

There's a separate table for indexes. Indexes that aren't used are candidates for removal:

SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;   -- not used since the last statistics reset

Connections — connection pool

Each connection to PostgreSQL consumes memory and resources. If the application opens more connections than are configured via max_connections, new ones will be rejected. To check the state of connections:

SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

States: active (running a query), idle (waiting for the next one), idle in transaction (in a transaction but silent), idle in transaction (aborted) (transaction aborted but not closed).

Alarm signals:

  • active close to max_connections — the pool is exhausted, subsequent queries will wait or fail.
  • idle in transaction consistently above five — somewhere in the application transactions aren't being closed.

Key metrics on the dashboard

The minimal set of metrics worth surfacing in a monitoring system:

MetricSourceWhen to watch
TPS (transactions/sec)pg_stat_database.xact_commit + xact_rollbackdrop below the usual level
Used connectionspg_stat_activity.state> 80% of max_connections
Long transactionspg_stat_activity.xact_age> 5 minutes
Replica lagpg_stat_replication.replay_lag> 30 seconds
Cache hit ratiopg_stat_database.blks_hit / (blks_hit + blks_read)< 95% — too little RAM
Checkpoint requestspg_stat_bgwriter.checkpoints_reqsteady growth
Database sizepg_database_size()unexpected jump
Last autovacuumpg_stat_user_tables.last_autovacuum> 1 day on active tables
Bloatpgstattuple> 30% on active tables

Cache hit ratio is a special metric: if it's below 95%, PostgreSQL is reading data from disk instead of cache. The first remedy is to add RAM and increase shared_buffers.

Exporting metrics to Prometheus is done via postgres_exporter — it connects to the database and publishes everything listed above in a format that Grafana and Alertmanager understand.

What to enable in production

The baseline minimum that should be everywhere:

  • pg_stat_statements — without it, you can't tell which queries are heavy.
  • log_min_duration_statement = '1s' — slow queries into the log.
  • track_activity_query_size = 4096 — the default of 1024 truncates long queries in pg_stat_activity.
  • track_io_timing = on — adds IO time to query statistics.

Additionally, if you need deeper diagnostics:

  • auto_explain with log_min_duration = '500ms' — plans of slow queries into the log.
  • log_lock_waits = on — log lock waits.
  • log_temp_files = 0 — log queries that spill temporary files to disk (a sign of insufficient memory for sorting).

How to investigate "everything is slow"

The diagnostic order when a degradation report comes in:

  1. pg_stat_activity — are there long queries or locks right now?
  2. pg_stat_statements — top 10 by total time over the last hour. What changed?
  3. pg_stat_database.blks_read rate — did disk I/O grow?
  4. pg_stat_user_tables.n_dead_tup — has bloat accumulated?
  5. pg_stat_replication — is the replica not falling behind?
  6. OSiostat, top, vmstat — is disk, CPU, or RAM maxed out?
  7. Network — is the number of connections from the application growing?

Often the cause is found in the first two steps.

Common mistakes

Not enabling pg_stat_statements — without it, you have to guess at the causes of load. This is the first thing to do after installing PostgreSQL.

Setting log_min_duration_statement = 0 — every query gets logged, the log balloons, and useful information drowns in noise.

Watching only TPS and connections — that's not enough. Long transactions and bloat kill performance gradually, without a visible TPS spike.

Keeping auto_explain.log_analyze = on permanently on loaded systems — it creates overhead. Enable it only during diagnostics.

Resetting pg_stat_statements without warning — history that colleagues rely on is lost.

In short

  • pg_stat_statements — the main tool: top queries by time, call count, IO. Requires shared_preload_libraries and a server restart.
  • auto_explain automatically writes the plan of a slow query to the log; under high traffic, enable it only for diagnostics.
  • log_min_duration_statement = '500ms–1s' in production — a quick way to catch slow queries without extra extensions.
  • pg_stat_activity shows current sessions: long transactions, locks, "forgotten" connections in idle in transaction.
  • pg_stat_user_tables — table state: bloat, last autovacuum, HOT update ratio.
  • Cache hit ratio should be above 95%; below that means not enough RAM.
  • Diagnosing "slow": activity → statements → IO → bloat → replication → OS.
  • EXPLAIN ANALYZE — how to read a query execution plan.
  • VACUUM and bloat — why dead rows accumulate and how to deal with it.
  • Locks — a detailed breakdown of pg_locks and lock types.
  • Indexes — when an index helps and when it hurts.