When an application starts to slow down, the first question is "what's going on in the database?" Without tools, the answer is built on guesswork. PostgreSQL accumulates statistics about every query, every table, every connection — you just need to know where to look.
Top slow queries — pg_stat_statements
By default, PostgreSQL doesn't remember which queries were executed. The pg_stat_statements extension changes that: it accumulates statistics for all queries — how many times each was called, how much time it took in total and on average, how much data it read from disk.
To enable it, you need to add the extension to the configuration and restart the server:
# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
CREATE EXTENSION pg_stat_statements;
After that, you can look at the top queries by total time:
SELECT
substring(query, 1, 80) AS query,
calls,
round(total_exec_time::numeric, 0) AS total_ms,
round(mean_exec_time::numeric, 1) AS mean_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;
Queries are normalized: specific values are replaced with $1, $2 — so the same query with different parameters is grouped into a single row.
If you need to measure the change over a specific period, reset the statistics at the start:
SELECT pg_stat_statements_reset();
Automatic plan for a slow query — auto_explain
pg_stat_statements shows which query is slow. But to understand why, you need the execution plan (EXPLAIN ANALYZE). The auto_explain extension does this automatically: as soon as a query exceeds the time threshold, its plan lands in the PostgreSQL log.
shared_preload_libraries = 'pg_stat_statements,auto_explain'
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_format = json
The log_analyze = on option adds real execution numbers (not just the planner's estimates), but it creates extra overhead — around 5–20% on slow queries. On high-load systems, enable it only during a diagnostic window, not permanently.
Slow query log — log_min_duration_statement
The simplest way to record slow queries in the log is the log_min_duration_statement parameter. Every query that took longer than the threshold ends up in the log file with its text and execution time.
log_min_duration_statement = '500ms'
log_line_prefix = '%m [%p] %q%u@%d '
In production, a reasonable threshold is 500 ms to 1 second. A value of 100 ms creates a flood of logs that's hard to analyze.
The difference from auto_explain: this parameter records only the query text and time. auto_explain writes the full plan — that's what you need when analyzing a specific problem.
What's happening right now — pg_stat_activity
pg_stat_activity shows all active sessions: what's running, how long a transaction has been going, whether there's a lock wait.
SELECT
pid,
now() - xact_start AS xact_age,
now() - query_start AS query_age,
state,
wait_event_type, wait_event,
substring(query, 1, 80) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_age DESC LIMIT 20;
What to watch for:
xact_age > 5 minutes— the transaction has been hanging too long and holds resources.wait_event_type = 'Lock'consistently across several processes — a lock queue.state = 'idle in transaction'andxact_age > 1 minute— a connection opened a transaction and forgot to close it. This isn't just sloppiness: such a transaction blocks autovacuum and accumulates bloat.
Who is blocking whom — pg_locks
If you see wait_event_type = 'Lock', you can find the specific culprit:
SELECT
blocked.pid AS blocked_pid,
blocking.pid AS blocking_pid,
blocked.query AS blocked_query,
blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks bl_lock ON bl_lock.pid = blocked.pid AND NOT bl_lock.granted
JOIN pg_locks gr_lock ON gr_lock.locktype = bl_lock.locktype
AND gr_lock.granted
JOIN pg_stat_activity blocking ON blocking.pid = gr_lock.pid;
Table and index health — pg_stat_user_tables
PostgreSQL updates rows not in place but by leaving old versions behind (this is MVCC). Over time, "dead" rows accumulate — bloat. pg_stat_user_tables shows how cluttered a table is and when autovacuum last ran:
SELECT relname, n_live_tup, n_dead_tup,
last_autovacuum, last_autoanalyze,
n_tup_upd, n_tup_hot_upd,
seq_scan, idx_scan
FROM pg_stat_user_tables;
What matters:
n_dead_tup / n_live_tup— the share of dead rows. If it's growing, autovacuum isn't keeping up.last_autovacuum— was it long ago? On active tables it should run every few minutes or hours.n_tup_hot_upd / n_tup_upd— the higher the share of HOT updates, the more efficiently the table works (an update without changing indexes).
There's a separate table for indexes. Indexes that aren't used are candidates for removal:
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0; -- not used since the last statistics reset
Connections — connection pool
Each connection to PostgreSQL consumes memory and resources. If the application opens more connections than are configured via max_connections, new ones will be rejected. To check the state of connections:
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
States: active (running a query), idle (waiting for the next one), idle in transaction (in a transaction but silent), idle in transaction (aborted) (transaction aborted but not closed).
Alarm signals:
activeclose tomax_connections— the pool is exhausted, subsequent queries will wait or fail.idle in transactionconsistently above five — somewhere in the application transactions aren't being closed.
Key metrics on the dashboard
The minimal set of metrics worth surfacing in a monitoring system:
| Metric | Source | When to watch |
|---|---|---|
| TPS (transactions/sec) | pg_stat_database.xact_commit + xact_rollback | drop below the usual level |
| Used connections | pg_stat_activity.state | > 80% of max_connections |
| Long transactions | pg_stat_activity.xact_age | > 5 minutes |
| Replica lag | pg_stat_replication.replay_lag | > 30 seconds |
| Cache hit ratio | pg_stat_database.blks_hit / (blks_hit + blks_read) | < 95% — too little RAM |
| Checkpoint requests | pg_stat_bgwriter.checkpoints_req | steady growth |
| Database size | pg_database_size() | unexpected jump |
| Last autovacuum | pg_stat_user_tables.last_autovacuum | > 1 day on active tables |
| Bloat | pgstattuple | > 30% on active tables |
Cache hit ratio is a special metric: if it's below 95%, PostgreSQL is reading data from disk instead of cache. The first remedy is to add RAM and increase shared_buffers.
Exporting metrics to Prometheus is done via postgres_exporter — it connects to the database and publishes everything listed above in a format that Grafana and Alertmanager understand.
What to enable in production
The baseline minimum that should be everywhere:
pg_stat_statements— without it, you can't tell which queries are heavy.log_min_duration_statement = '1s'— slow queries into the log.track_activity_query_size = 4096— the default of 1024 truncates long queries inpg_stat_activity.track_io_timing = on— adds IO time to query statistics.
Additionally, if you need deeper diagnostics:
auto_explainwithlog_min_duration = '500ms'— plans of slow queries into the log.log_lock_waits = on— log lock waits.log_temp_files = 0— log queries that spill temporary files to disk (a sign of insufficient memory for sorting).
How to investigate "everything is slow"
The diagnostic order when a degradation report comes in:
pg_stat_activity— are there long queries or locks right now?pg_stat_statements— top 10 by total time over the last hour. What changed?pg_stat_database.blks_readrate — did disk I/O grow?pg_stat_user_tables.n_dead_tup— has bloat accumulated?pg_stat_replication— is the replica not falling behind?- OS —
iostat,top,vmstat— is disk, CPU, or RAM maxed out? - Network — is the number of connections from the application growing?
Often the cause is found in the first two steps.
Common mistakes
Not enabling pg_stat_statements — without it, you have to guess at the causes of load. This is the first thing to do after installing PostgreSQL.
Setting log_min_duration_statement = 0 — every query gets logged, the log balloons, and useful information drowns in noise.
Watching only TPS and connections — that's not enough. Long transactions and bloat kill performance gradually, without a visible TPS spike.
Keeping auto_explain.log_analyze = on permanently on loaded systems — it creates overhead. Enable it only during diagnostics.
Resetting pg_stat_statements without warning — history that colleagues rely on is lost.
In short
pg_stat_statements— the main tool: top queries by time, call count, IO. Requiresshared_preload_librariesand a server restart.auto_explainautomatically writes the plan of a slow query to the log; under high traffic, enable it only for diagnostics.log_min_duration_statement = '500ms–1s'in production — a quick way to catch slow queries without extra extensions.pg_stat_activityshows current sessions: long transactions, locks, "forgotten" connections inidle in transaction.pg_stat_user_tables— table state: bloat, last autovacuum, HOT update ratio.- Cache hit ratio should be above 95%; below that means not enough RAM.
- Diagnosing "slow": activity → statements → IO → bloat → replication → OS.
What to read next
- EXPLAIN ANALYZE — how to read a query execution plan.
- VACUUM and bloat — why dead rows accumulate and how to deal with it.
- Locks — a detailed breakdown of
pg_locksand lock types. - Indexes — when an index helps and when it hurts.