← Back to the section

A query runs slowly, but it's not clear why. Add an index? Rewrite the JOIN? Increase memory? One command answers all these questions — EXPLAIN ANALYZE. It shows exactly what PostgreSQL did while running the query and how much time it spent on each step.

How to run it

The minimal option is to just add EXPLAIN ANALYZE before the query:

EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'PAID';

The best option is to add BUFFERS so you can see how much data came from cache and how much from disk:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE status = 'PAID';

An important point: EXPLAIN ANALYZE actually runs the query. This is needed for an accurate measurement. If you want to check the plan for an UPDATE or DELETE without applying the changes, wrap it in a transaction:

BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = 'X' WHERE id = 123;
ROLLBACK;

How to read the output

Here is a sample output:

Hash Join  (cost=12.34..567.89 rows=1000 width=64) (actual time=1.234..56.789 rows=987 loops=1)
  Hash Cond: (a.id = b.a_id)
  Buffers: shared hit=1234 read=56
  ->  Seq Scan on a
  ->  Hash
        ->  Seq Scan on b

What each part means:

  • cost=START..TOTAL — PostgreSQL's estimate in arbitrary units. These are not milliseconds.
  • rows=N — how many rows PostgreSQL expected to get.
  • actual time=START..TOTAL — the real time in milliseconds.
  • actual rows=N loops=M — how many rows the node actually returned and how many times it ran.
  • Buffers: shared hit=X read=Yhit is data from cache, read is disk access.

You need to read the plan bottom-up. The lower nodes are the inner operations that run first. The root is the final result.

The first thing to look at is the discrepancy between rows= (estimate) and actual rows= (fact). If the estimate is ten or more times smaller than the real number of rows, PostgreSQL is building the plan on wrong data. The ANALYZE command on the table helps — it updates the statistics.

The second important point is loops. If a node ran many times, the real time needs to be multiplied: loops=10000 × actual time=0.5..1.2 means 12 seconds inside a single Nested Loop.

Table scan nodes

PostgreSQL chooses how to read a table based on the size of the result set, the presence of indexes and the cost settings.

Seq Scan — a full pass

Seq Scan on orders
  Filter: (status = 'PAID')
  Rows Removed by Filter: 850000

PostgreSQL reads the whole table and discards the unneeded rows. This is fine for small tables and for queries that select most of the rows. It's bad when Rows Removed by Filter is millions of times larger than the result — that's a sign you need an index on the filter field.

Index Scan — a pass over the index

Index Scan using ix_orders_status on orders
  Index Cond: (status = 'PAID')

PostgreSQL walks the B-tree index, then reads the needed rows from the table. Pay attention to the difference between Index Cond: and Filter:. If a condition ended up in Filter: rather than Index Cond:, that means the index on this field is not being used as a search key — and the rows are still being scanned one by one.

Index Only Scan — index only, no table

Index Only Scan using ix_orders_status_created on orders
  Index Cond: (status = 'PAID')
  Heap Fetches: 12

If all the needed columns are in the index, PostgreSQL may not touch the table at all. This is the fastest option.

If Heap Fetches is greater than zero, PostgreSQL still goes to the table — the visibility map is stale and doesn't allow skipping the check. VACUUM on the table helps.

Bitmap Scan — a two-phase read

Bitmap Heap Scan on orders
  Recheck Cond: (status = 'PAID')
  ->  Bitmap Index Scan on ix_orders_status

A two-phase process: first a bitmap of the pages that contain the needed rows is built, then those pages are read in order. This is more efficient than random reads for a medium-sized result set. A Bitmap also lets you combine several indexes (BitmapAnd, BitmapOr).

Recheck Cond: means the condition is re-checked — this happens when the bitmap stores only pages, not specific rows (a lossy bitmap due to lack of memory).

Joining tables

Nested Loop — a loop inside a loop

Nested Loop  (rows=100 loops=1)
  ->  Index Scan on a  (rows=100)
  ->  Index Scan on b  (rows=1 loops=100)
        Index Cond: (b.a_id = a.id)

For each row from the outer table, a match is looked up in the inner one. This works well when the outer result is small and the inner table has an index on the join key.

A common trap: loops=1000000 at actual time=0.5 — that's 500 seconds inside a single node. If the inner table is large and has no index, a Nested Loop degrades.

Hash Join — a hash table in memory

Hash Join
  Hash Cond: (a.id = b.a_id)
  ->  Seq Scan on a
  ->  Hash
        ->  Seq Scan on b

The smaller table is loaded into a hash table in memory, then for each row from the larger one a fast lookup is done. Good for joining two large tables without suitable indexes.

If the output shows Batches: 2 or more — the hash table didn't fit in memory and PostgreSQL spilled to disk. Increasing work_mem for the session helps:

SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

Merge Join — merging two sorted streams

Merge Join
  Merge Cond: (a.id = b.a_id)
  ->  Sort ...
  ->  Index Scan on ix_b_a_id

Both streams are sorted by the join key and merged like a zipper. Efficient for large volumes of data when both streams are already sorted (there's a suitable index) or the sort cost is justified.

Sorting and grouping

Sort — how to tell that data is going to disk

Sort
  Sort Key: created_at DESC
  Sort Method: external merge  Disk: 8192kB

external merge Disk: means the sort didn't fit in memory and went to disk. This is slow. There are two solutions: increase work_mem, or add an index with the needed sort order so the Sort node disappears from the plan.

If the sort is in memory, you'll see quicksort or top-N heapsort (for ORDER BY ... LIMIT N).

Grouping

  • Aggregate — simple aggregates: count(*), sum().
  • HashAggregateGROUP BY via a hash table in memory. Fast, as long as the data fits.
  • GroupAggregateGROUP BY after sorting. Used when HashAggregate doesn't fit in work_mem.

Parallel plan

Gather
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Seq Scan on orders

PostgreSQL can split the work across several parallel workers. The count is set by the max_parallel_workers_per_gather parameter (2 by default). Small tables are not parallelized — the threshold is controlled by min_parallel_table_scan_size (8 MB by default).

Tools for complex plans

When a plan has five or more levels, reading it in text form is inconvenient. There are specialized tools for this:

  • explain.depesz.com — color-highlights the nodes, shows the most expensive steps.
  • explain.dalibo.com (PEV2) — a graphical plan tree right in the browser.
  • pg_stat_statements — an extension for collecting query statistics in production, without a manual EXPLAIN.

For PEV2 it's convenient to get the plan in JSON format:

EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) SELECT ...;

In short

  • EXPLAIN (ANALYZE, BUFFERS) — the standard option: you can see the real time and IO.
  • EXPLAIN ANALYZE runs the query; for DML use BEGIN ... ROLLBACK.
  • Read the plan bottom-up — the leaves run first.
  • A node's real time = actual time × loops — don't forget to multiply.
  • A discrepancy of 10x or more between rows= and actual rows= → run ANALYZE on the table.
  • A large Buffers: read= → data is coming from disk, you need an index or more cache.
  • A condition in Filter: instead of Index Cond: → the index isn't working as a search key.
  • Heap Fetches > 0 on an Index Only Scan → run VACUUM.
  • Batches > 1 in a Hash Join or external merge Disk: in a Sort → increase work_mem.
  • A Nested Loop with a large loops → check for an index on the join key.
  • PostgreSQL index types — what should appear in the plan with the right index.
  • Composite indexes — how an Index Only Scan depends on the index composition.
  • Selectivity — why rows= differs from actual rows=.
  • VACUUM and autovacuum — how to eliminate Heap Fetches.
  • Monitoring and slow queries — pg_stat_statements for production.