"The index is there, but the query is still slow" — one of the most common complaints when working with PostgreSQL. Almost always the cause is selectivity: PostgreSQL itself decides whether using an index pays off or whether it's simpler to read the whole table. Let's look at how it makes that decision and what to do when it decides wrong.
Why PostgreSQL sometimes doesn't use an index
Imagine a table of a million orders. Each order has a status field with five values: NEW, PAID, SHIPPED, DELIVERED, CANCELLED. Most orders are DELIVERED, roughly 62%.
You added an index on status and run WHERE status = 'DELIVERED'. PostgreSQL looks at the index and sees: it will lead to 620,000 rows out of 1,000,000. To fetch each of them, it has to jump to a random spot on disk. That's slower than simply reading the whole table sequentially. So PostgreSQL chooses a seq scan — and it's right.
The key term here is selectivity.
What selectivity is
Selectivity is the fraction of unique values in a column relative to the total number of rows:
selectivity = number_of_unique_values / total_rows
| Column | Unique | Total rows | Selectivity |
|---|---|---|---|
id (primary key) | 1,000,000 | 1,000,000 | 1.0 — perfect |
email | 998,000 | 1,000,000 | 0.998 — excellent |
customer_id | 50,000 | 1,000,000 | 0.05 — medium |
status (5 values) | 5 | 1,000,000 | 0.000005 — low |
is_deleted | 2 | 1,000,000 | 0.000002 — almost zero |
The higher the selectivity, the fewer rows the condition returns — and the more useful the index.
An index on is_deleted or status is practically useless: the query WHERE is_deleted = false will return 98% of the table. PostgreSQL will sensibly prefer a seq scan.
How PostgreSQL makes the decision
The planner estimates the percentage of rows the query condition will return and compares the cost of two paths:
| Fraction of rows by condition | Index? |
|---|---|
| less than 1% | yes, definitely |
| 1–5% | most likely yes |
| 5–20% | depends on row size and cache |
| more than 20% | most likely no |
| more than 50% | definitely no |
Two more parameters affect the decision: how much data PostgreSQL considers cached (effective_cache_size) and how much more expensive a random read is than a sequential one (random_page_cost).
Tuning random_page_cost on SSD
By default random_page_cost = 4.0 — this value was calibrated for hard drives, where a random read is four times slower than a sequential one. On SSD the difference almost disappears.
If you don't update this parameter, PostgreSQL will systematically choose a seq scan where an index is actually faster.
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
The value 1.1 is the standard recommendation for SSD.
How to view column statistics
PostgreSQL stores statistics for each column in the system view pg_stats:
SELECT
attname,
n_distinct,
most_common_vals,
most_common_freqs,
null_frac
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'orders';
What the fields mean:
n_distinct— an estimate of the number of unique values. A positive number is an absolute count; a negative one (-0.05) is a fraction of the number of rows (5% unique).most_common_vals— an array of the most frequent values.most_common_freqs— frequencies from 0 to 1 for each of them.null_frac— the fraction of NULL values.
Example result for the status column:
attname | status
n_distinct | 5
most_common_vals | {DELIVERED,SHIPPED,NEW,CANCELLED,PAID}
most_common_freqs | {0.62, 0.18, 0.10, 0.06, 0.04}
From these statistics it's clear: WHERE status = 'DELIVERED' will return 62% of rows — seq scan. But WHERE status = 'PAID' will return 4% — an index can help.
A composite index where a single-column one is useless
The status column with five values is of little use for indexing on its own. But if you add a second column with high selectivity, the picture changes.
A composite index (status, created_at):
WHERE status = 'DELIVERED'— 62% of rows → seq scan.WHERE status = 'DELIVERED' AND created_at > now() - interval '1 day'— roughly 62% × 0.5% = 0.3% → the index is a perfect fit.
The selectivities of the two conditions multiply. That's why a composite index can work where a single-column one is useless — but both conditions must be present in the query.
How to check what actually happens
EXPLAIN ANALYZE shows both the plan and the actual execution:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'PAID';
Seq Scan on orders (cost=0.00..18334.00 rows=400000 width=...) (actual time=0.012..89.123 rows=42137 loops=1)
Filter: (status = 'PAID')
Rows Removed by Filter: 957863
The important thing here is to compare two numbers:
rows=400000— the planner's estimate.actual rows=42137— the real number of rows.
A discrepancy of nearly ten times means the statistics are stale and the planner made its decision based on wrong data.
When statistics go stale and what to do
Run ANALYZE manually
Autovacuum triggers statistics collection automatically when about 10% of the rows in a table have changed. During a bulk data load it can't keep up. After a large load it's worth running it manually:
ANALYZE orders; -- one table
ANALYZE orders (status, created_at); -- only the needed columns
ANALYZE; -- all tables in the schema
Increase the depth of statistics
By default PostgreSQL keeps statistics for the 100 most frequent values (statistics_target = 100). If a column has rare peaks — for example, country_code with 99% of values being RU plus another 50 rare countries — they won't make it into the statistics, and the planner will misjudge their selectivity.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
After this PostgreSQL will keep up to 1000 of the most frequent values for that column.
Correlated columns
It happens that two columns are related to each other: for example, country_code and currency. The country Russia almost always means the ruble. By default PostgreSQL treats columns as independent and computes selectivity as the product of the two conditions — this overestimates by tens of times.
For such cases there is extended statistics:
CREATE STATISTICS stats_orders_geo (dependencies, ndistinct)
ON country_code, currency FROM orders;
ANALYZE orders;
After this the planner accounts for the dependency between the columns and estimates the selectivity of WHERE country_code = 'RU' AND currency = 'RUB' correctly — not as a product, but as the probability of one of the conditions.
Common mistakes
An index on a boolean column. WHERE is_deleted = false is almost always 98%+ of the rows. An index won't help. If you need partial coverage, a partial index is better: CREATE INDEX ON orders (created_at) WHERE deleted_at IS NULL — it covers only the current rows.
Ignoring a discrepancy between rows and actual rows. A tenfold discrepancy in EXPLAIN ANALYZE is a signal that the statistics are stale. The fix: ANALYZE.
Leaving random_page_cost = 4.0 on SSD. The planner will underestimate the usefulness of indexes. A single ALTER SYSTEM fixes it.
Not running ANALYZE after a bulk load. Autovacuum can't keep up with a fast insert. After ETL or a migration — run it manually.
In short
- Selectivity = unique values / total rows. Higher is better for an index.
- An index on
is_deletedorstatuswith 5 values is almost useless: the planner will choose a seq scan. - Rule of thumb: less than 5% of rows by condition → index, more than 20% → seq scan.
- On SSD it's mandatory:
random_page_cost = 1.1(the default 4.0 is for a hard drive). - A composite index
(status, created_at)can work where(status)is useless. pg_statsshows what the planner knows:n_distinct,most_common_freqs.- A 10x discrepancy between
rowsandactual rowsinEXPLAIN ANALYZE→ runANALYZE. - After a bulk load —
ANALYZEmanually; autovacuum can't keep up. - For columns with rare peaks: increase
statistics_targetto 1000. - Correlated columns (
country_code+currency) →CREATE STATISTICS ... (dependencies).
What to read next
- Composite Indexes in PostgreSQL — column order and the leftmost prefix.
- EXPLAIN ANALYZE — how to read a query plan in detail.
- VACUUM and autovacuum — when and how statistics are updated.