How to Choose an Index and Estimate Selectivity in PostgreSQL

Java · Go · Node · Python

In plain words: what selectivity is, why PostgreSQL sometimes ignores an index and does a seq scan, how to read pg_stats and EXPLAIN ANALYZE, when a composite index helps, and how to keep statistics up to date.

Published 27 июня 2026 ~7 min read Author: Вадим Викулин

← Back to the section

"The index is there, but the query is still slow" — one of the most common complaints when working with PostgreSQL. Almost always the cause is selectivity: PostgreSQL itself decides whether using an index pays off or whether it's simpler to read the whole table. Let's look at how it makes that decision and what to do when it decides wrong.

Why PostgreSQL sometimes doesn't use an index

Imagine a table of a million orders. Each order has a status field with five values: NEW, PAID, SHIPPED, DELIVERED, CANCELLED. Most orders are DELIVERED, roughly 62%.

You added an index on status and run WHERE status = 'DELIVERED'. PostgreSQL looks at the index and sees: it will lead to 620,000 rows out of 1,000,000. To fetch each of them, it has to jump to a random spot on disk. That's slower than simply reading the whole table sequentially. So PostgreSQL chooses a seq scan — and it's right.

The key term here is selectivity.

What selectivity is

Selectivity is the fraction of unique values in a column relative to the total number of rows:

selectivity = number_of_unique_values / total_rows

Column	Unique	Total rows	Selectivity
`id` (primary key)	1,000,000	1,000,000	1.0 — perfect
`email`	998,000	1,000,000	0.998 — excellent
`customer_id`	50,000	1,000,000	0.05 — medium
`status` (5 values)	5	1,000,000	0.000005 — low
`is_deleted`	2	1,000,000	0.000002 — almost zero

The higher the selectivity, the fewer rows the condition returns — and the more useful the index.

An index on is_deleted or status is practically useless: the query WHERE is_deleted = false will return 98% of the table. PostgreSQL will sensibly prefer a seq scan.

How PostgreSQL makes the decision

The planner estimates the percentage of rows the query condition will return and compares the cost of two paths:

Fraction of rows by condition	Index?
less than 1%	yes, definitely
1–5%	most likely yes
5–20%	depends on row size and cache
more than 20%	most likely no
more than 50%	definitely no

Two more parameters affect the decision: how much data PostgreSQL considers cached (effective_cache_size) and how much more expensive a random read is than a sequential one (random_page_cost).

Tuning random_page_cost on SSD

By default random_page_cost = 4.0 — this value was calibrated for hard drives, where a random read is four times slower than a sequential one. On SSD the difference almost disappears.

If you don't update this parameter, PostgreSQL will systematically choose a seq scan where an index is actually faster.

ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

The value 1.1 is the standard recommendation for SSD.

How to view column statistics

PostgreSQL stores statistics for each column in the system view pg_stats:

SELECT
    attname,
    n_distinct,
    most_common_vals,
    most_common_freqs,
    null_frac
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'orders';

What the fields mean:

n_distinct — an estimate of the number of unique values. A positive number is an absolute count; a negative one (-0.05) is a fraction of the number of rows (5% unique).
most_common_vals — an array of the most frequent values.
most_common_freqs — frequencies from 0 to 1 for each of them.
null_frac — the fraction of NULL values.

Example result for the status column:

attname           | status
n_distinct        | 5
most_common_vals  | {DELIVERED,SHIPPED,NEW,CANCELLED,PAID}
most_common_freqs | {0.62, 0.18, 0.10, 0.06, 0.04}

From these statistics it's clear: WHERE status = 'DELIVERED' will return 62% of rows — seq scan. But WHERE status = 'PAID' will return 4% — an index can help.

A composite index where a single-column one is useless

The status column with five values is of little use for indexing on its own. But if you add a second column with high selectivity, the picture changes.

A composite index (status, created_at):

WHERE status = 'DELIVERED' — 62% of rows → seq scan.
WHERE status = 'DELIVERED' AND created_at > now() - interval '1 day' — roughly 62% × 0.5% = 0.3% → the index is a perfect fit.

The selectivities of the two conditions multiply. That's why a composite index can work where a single-column one is useless — but both conditions must be present in the query.

How to check what actually happens

EXPLAIN ANALYZE shows both the plan and the actual execution:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'PAID';

Seq Scan on orders  (cost=0.00..18334.00 rows=400000 width=...) (actual time=0.012..89.123 rows=42137 loops=1)
  Filter: (status = 'PAID')
  Rows Removed by Filter: 957863

The important thing here is to compare two numbers:

rows=400000 — the planner's estimate.
actual rows=42137 — the real number of rows.

A discrepancy of nearly ten times means the statistics are stale and the planner made its decision based on wrong data.

When statistics go stale and what to do

Run ANALYZE manually

Autovacuum triggers statistics collection automatically when about 10% of the rows in a table have changed. During a bulk data load it can't keep up. After a large load it's worth running it manually:

ANALYZE orders;                            -- one table
ANALYZE orders (status, created_at);      -- only the needed columns
ANALYZE;                                   -- all tables in the schema

Increase the depth of statistics

By default PostgreSQL keeps statistics for the 100 most frequent values (statistics_target = 100). If a column has rare peaks — for example, country_code with 99% of values being RU plus another 50 rare countries — they won't make it into the statistics, and the planner will misjudge their selectivity.

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

After this PostgreSQL will keep up to 1000 of the most frequent values for that column.

Correlated columns

It happens that two columns are related to each other: for example, country_code and currency. The country Russia almost always means the ruble. By default PostgreSQL treats columns as independent and computes selectivity as the product of the two conditions — this overestimates by tens of times.

For such cases there is extended statistics:

CREATE STATISTICS stats_orders_geo (dependencies, ndistinct)
    ON country_code, currency FROM orders;
ANALYZE orders;

After this the planner accounts for the dependency between the columns and estimates the selectivity of WHERE country_code = 'RU' AND currency = 'RUB' correctly — not as a product, but as the probability of one of the conditions.

Common mistakes

An index on a boolean column. WHERE is_deleted = false is almost always 98%+ of the rows. An index won't help. If you need partial coverage, a partial index is better: CREATE INDEX ON orders (created_at) WHERE deleted_at IS NULL — it covers only the current rows.

Ignoring a discrepancy between rows and actual rows. A tenfold discrepancy in EXPLAIN ANALYZE is a signal that the statistics are stale. The fix: ANALYZE.

Leaving random_page_cost = 4.0 on SSD. The planner will underestimate the usefulness of indexes. A single ALTER SYSTEM fixes it.

Not running ANALYZE after a bulk load. Autovacuum can't keep up with a fast insert. After ETL or a migration — run it manually.

In short

Selectivity = unique values / total rows. Higher is better for an index.
An index on is_deleted or status with 5 values is almost useless: the planner will choose a seq scan.
Rule of thumb: less than 5% of rows by condition → index, more than 20% → seq scan.
On SSD it's mandatory: random_page_cost = 1.1 (the default 4.0 is for a hard drive).
A composite index (status, created_at) can work where (status) is useless.
pg_stats shows what the planner knows: n_distinct, most_common_freqs.
A 10x discrepancy between rows and actual rows in EXPLAIN ANALYZE → run ANALYZE.
After a bulk load — ANALYZE manually; autovacuum can't keep up.
For columns with rare peaks: increase statistics_target to 1000.
Correlated columns (country_code + currency) → CREATE STATISTICS ... (dependencies).