Full-text search in PostgreSQL — tsvector, GIN, russian

Java · Go · Node · Python

How full-text search works in PostgreSQL: tsvector and tsquery, GIN index, Russian stemming, ranking with ts_rank, match highlighting, and fuzzy search with pg_trgm.

Published 27 июня 2026 ~7 min read Author: Вадим Викулин

← Back to the section

When you need to search over text, the first impulse is to reach for Elasticsearch. But PostgreSQL can search over text out of the box, without a separate service. Let's look at how it works and when it's good enough.

Why plain LIKE doesn't work

The most obvious approach is WHERE body LIKE '%buyer%'. It has two drawbacks.

First — speed. When searching with a leading %, PostgreSQL can't use a regular index and has to scan every row. On a thousand records it's unnoticeable; on a million it's a disaster.

Second — grammar. The query LIKE '%buyer%' won't find the row «buying products», because the word form is different. Users type words in various forms, and LIKE doesn't account for that.

Full-text search solves both problems: it works through an index and understands stemming — reducing words to their root.

tsvector and tsquery — two key types

PostgreSQL stores processed text in a special type called tsvector. It's not just text but a list of lexemes — words reduced to their root form with positions recorded.

SELECT to_tsvector('english', 'Buyers choose products in the catalog');
-- 'buyer':1 'catalog':6 'choos':2 'product':3

«Buyers» became «buyer» and «choose» became «choos» — that's the root that any form of the word maps to. The stop-words «in» and «the» are dropped as insignificant.

A search query is stored in the tsquery type:

SELECT to_tsquery('english', 'buyer & catalog');
-- 'buyer' & 'catalog'

Match checking is done with the @@ operator:

SELECT to_tsvector('english', 'Buyers choose products in the catalog')
    @@ to_tsquery('english', 'buyer & catalog');
-- true

How to store tsvector in a table

Computing to_tsvector() on every query is slow and has no index. The right approach is to store the vector separately and index it.

Generated column (PostgreSQL 12+)

CREATE TABLE article (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title       text NOT NULL,
    body        text NOT NULL,
    search_doc  tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body, '')), 'B')
    ) STORED
);

CREATE INDEX ix_article_search_doc ON article USING gin (search_doc);

GENERATED ALWAYS AS ... STORED means PostgreSQL recomputes the column on every INSERT and UPDATE. You don't need to do it by hand.

setweight('A') marks tokens from the title as more important — this affects ranking. Weights A, B, C, D are available, from highest to lowest.

Trigger (PostgreSQL before 12)

CREATE TRIGGER article_search_update
BEFORE INSERT OR UPDATE ON article
FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(search_doc, 'pg_catalog.english', title, body);

How to search

A basic query with ranking:

SELECT id, title, ts_rank(search_doc, q) AS rank
FROM article, to_tsquery('english', 'buyer & product') q
WHERE search_doc @@ q
ORDER BY rank DESC
LIMIT 20;

ts_rank returns a number — the higher, the better the match. With an equal number of matches, documents where the words appeared in high-weight fields rank higher (A > B > C > D).

plainto_tsquery — for user input

to_tsquery requires proper syntax with the operators &, |, !. If arbitrary user input lands there, the query may fail with an error.

For user input, use plainto_tsquery — it interprets words joined by AND without any special syntax:

WHERE search_doc @@ plainto_tsquery('english', $1);
-- 'buy a new product' → 'buy' & 'new' & 'product'

websearch_to_tsquery — Google-style

Available since PostgreSQL 11. It understands a minus sign for excluding words and quotes for phrase search:

WHERE search_doc @@ websearch_to_tsquery('english', $1);
-- 'buyer -child "new catalog"'

Stemming configuration

PostgreSQL ships with several full-text search configurations:

SELECT cfgname FROM pg_ts_config;
-- simple, english, russian, german, ...

english — English-language stemming. «buyers», «buying», «buy» become one token, and search finds all forms.
simple — lowercasing only, no stemming applied. Useful for codes, identifiers, SKUs.
other languages have their own configurations (russian, german, and so on).

For English-language content you almost always want the english configuration.

Synonyms

If you want a search for «postgres» to also find «postgresql» and «pg», you can create a synonym dictionary:

CREATE TEXT SEARCH DICTIONARY my_synonyms (
    template = synonym,
    synonyms = 'my_synonyms'
);

CREATE TEXT SEARCH CONFIGURATION en_extended (COPY = english);
ALTER TEXT SEARCH CONFIGURATION en_extended
    ALTER MAPPING FOR word, asciiword
    WITH my_synonyms, english_stem;

In the file $SHAREDIR/tsearch_data/my_synonyms.syn:

postgresql postgres
postgres postgres pg

GIN vs GiST

Two index types are suitable for full-text search:

	GIN	GiST
Size on disk	larger	smaller
Search speed	faster	slower
Update speed	slower	faster
False positives	none	yes (rechecks)

In the vast majority of cases GIN is the choice — it's faster on search, and that usually matters more.

GIN accumulates changes in a buffer and flushes them in batches (fastupdate = on by default). For systems with rare inserts and frequent reads, the buffer can be turned off:

ALTER INDEX ix_article_search_doc SET (fastupdate = off);

Match highlighting

ts_headline highlights the found words directly in the text:

SELECT
    id,
    title,
    ts_headline('english', body, q,
        'StartSel=<mark>, StopSel=</mark>, MaxFragments=2, MaxWords=20')
        AS snippet
FROM article, websearch_to_tsquery('english', $1) q
WHERE search_doc @@ q
ORDER BY ts_rank(search_doc, q) DESC
LIMIT 20;

MaxFragments=2 — show no more than two fragments; MaxWords=20 — the length of each fragment in words.

pg_trgm — fuzzy search and substrings

Full-text search won't help if you need to:

find a product by part of its SKU (LIKE '%ABC%'),
or correct a typo in a name (for example, «Robertsen» instead of «Robertson»).

For that there's the pg_trgm extension. It breaks a string into three-letter groups (trigrams) and builds an index over them:

CREATE EXTENSION pg_trgm;

CREATE INDEX ix_customer_name_trgm
    ON customer USING gin (full_name gin_trgm_ops);

-- substring search with an index
SELECT * FROM customer WHERE full_name ILIKE '%robert%';

-- search with tolerance for typos
SELECT * FROM customer
WHERE similarity(full_name, 'robertsen') > 0.4
ORDER BY similarity(full_name, 'robertsen') DESC
LIMIT 10;

A good combination: FTS for long text (article body, description), pg_trgm for short fields with typos (names, brands, codes). On fields longer than 100 characters pg_trgm loses its value — FTS is better there.

Pagination via keyset

OFFSET + LIMIT is slow on deep pages: PostgreSQL still computes and sorts all matches, then discards the requested count. On the 10th page of 50 results that's wasted work.

Instead of OFFSET, use keyset pagination on the pair (rank, id):

SELECT id, title, ts_rank(search_doc, q) AS rank
FROM article, websearch_to_tsquery('english', $1) q
WHERE search_doc @@ q
  AND (ts_rank(search_doc, q), id) < ($prev_rank, $prev_id)
ORDER BY rank DESC, id DESC
LIMIT 20;

$prev_rank and $prev_id are the values of the last record from the previous page.

When PostgreSQL FTS is enough, and when you need Elasticsearch

PostgreSQL FTS handles searching over articles, products, comments, and tickets well — up to about 10 million documents and a load of up to 100 queries per second.

Elasticsearch is worth considering if you need:

volumes significantly larger than 10 million documents with complex ranking,
automatic language detection in multilingual content,
facets, aggregations, analytics over search results,
sophisticated typo tolerance.

Common mistakes

Computing on the fly. Writing WHERE to_tsvector('english', body) @@ ... without a stored column means the query doesn't use an index and scans the whole table.

The simple configuration for natural-language text. Without stemming «buyer» and «buyers» are different words, and search won't find what you need.

No weights. If you don't use setweight, the title and body have the same priority — a document where the word appears only in the body will rank above a document with it in the title.

to_tsquery for user input. If a user enters a special character, the query fails with an error. Use plainto_tsquery or websearch_to_tsquery.

OFFSET on deep pages. Slow and inefficient — use keyset pagination.

In short

PostgreSQL FTS works through an index and understands grammar — unlike LIKE.
tsvector is the indexable representation of text, tsquery is the search query, @@ is the match operator.
The english configuration provides stemming: different forms of the same word count as one token.
Store tsvector in a generated column (GENERATED ALWAYS AS ... STORED) and index it with GIN.
setweight sets field priorities — the title matters more than the body, and that's factored into ranking.
For user input, use plainto_tsquery or websearch_to_tsquery, not to_tsquery.
ts_headline highlights matches in the text.
pg_trgm complements FTS: fuzzy search and substring search on short fields.
For pagination, use keyset on (rank, id), not OFFSET.
FTS is enough up to ~10M documents and ~100 RPS; for larger volumes, use Elasticsearch.