B-Tree Index Optimization: Execution Plan Diagnostics & Query Tuning #
B-tree indexes are PostgreSQL’s default access method, yet their performance degrades silently under skewed data distributions, misordered columns, or unchecked table bloat.
When the Optimizer Chooses a B-Tree Scan #
The planner selects a B-tree scan when the filter predicate matches an indexed column using equality or range operators (=, <, >, BETWEEN, IN on small sets). The decision is cost-driven: PostgreSQL computes an expected cost for the index path against the expected cost of a sequential scan, factoring in random_page_cost, seq_page_cost, and the table’s selectivity estimate derived from pg_statistic.
Three conditions trigger a B-tree path over a sequential scan:
- The predicate matches the index’s leftmost prefix — partial prefix matches still apply but only the matching prefix columns drive the index traversal.
- The planner’s row-count estimate for the filtered result set is small enough that random-access page fetches cost less than a full table sweep.
effective_cache_sizeis set high enough that the planner trusts the OS page cache to hold index pages, reducing the cost penalty of random I/O.
When none of these conditions hold, the planner falls back to a sequential scan or a Bitmap Index Scan. Understanding when the optimizer reaches for an index — and when it doesn’t — is the foundation of index tuning strategy. At least one sibling technique, partial index implementation, narrows the index to a filtered subset of rows and can make the cost calculation tip decisively toward an index path even when selectivity is moderate.
Annotated EXPLAIN Node Breakdown #
Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) before touching any index. The node fields that matter most for B-tree diagnosis:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT user_id, status, created_at
FROM orders
WHERE status = 'pending'
AND created_at >= '2024-01-01';
Index Scan using idx_orders_status on orders
(cost=0.43..142.89 rows=1200 width=48) -- planner estimate
(actual time=0.052..18.431 rows=1185 loops=1) -- measured at runtime
Index Cond: (status = 'pending'::text) -- index-driven predicate
Filter: (created_at >= '2024-01-01'::date) -- post-index filter (heap visit)
Rows Removed by Filter: 452 -- pages fetched and discarded
Heap Fetches: 1637 -- random heap I/O — target: 0
Buffers: shared hit=12 read=1405 -- 1405 pages fetched from disk
Execution Time: 18.912 ms
Key signals:
Index CondvsFilter: Predicates underIndex Condare resolved inside the B-tree. Predicates underFilterare applied after visiting the heap — they generate random I/O even when the row is eventually discarded.Heap Fetches: Every non-zero value means the engine visited the heap page to retrieve columns not stored in the index. The goal after tuning is zero.Buffers: shared read: High values with lowshared hitindicate the working set is not in the buffer pool, amplifying latency on every index scan.- Row estimate accuracy:
rows=1200vsactual rows=1185is close here. A factor-of-10 or larger divergence signals stale statistics or data skew — the root cause of many plan regressions. The cost estimation models page explains how PostgreSQL derives these estimates from histogram buckets and MCV lists.
Algorithm Internals: B-Tree Traversal and the Leftmost Prefix Rule #
A B-tree index is a balanced tree of sorted key pages. Traversal always begins at the root, descends through internal nodes comparing key values, and lands on a leaf page that holds pointers (ctid) to heap rows. The traversal is O(log N) in page reads — extremely fast for small result sets.
The leftmost prefix constraint is fundamental: the B-tree can only drive traversal on a contiguous prefix of the indexed columns, starting from the first. An index on (status, created_at, user_id) supports these access patterns:
| Predicate | Index usage |
|---|---|
WHERE status = 'pending' |
Full prefix on status |
WHERE status = 'pending' AND created_at >= '2024-01-01' |
Full prefix, range on second col |
WHERE created_at >= '2024-01-01' |
No prefix match — seq scan |
WHERE status = 'pending' AND user_id = 42 |
Index on status only; user_id skipped |
A range predicate on the leading column (WHERE created_at >= '2024-01-01' without equality on status) forces the planner to scan a contiguous range of leaf pages bounded only by the range start. That consumes far more leaf pages than equality filtering first.
Before/after plan comparison — fixing column order:
Before (index on (created_at, status), range predicate leads):
Index Scan using idx_orders_created_status on orders
(cost=0.43..298.71 rows=1200 width=48)
Rows Removed by Filter: 8241 -- large post-index filter
Heap Fetches: 9441
After (index on (status, created_at), equality predicate leads):
Index Scan using idx_orders_status_created on orders
(cost=0.43..142.89 rows=1200 width=48)
Rows Removed by Filter: 452
Heap Fetches: 1637
The corrected column order eliminates 88% of heap fetches before the INCLUDE optimization even applies.
Memory, I/O, and Resource Behavior #
B-tree index scans do not consume work_mem during traversal — the tree walk itself is buffer-pool resident. However, two resource behaviors deserve attention:
Buffer pool pressure from high Heap Fetches. Each heap fetch is a random page read. On spinning disks, random I/O cost is 4–10× sequential. Even on SSDs, a plan generating 10,000 heap fetches across a large table will evict useful pages from shared_buffers, degrading concurrent queries. The ratio of Buffers: shared read to shared hit in EXPLAIN BUFFERS output directly quantifies this pressure.
Index bloat from dead tuples. PostgreSQL’s MVCC model does not immediately reclaim space when rows are updated or deleted. Dead tuple versions accumulate in both the heap and index leaf pages. Index bloat forces the planner to traverse more leaf pages per scan — even when the live row density is low — and inflates cost estimates. Monitor bloat with:
SELECT
indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;
Low idx_tup_fetch relative to idx_tup_read is a secondary signal: the engine traversed many index entries but retrieved far fewer heap rows — consistent with high bloat or a poorly selective index condition.
random_page_cost calibration. The planner’s decision to use a B-tree versus a sequential scan is sensitive to random_page_cost. The default is 4.0, calibrated for spinning disk. For NVMe-backed instances, setting random_page_cost = 1.1 to 1.5 corrects the planner’s over-weighting of random I/O and recovers index paths that were incorrectly suppressed:
SET LOCAL random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
Step-by-Step Tuning Workflow #
Step 1 — Capture the baseline plan.
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT user_id, status, created_at
FROM orders
WHERE status = 'pending'
AND created_at >= '2024-01-01';
Record scan node type, Heap Fetches, Buffers: shared read, and row estimate accuracy.
Step 2 — Inspect column cardinality before redesigning the index.
SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders'
AND attname IN ('status', 'created_at', 'user_id');
n_distinct: negative values mean PostgreSQL expresses cardinality as a fraction of total rows (e.g., -0.05 = 5% distinct). correlation: values near 1.0 or -1.0 mean the physical row order matches the column order — B-tree range scans are cheapest when correlation is high.
Step 3 — Validate effective_cache_size and random_page_cost.
SHOW effective_cache_size;
SHOW random_page_cost;
On cloud instances with NVMe, random_page_cost is often too high. Adjust per-session to test the planner’s response before making a permanent change.
Step 4 — Rebuild the index with corrected column order.
-- Drop the old index once the replacement is ready
CREATE INDEX CONCURRENTLY idx_orders_status_created
ON orders (status, created_at);
Use CONCURRENTLY to avoid locking writes during the build.
Step 5 — Add an INCLUDE clause to eliminate heap fetches.
Once the column order is correct, eliminate remaining Heap Fetches by moving projected-only columns into the INCLUDE clause. This enables an Index Only Scan and covering index path without widening the sort keys:
CREATE INDEX CONCURRENTLY idx_orders_status_created_covering
ON orders (status, created_at)
INCLUDE (user_id);
Step 6 — Refresh statistics and re-run EXPLAIN.
ANALYZE orders;
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT user_id, status, created_at
FROM orders
WHERE status = 'pending'
AND created_at >= '2024-01-01';
Confirm: Index Only Scan node, Heap Fetches: 0, and Buffers: shared read near zero.
Step 7 — Drop the old index after confirming the new plan.
DROP INDEX CONCURRENTLY idx_orders_status;
Common Pitfalls #
Function wrapping on indexed columns.
Diagnostic signal: Plan shows Seq Scan or a filter that contains date(created_at) or lower(email) in EXPLAIN VERBOSE output, even though the column is indexed.
Fix: Create a functional index on the same expression (CREATE INDEX ON orders (date(created_at))), or rewrite the predicate to use a range on the raw column (created_at >= '2024-01-01' AND created_at < '2024-01-02').
Leading wildcard LIKE patterns.
Diagnostic signal: LIKE '%value' or LIKE '%value%' appears in the plan filter; the B-tree is bypassed entirely.
Fix: B-trees cannot index a leading wildcard. Install the pg_trgm extension and create a GIN trigram index, or restructure the query to use full-text search predicates.
Ignoring index bloat after high-churn workloads.
Diagnostic signal: Latency creep over weeks; pg_stat_user_indexes.idx_tup_fetch diverges downward from idx_tup_read; pgstattuple reports dead leaf ratio above 20%.
Fix: REINDEX CONCURRENTLY during a maintenance window. Tighten autovacuum thresholds for the table (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 2ms) to prevent recurrence.
Over-indexing write-heavy tables.
Diagnostic signal: High INSERT/UPDATE latency; pg_stat_bgwriter shows sustained checkpoint pressure; pg_stat_user_indexes reveals indexes with idx_scan = 0 over a 30-day window.
Fix: Audit and drop zero-scan indexes. Consolidate overlapping indexes into a single index with INCLUDE columns rather than maintaining two separate structures.
Stale statistics causing plan regressions after bulk loads.
Diagnostic signal: rows estimate in EXPLAIN diverges from actual rows by 10× or more; the planner chooses a Seq Scan immediately after a large INSERT.
Fix: Run ANALYZE table_name explicitly after bulk operations. For very large tables, consider increasing default_statistics_target on high-variance columns:
ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 500;
ANALYZE orders;
B-tree applied to the wrong operator class.
Diagnostic signal: Seq Scan persists despite an index on the filtered column; the predicate uses @>, &&, @@, or geometric operators.
Fix: B-trees only index scalar comparison operators. Pivot to specialized index types such as GIN or GiST for containment, full-text, and spatial predicates. The decision between these access methods is detailed on the when to use GIN over B-tree page.
Frequently Asked Questions #
How do I verify if a B-tree index is actually being used?
Run EXPLAIN (ANALYZE, BUFFERS) on the target query. Look for an Index Scan or Index Only Scan node. Check that Heap Fetches are near zero — higher values mean the engine is still visiting the heap for uncovered columns. pg_stat_user_indexes.idx_scan also confirms historical usage across all executions since the last statistics reset.
Why does the planner ignore my B-tree index despite a matching WHERE clause?
The most common causes are: the filtered value is so frequent that a sequential scan genuinely costs less (low selectivity), statistics are stale so the planner misestimates row counts, random_page_cost is set too high for the underlying storage tier, or a function is applied to the indexed column. Run ANALYZE table_name, use SET LOCAL random_page_cost = 1.1 to test the planner’s sensitivity, and verify predicates match the index prefix without wrapping functions.
When should I use INCLUDE columns versus adding them to the main B-tree key?
Use INCLUDE for columns that appear only in the SELECT list but never in WHERE, ORDER BY, or JOIN ON clauses. Placing them in the main key widens every internal node, increases index depth, and forces the planner to sort on columns that carry no filtering benefit. INCLUDE columns are stored only in leaf pages, keeping sort keys lean while enabling Index Only Scan paths.
How frequently should B-tree indexes be rebuilt?
Trigger a REINDEX CONCURRENTLY when pgstattuple reports dead leaf ratio above 20–30%, or after major bulk delete/update operations. For steady-state tables with aggressive autovacuum, annual rebuilds are sufficient. On high-churn tables (e.g., queues or event logs with frequent deletes), monthly monitoring via pgstattuple is prudent.
Related #
- Index Tuning & Strategy — parent section covering the full index access-path taxonomy
- Covering Index Design — detailed guide to eliminating heap fetches with
INCLUDEand composite keys - Partial Index Implementation — narrow indexes that reduce size and improve selectivity for filtered subsets
- Specialized Index Types (GIN/GiST) — when to abandon B-tree for containment, text search, and spatial queries
- Sequential vs Index Scans — cost-model mechanics that govern when the planner selects each scan type