B-Tree Index Optimization: Execution Plan Diagnostics & Query Tuning #

Q: How do I verify if a B-tree index is actually being used?

Run EXPLAIN (ANALYZE, BUFFERS) on the target query. Look for Index Scan or Index Only Scan nodes. Confirm that Buffers: shared read values are low and that Heap Fetches approach zero for optimal performance.

Q: Why does the planner ignore my B-tree index despite a matching WHERE clause?

Common causes: low table cardinality relative to the filter selectivity, outdated statistics, skewed data distribution, or functions applied to indexed columns. Run ANALYZE table_name, verify random_page_cost matches your storage tier, and ensure predicates match the index leftmost prefix exactly.

Q: When should I use INCLUDE columns versus adding them to the main B-tree key?

Use INCLUDE for columns that appear only in SELECT projections but never in WHERE, ORDER BY, or JOIN conditions. Adding them to the key increases index depth, degrades write performance, and complicates sort-direction logic. INCLUDE keeps the sort hierarchy lean while enabling Index Only Scans.

Q: How frequently should B-tree indexes be rebuilt?

Reindex when bloat exceeds 20–30% of total index size, or after bulk deletions and large data loads. Use REINDEX CONCURRENTLY in production to avoid table locks. Monitor bloat with pgstattuple and schedule rebuilds during low-traffic windows.

B-tree indexes are PostgreSQL’s default access method, yet their performance degrades silently under skewed data distributions, misordered columns, or unchecked table bloat.

When the Optimizer Chooses a B-Tree Scan #

The planner selects a B-tree scan when the filter predicate matches an indexed column using equality or range operators (=, <, >, BETWEEN, IN on small sets). The decision is cost-driven: PostgreSQL computes an expected cost for the index path against the expected cost of a sequential scan, factoring in random_page_cost, seq_page_cost, and the table’s selectivity estimate derived from pg_statistic.

Three conditions trigger a B-tree path over a sequential scan:

The predicate matches the index’s leftmost prefix — partial prefix matches still apply but only the matching prefix columns drive the index traversal.
The planner’s row-count estimate for the filtered result set is small enough that random-access page fetches cost less than a full table sweep.
effective_cache_size is set high enough that the planner trusts the OS page cache to hold index pages, reducing the cost penalty of random I/O.

When none of these conditions hold, the planner falls back to a sequential scan or a Bitmap Index Scan. Understanding when the optimizer reaches for an index — and when it doesn’t — is the foundation of index tuning strategy. At least one sibling technique, partial index implementation, narrows the index to a filtered subset of rows and can make the cost calculation tip decisively toward an index path even when selectivity is moderate.

Annotated EXPLAIN Node Breakdown #

Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) before touching any index. The node fields that matter most for B-tree diagnosis:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT user_id, status, created_at
FROM orders
WHERE status = 'pending'
  AND created_at >= '2024-01-01';

Index Scan using idx_orders_status on orders
  (cost=0.43..142.89 rows=1200 width=48)        -- planner estimate
  (actual time=0.052..18.431 rows=1185 loops=1)  -- measured at runtime
  Index Cond: (status = 'pending'::text)         -- index-driven predicate
  Filter: (created_at >= '2024-01-01'::date)     -- post-index filter (heap visit)
  Rows Removed by Filter: 452                    -- pages fetched and discarded
  Heap Fetches: 1637                             -- random heap I/O — target: 0
  Buffers: shared hit=12 read=1405               -- 1405 pages fetched from disk
Execution Time: 18.912 ms

Key signals:

Index Cond vs Filter: Predicates under Index Cond are resolved inside the B-tree. Predicates under Filter are applied after visiting the heap — they generate random I/O even when the row is eventually discarded.
Heap Fetches: Every non-zero value means the engine visited the heap page to retrieve columns not stored in the index. The goal after tuning is zero.
Buffers: shared read: High values with low shared hit indicate the working set is not in the buffer pool, amplifying latency on every index scan.
Row estimate accuracy: rows=1200 vs actual rows=1185 is close here. A factor-of-10 or larger divergence signals stale statistics or data skew — the root cause of many plan regressions. The cost estimation models page explains how PostgreSQL derives these estimates from histogram buckets and MCV lists.

Algorithm Internals: B-Tree Traversal and the Leftmost Prefix Rule #

A B-tree index is a balanced tree of sorted key pages. Traversal always begins at the root, descends through internal nodes comparing key values, and lands on a leaf page that holds pointers (ctid) to heap rows. The traversal is O(log N) in page reads — extremely fast for small result sets.

The leftmost prefix constraint is fundamental: the B-tree can only drive traversal on a contiguous prefix of the indexed columns, starting from the first. An index on (status, created_at, user_id) supports these access patterns:

Predicate	Index usage
`WHERE status = 'pending'`	Full prefix on `status`
`WHERE status = 'pending' AND created_at >= '2024-01-01'`	Full prefix, range on second col
`WHERE created_at >= '2024-01-01'`	No prefix match — seq scan
`WHERE status = 'pending' AND user_id = 42`	Index on `status` only; `user_id` skipped

A range predicate on the leading column (WHERE created_at >= '2024-01-01' without equality on status) forces the planner to scan a contiguous range of leaf pages bounded only by the range start. That consumes far more leaf pages than equality filtering first.

Before/after plan comparison — fixing column order:

Before (index on (created_at, status), range predicate leads):

Index Scan using idx_orders_created_status on orders
  (cost=0.43..298.71 rows=1200 width=48)
  Rows Removed by Filter: 8241              -- large post-index filter
  Heap Fetches: 9441

After (index on (status, created_at), equality predicate leads):

Index Scan using idx_orders_status_created on orders
  (cost=0.43..142.89 rows=1200 width=48)
  Rows Removed by Filter: 452
  Heap Fetches: 1637

The corrected column order eliminates 88% of heap fetches before the INCLUDE optimization even applies.

Memory, I/O, and Resource Behavior #

B-tree index scans do not consume work_mem during traversal — the tree walk itself is buffer-pool resident. However, two resource behaviors deserve attention:

Buffer pool pressure from high Heap Fetches. Each heap fetch is a random page read. On spinning disks, random I/O cost is 4–10× sequential. Even on SSDs, a plan generating 10,000 heap fetches across a large table will evict useful pages from shared_buffers, degrading concurrent queries. The ratio of Buffers: shared read to shared hit in EXPLAIN BUFFERS output directly quantifies this pressure.

Index bloat from dead tuples. PostgreSQL’s MVCC model does not immediately reclaim space when rows are updated or deleted. Dead tuple versions accumulate in both the heap and index leaf pages. Index bloat forces the planner to traverse more leaf pages per scan — even when the live row density is low — and inflates cost estimates. Monitor bloat with:

SELECT
  indexrelname,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan,
  idx_tup_read,
  idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;

Low idx_tup_fetch relative to idx_tup_read is a secondary signal: the engine traversed many index entries but retrieved far fewer heap rows — consistent with high bloat or a poorly selective index condition.

random_page_cost calibration. The planner’s decision to use a B-tree versus a sequential scan is sensitive to random_page_cost. The default is 4.0, calibrated for spinning disk. For NVMe-backed instances, setting random_page_cost = 1.1 to 1.5 corrects the planner’s over-weighting of random I/O and recovers index paths that were incorrectly suppressed:

SET LOCAL random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

Step-by-Step Tuning Workflow #

Step 1 — Capture the baseline plan.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT user_id, status, created_at
FROM orders
WHERE status = 'pending'
  AND created_at >= '2024-01-01';

Record scan node type, Heap Fetches, Buffers: shared read, and row estimate accuracy.

Step 2 — Inspect column cardinality before redesigning the index.

SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders'
  AND attname IN ('status', 'created_at', 'user_id');

n_distinct: negative values mean PostgreSQL expresses cardinality as a fraction of total rows (e.g., -0.05 = 5% distinct). correlation: values near 1.0 or -1.0 mean the physical row order matches the column order — B-tree range scans are cheapest when correlation is high.

Step 3 — Validate effective_cache_size and random_page_cost.

SHOW effective_cache_size;
SHOW random_page_cost;

On cloud instances with NVMe, random_page_cost is often too high. Adjust per-session to test the planner’s response before making a permanent change.

Step 4 — Rebuild the index with corrected column order.

-- Drop the old index once the replacement is ready
CREATE INDEX CONCURRENTLY idx_orders_status_created
  ON orders (status, created_at);

Use CONCURRENTLY to avoid locking writes during the build.

Step 5 — Add an INCLUDE clause to eliminate heap fetches.

Once the column order is correct, eliminate remaining Heap Fetches by moving projected-only columns into the INCLUDE clause. This enables an Index Only Scan and covering index path without widening the sort keys:

CREATE INDEX CONCURRENTLY idx_orders_status_created_covering
  ON orders (status, created_at)
  INCLUDE (user_id);

Step 6 — Refresh statistics and re-run EXPLAIN.

ANALYZE orders;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT user_id, status, created_at
FROM orders
WHERE status = 'pending'
  AND created_at >= '2024-01-01';

Confirm: Index Only Scan node, Heap Fetches: 0, and Buffers: shared read near zero.

Step 7 — Drop the old index after confirming the new plan.

DROP INDEX CONCURRENTLY idx_orders_status;

Common Pitfalls #

Function wrapping on indexed columns. Diagnostic signal: Plan shows Seq Scan or a filter that contains date(created_at) or lower(email) in EXPLAIN VERBOSE output, even though the column is indexed. Fix: Create a functional index on the same expression (CREATE INDEX ON orders (date(created_at))), or rewrite the predicate to use a range on the raw column (created_at >= '2024-01-01' AND created_at < '2024-01-02').

Leading wildcard LIKE patterns. Diagnostic signal: LIKE '%value' or LIKE '%value%' appears in the plan filter; the B-tree is bypassed entirely. Fix: B-trees cannot index a leading wildcard. Install the pg_trgm extension and create a GIN trigram index, or restructure the query to use full-text search predicates.

Ignoring index bloat after high-churn workloads. Diagnostic signal: Latency creep over weeks; pg_stat_user_indexes.idx_tup_fetch diverges downward from idx_tup_read; pgstattuple reports dead leaf ratio above 20%. Fix: REINDEX CONCURRENTLY during a maintenance window. Tighten autovacuum thresholds for the table (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_delay = 2ms) to prevent recurrence.

Over-indexing write-heavy tables. Diagnostic signal: High INSERT/UPDATE latency; pg_stat_bgwriter shows sustained checkpoint pressure; pg_stat_user_indexes reveals indexes with idx_scan = 0 over a 30-day window. Fix: Audit and drop zero-scan indexes. Consolidate overlapping indexes into a single index with INCLUDE columns rather than maintaining two separate structures.

Stale statistics causing plan regressions after bulk loads. Diagnostic signal: rows estimate in EXPLAIN diverges from actual rows by 10× or more; the planner chooses a Seq Scan immediately after a large INSERT. Fix: Run ANALYZE table_name explicitly after bulk operations. For very large tables, consider increasing default_statistics_target on high-variance columns:

ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 500;
ANALYZE orders;

B-tree applied to the wrong operator class. Diagnostic signal: Seq Scan persists despite an index on the filtered column; the predicate uses @>, &&, @@, or geometric operators. Fix: B-trees only index scalar comparison operators. Pivot to specialized index types such as GIN or GiST for containment, full-text, and spatial predicates. The decision between these access methods is detailed on the when to use GIN over B-tree page.

Frequently Asked Questions #

How do I verify if a B-tree index is actually being used?

Run EXPLAIN (ANALYZE, BUFFERS) on the target query. Look for an Index Scan or Index Only Scan node. Check that Heap Fetches are near zero — higher values mean the engine is still visiting the heap for uncovered columns. pg_stat_user_indexes.idx_scan also confirms historical usage across all executions since the last statistics reset.

Why does the planner ignore my B-tree index despite a matching WHERE clause?

The most common causes are: the filtered value is so frequent that a sequential scan genuinely costs less (low selectivity), statistics are stale so the planner misestimates row counts, random_page_cost is set too high for the underlying storage tier, or a function is applied to the indexed column. Run ANALYZE table_name, use SET LOCAL random_page_cost = 1.1 to test the planner’s sensitivity, and verify predicates match the index prefix without wrapping functions.

When should I use INCLUDE columns versus adding them to the main B-tree key?

Use INCLUDE for columns that appear only in the SELECT list but never in WHERE, ORDER BY, or JOIN ON clauses. Placing them in the main key widens every internal node, increases index depth, and forces the planner to sort on columns that carry no filtering benefit. INCLUDE columns are stored only in leaf pages, keeping sort keys lean while enabling Index Only Scan paths.

How frequently should B-tree indexes be rebuilt?

Trigger a REINDEX CONCURRENTLY when pgstattuple reports dead leaf ratio above 20–30%, or after major bulk delete/update operations. For steady-state tables with aggressive autovacuum, annual rebuilds are sufficient. On high-churn tables (e.g., queues or event logs with frequent deletes), monthly monitoring via pgstattuple is prudent.

Index Tuning & Strategy — parent section covering the full index access-path taxonomy
Covering Index Design — detailed guide to eliminating heap fetches with INCLUDE and composite keys
Partial Index Implementation — narrow indexes that reduce size and improve selectivity for filtered subsets
Specialized Index Types (GIN/GiST) — when to abandon B-tree for containment, text search, and spatial queries
Sequential vs Index Scans — cost-model mechanics that govern when the planner selects each scan type