Cost Estimation Models in Database Query Optimization #

The PostgreSQL cost estimator translates expected I/O patterns, CPU cycles, and data distribution into a single dimensionless cost metric — and the planner picks whichever execution path scores lowest.

When the optimizer relies on these cost calculations #

Cost estimation drives every access-path decision in execution plan fundamentals: whether to scan a table sequentially or through an index, which join algorithm to use, and in what order to join tables. The planner evaluates all candidate plans, assigns each a cost, and commits to the cheapest one — without executing a single row.

Two eligibility constraints determine whether the cost model can make a good decision:

  1. Statistics currency. The planner reads precomputed histograms and most-common-values (MCV) lists from pg_stats. If ANALYZE has not run since the last bulk load, histograms reflect stale data distributions and estimates will drift.
  2. Cardinality independence assumption. Single-column statistics assume each column’s predicate is statistically independent of others. When a query filters on correlated columns — for example WHERE region = 'EMEA' AND status = 'active' — single-column histograms compound the error multiplicatively.

When estimates are accurate, the planner makes near-optimal choices across sequential vs index scans, hash join mechanics, and join reordering. When they drift, the planner may force a full table scan on a 200 M-row table where an index seek would cost 1000× less.

Annotated EXPLAIN node breakdown #

Run this command against any slow query to surface the fields that matter most for cost diagnosis:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, c.email
FROM   orders o
JOIN   customers c ON o.customer_id = c.id
WHERE  o.created_at > '2024-01-01'
  AND  c.region = 'EMEA';

A representative plan output with annotation targets:

Hash Join  (cost=4231.00..18940.52 rows=312 width=24)
           (actual time=42.3..310.7 rows=9841 loops=1)
  -- ^^^ rows=312 estimated, 9841 actual: 31× drift → bad cardinality
  -- cost=startup..total (unitless planner metric, not milliseconds)
  Buffers: shared hit=1820 read=6104
  -- ^^^ shared read >> shared hit: more disk I/O than cache → effective_cache_size too optimistic
  ->  Seq Scan on customers c  (cost=0.00..8410.00 rows=41800 width=16)
        (actual time=0.1..28.4 rows=41800 loops=1)
        Filter: (region = 'EMEA')
        Rows Removed by Filter: 358200
        -- ^^^ 89% rows discarded: low selectivity, but planner knew this
  ->  Hash  (cost=3880.00..3880.00 rows=28080 width=12)
            (actual time=38.1..38.1 rows=28080 loops=1)
      ->  Seq Scan on orders o  (cost=0.00..3880.00 rows=28080 width=12)
            Filter: (created_at > '2024-01-01')
            Rows Removed by Filter: 171920

Key fields to inspect on every node:

Field What it reveals
rows (plan) vs actual rows Cardinality accuracy — gap > 10× warrants ANALYZE or extended statistics
cost=startup..total Unitless planner weight; compare siblings to find the expensive subtree
Buffers: shared hit/read Cache vs disk split; high read means effective_cache_size is underestimated
Rows Removed by Filter Filter selectivity — large removals on a Seq Scan suggest a missing index
loops Node executes this many times; multiply actual rows by loops for true cardinality

Cost model internals #

PostgreSQL decomposes plan execution into discrete operations and weights each with a configuration parameter:

Parameter Default Meaning
seq_page_cost 1.0 Baseline cost to read one page sequentially
random_page_cost 4.0 Cost to read one page via random I/O
cpu_tuple_cost 0.01 Cost to process one heap tuple
cpu_index_tuple_cost 0.005 Cost to process one index entry
cpu_operator_cost 0.0025 Cost to evaluate one operator or function

The 4:1 ratio between random_page_cost and seq_page_cost encodes traditional spinning-disk assumptions: a random seek costs roughly four sequential reads. On NVMe or SSD, that ratio collapses to 1.1–1.5:1. When the parameter still reads 4.0, the planner systematically overvalues sequential scans and undervalues index seeks.

The formula for a simple sequential scan node is:

seq_scan_cost = (pages_read × seq_page_cost) + (rows_read × cpu_tuple_cost)

For an index scan node the planner adds the index-descent cost and accounts for random page fetches per qualifying row:

index_scan_cost = (index_pages × random_page_cost)
                + (heap_fetches × random_page_cost × index_correlation_factor)
                + (index_tuples × cpu_index_tuple_cost)
                + (heap_tuples × cpu_tuple_cost)

effective_cache_size enters as a signal — not a memory allocation — telling the planner how large the OS page cache is. A larger value shifts the estimated probability that a random page is already cached, reducing effective random_page_cost. See how PostgreSQL calculates node costs for the exact per-node formulas.

The SVG below maps how these parameters flow into path-cost comparison:

PostgreSQL cost model decision flow Diagram showing how seq_page_cost, random_page_cost, cpu costs, statistics, and effective_cache_size feed into path cost calculations and planner path selection. seq_page_cost = 1.0 random_page_cost = 4.0 cpu_tuple_cost cpu_operator_cost pg_stats histograms + MCV lists effective_cache_size CREATE STATISTICS (multi-column) Path Cost Calculator (per candidate path) Seq Scan cost Index Scan cost Hash Join cost lowest Chosen Plan (minimum cost wins) INPUTS ESTIMATION CANDIDATES

Memory, I/O, and resource behavior #

Cost estimation directly shapes memory allocation decisions. When the planner underestimates the number of rows flowing into a hash join, it allocates insufficient work_mem. At runtime, the hash table grows beyond that budget and PostgreSQL spills it to a temporary file — often multiplying execution time by 5–20×.

The Batches field in an EXPLAIN (ANALYZE) hash join node is the definitive signal:

-- Batches: 1  → in-memory, no spill
-- Batches: 8  → eight passes over the probe side, heavy disk I/O
Hash  (cost=28000.00..28000.00 rows=224000 width=32)
      (actual time=1840.2..1840.2 rows=224000 loops=1)
  Buckets: 262144  Batches: 8  Memory Usage: 4096kB

If Batches > 1, the row estimate was either accurate but work_mem is too low, or the estimate was wrong and you need statistics tuning. Distinguish the two cases by checking whether rows (plan) ≈ actual rows first — if they match, raise work_mem; if they diverge, fix the statistics.

effective_cache_size influences the planner’s index-vs-seq-scan choice by modelling OS cache coverage. On systems where shared buffers + OS page cache covers most of the working set, set effective_cache_size to 75 % of total RAM. The planner does not allocate that memory — it uses the value only to adjust random-page cost during estimation.

Step-by-step tuning workflow #

Step 1 — Capture the baseline.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT /* your slow query here */;
-- Record: Plan Rows, Actual Rows, Buffers shared read/hit, total actual time

Step 2 — Identify the node with the largest row-estimate gap.

A drift ratio ≥ 10× on any node is the primary target. Calculate it as actual rows / plan rows (or the inverse when the plan over-estimates).

Step 3 — Audit statistics for the filtered columns.

SELECT attname,
       n_distinct,
       array_length(most_common_vals::text[], 1)  AS mcv_count,
       array_length(histogram_bounds::text[], 1)  AS histogram_buckets
FROM   pg_stats
WHERE  tablename = 'orders'
  AND  attname IN ('created_at', 'status', 'region');
-- mcv_count NULL and histogram_buckets = 100 → increase statistics target
-- n_distinct wrong sign → ANALYZE has not run since schema change

Step 4 — Refresh statistics, raise target if needed.

-- Raise histogram resolution for a high-cardinality date column
ALTER TABLE orders ALTER COLUMN created_at SET STATISTICS 500;
ANALYZE orders;

Step 5 — Add extended statistics for correlated predicates.

-- Capture joint distribution between region and status
CREATE STATISTICS orders_region_status ON region, status FROM orders;
ANALYZE orders;
-- Confirm the statistic exists
SELECT stxname, stxkind FROM pg_statistic_ext WHERE stxrelid = 'orders'::regclass;

Step 6 — Adjust cost parameters to match hardware (session-scoped first).

SET LOCAL random_page_cost    = 1.1;   -- SSD or NVMe
SET LOCAL effective_cache_size = '12GB'; -- reflect OS page cache + shared_buffers
EXPLAIN (ANALYZE, BUFFERS)
SELECT /* same query */;
-- Confirm: Seq Scan → Index Scan transition and reduced shared read count

If the session-scoped override gives better plans, apply the change at database or server level only after validating no other workload regresses.

Common pitfalls #

Treating cost as milliseconds. Cost is a unitless planner weight. A node with cost=50000 does not take 50 seconds. Compare costs across siblings in the same plan tree; never compare across query runs or databases.

Ignoring column correlation. Single-column histograms assume predicates are independent. A query like WHERE country = 'DE' AND city = 'Berlin' will be underestimated because city is not independent of country. Fix with CREATE STATISTICS ... (ndistinct, dependencies).

Lowering random_page_cost without benchmarking. Set it to 1.1 for NVMe as a starting point, but verify the actual random-vs-sequential I/O ratio on your storage tier with pg_test_timing or an I/O benchmark before making it permanent.

Running EXPLAIN without ANALYZE. Plan-only EXPLAIN shows estimated rows and cost but hides actual rows, Buffers, and Batches. You cannot diagnose cardinality drift without ANALYZE.

Global statistics target changes instead of per-column. Raising default_statistics_target from 100 to 500 site-wide increases ANALYZE time and catalog size proportionally. Target only the columns that appear in WHERE, JOIN ON, or GROUP BY clauses with proven estimate drift.

Missing a VACUUM when rows are dead. ANALYZE samples live tuples. If a table has millions of dead tuples from bulk deletes, n_live_tup in pg_stat_user_tables will be inflated, leading to over-estimated table size. Run VACUUM ANALYZE in this case.

Frequently Asked Questions #

What does ‘cost’ actually measure in an execution plan? Cost is a dimensionless planner metric representing the estimated I/O and CPU resources required to execute a plan node. It is calculated using weighted parameters for sequential reads, random reads, tuple processing, and operator evaluation. It does not map to wall-clock time but serves as a relative comparison metric — the planner always picks the path with the lowest total cost.

How do I fix large gaps between estimated and actual rows? Run ANALYZE on the affected tables, increase default_statistics_target for skewed columns, and create extended statistics for correlated predicates. If the gap persists, query pg_stats to verify histogram buckets and MCV lists, then consider partial indexes or query rewrites to expose clearer cardinality boundaries to the planner.

Do cost models account for NVMe or SSD storage? Default cost parameters assume spinning-disk latency. On flash storage, random I/O penalty shrinks dramatically. Lower random_page_cost to 1.1–1.5 and set effective_cache_size to reflect your OS page cache to align the cost model with actual hardware characteristics.

When should I use ANALYZE vs VACUUM ANALYZE? Use ANALYZE when you only need to refresh statistics for the cost model. Use VACUUM ANALYZE when the table has experienced heavy DELETE or UPDATE activity and requires dead tuple cleanup alongside statistics refresh. For read-heavy or insert-only workloads, ANALYZE alone is sufficient and less disruptive.


Related