Cost vs Actual Time in EXPLAIN ANALYZE: When the Numbers Don’t Match #
EXPLAIN ANALYZE exposes a friction point that surprises many engineers: the planner’s estimated cost and the executor’s measured wall-clock time often tell completely different stories. This page explains exactly what each number represents, where the gap comes from, and how to close it systematically.
For the broader context of locating the worst node in a plan, see Identifying Plan Bottlenecks.
What the Two Numbers Actually Measure #
The cost value in every plan node is a unitless, optimizer-calculated estimate. It is built from table statistics, index selectivity, and configuration parameters such as seq_page_cost and cpu_tuple_cost. Cost is relative — it exists only to let the planner compare candidate plans against each other. It is not a time prediction.
The actual time pair that EXPLAIN ANALYZE adds is real wall-clock measurement in milliseconds, split into:
- Startup time (first number): how long until the node emits its first row.
- Total time (second number): cumulative time to produce all rows from that node.
These two numbers are measured per iteration of the node. If the node runs inside a loop, you must multiply by loops to get its true contribution to total query time. A node reporting actual time=0.05..0.08 loops=50000 costs 2,500–4,000 ms of wall-clock time — it looks trivial per row, but dominates the query.
The Specific Planner Condition That Causes Divergence #
The root cause is almost always a cardinality estimation error. When pg_statistic data is stale or missing, the planner miscalculates how many rows a scan or join will return. That wrong row count flows upward through the plan tree, corrupting the cost of every operator built on top of it.
Three concrete triggers:
-
Stale statistics:
last_analyzeis far in the past, sopg_statisticno longer reflects the current data distribution. The planner’s selectivity estimates for predicates on those columns are wrong. -
Non-sargable predicates: Functions in
WHEREclauses (WHERE date_trunc('month', created_at) = ...), implicit type casts, or expressions around a column prevent the planner from using the column’s histogram. It falls back to a fixed default selectivity (often 0.5% or 5%) that can miss the real count by orders of magnitude. Understanding how cost estimation models use per-column statistics explains why expression predicates break the model. -
Runtime conditions invisible to the planner: Lock waits, I/O saturation, and buffer cache state all affect wall-clock time but are absent from the cost model. A plan can be correctly estimated and still show high actual time under load.
Annotated EXPLAIN Evidence #
The two fields to focus on when cost and actual time diverge:
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, SUM(i.quantity * i.unit_price)
FROM orders o
JOIN order_items i ON o.order_id = i.order_id
WHERE o.created_at >= '2024-01-01'
GROUP BY o.order_id;
Hash Join (cost=320.00..8900.00 rows=120 width=40)
(actual time=1.2..3840.5 rows=94000 loops=1)
-- ^^^ rows estimate: 120; actual rows: 94000 — off by 783x.
-- This single mismatch forces the Hash Join to build a
-- table 783x larger than planned, spilling to disk.
Buffers: shared hit=1840 read=22300
-> Seq Scan on orders (cost=0.00..210.00 rows=120 width=16)
(actual time=0.04..180.3 rows=94000 loops=1)
-- ^^^ The stale estimate originates here.
Filter: (created_at >= '2024-01-01')
Rows Removed by Filter: 6000
-> Hash (cost=90.00..90.00 rows=8200 width=28)
(actual time=0.8..0.8 rows=8200 loops=1)
Batches: 4 Memory Usage: 4096kB
-- ^^^ Batches: 4 confirms disk spill. The hash table
-- had to be rebuilt across 4 passes because work_mem
-- was not enough for the actual 94000-row build side.
The two diagnostic fields are rows vs actual rows (cardinality estimation accuracy) and Batches (memory spill indicator). Both are visible in text-format EXPLAIN ANALYZE; the JSON format (FORMAT JSON) additionally exposes Actual Rows, Plan Rows, and Peak Memory Usage as structured fields that tools can parse.
Visualising the Estimation Gap #
The diagram below shows how a row-count error at a base scan propagates upward through a Hash Join and into the final aggregation node, inflating each parent’s actual cost beyond what the planner modelled.
Step-by-Step Resolution Workflow #
Step 1. Capture full node metrics, including buffer I/O:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.order_id, c.customer_name, SUM(i.quantity * i.unit_price)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items i ON o.order_id = i.order_id
WHERE o.created_at >= '2024-01-01'
GROUP BY o.order_id, c.customer_name;
Step 2. Find the node with the worst row estimate ratio. Look for rows (estimate) vs actual rows differing by 10x or more — that node is the origin of the cost divergence.
Step 3. Verify statistics freshness for the table where the estimate is wrong:
SELECT relname,
last_analyze,
last_autoanalyze,
n_live_tup,
n_dead_tup,
n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname IN ('orders', 'order_items');
-- n_mod_since_analyze >> 0 on a large table means statistics are stale.
Step 4. Refresh statistics on the affected tables, then re-run EXPLAIN ANALYZE:
ANALYZE VERBOSE orders;
ANALYZE VERBOSE order_items;
-- Verify the column histogram was updated:
SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'created_at';
Step 5. If Batches > 1 or Sort Method: external merge appear, test whether increasing work_mem eliminates the spill:
SET LOCAL work_mem = '128MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT /* your query here */;
-- Compare Batches: before and after.
-- If Batches drops to 1, the spill was the bottleneck.
Step 6. Re-run EXPLAIN (ANALYZE, BUFFERS) after each change and confirm that rows and actual rows converge and that total actual time decreases.
Before/After Plan Comparison #
After ANALYZE refreshed the column statistics:
-- BEFORE (stale statistics)
Seq Scan on orders (cost=0.00..210.00 rows=120 ...)
(actual time=0.04..180.3 rows=94000 ...)
-- 783x row underestimate; downstream Hash Join spills to 4 batches.
-- AFTER (fresh statistics)
Seq Scan on orders (cost=0.00..1240.00 rows=94000 ...)
(actual time=0.04..42.1 rows=94000 ...)
-- Estimate matches reality; Hash Join fits in memory (Batches: 1).
The planner’s cost rose significantly once it knew the true cardinality, but the actual time dropped from 3,840 ms to 42 ms because the correct row count let it allocate enough work_mem and choose an in-memory hash build.
Common Pitfalls #
Treating cost as milliseconds. Cost is a dimensionless comparative heuristic. A cost of 8,900 does not mean 8,900 ms. Equating them leads to misguided tuning decisions.
Ignoring the loops multiplier. Nested loop inner scans show per-iteration times. Multiply actual time by loops to measure the node’s real contribution. This is the single most common misread in plan analysis.
Tuning work_mem without checking Batches first. Raising work_mem globally is expensive. Verify the spill indicator (Batches > 1 for hash nodes, Sort Method: external merge for sort nodes) before changing memory settings. If spill is not the issue, extra memory has no effect.
Fixing cost without re-checking actual time. After running ANALYZE, re-run EXPLAIN ANALYZE to confirm the actual time improved. Statistics refresh changes the plan choice, but the new plan still needs to be validated against real execution.
Frequently Asked Questions #
Why does EXPLAIN ANALYZE show a low cost but high actual time?
Low cost with high actual time indicates inaccurate cardinality estimates. The planner chose an inefficient join strategy or scan based on wrong row count predictions. Stale statistics, missing indexes, or non-sargable predicates force the executor to process far more rows than anticipated. Run ANALYZE on the affected tables to update pg_statistic.
Should I optimize based on cost or actual time?
Always use actual time and actual rows as your primary signal. Cost is a planning heuristic — it guides plan selection but does not reflect real hardware performance, I/O latency, or memory pressure. Use cost only to understand why the planner made a specific choice, then validate any change against actual time.
How do I account for the loops multiplier when reading actual time?
Multiply the per-iteration actual time total by loops. A node showing actual time=0.05..0.08 loops=50000 contributes 2,500–4,000 ms of total wall-clock work despite its tiny per-iteration figure. Missing this multiplier is one of the most common errors when interpreting parallel query execution plans or nested loop plans.
Related #
- Identifying Plan Bottlenecks — parent: locating the highest-cost node in a plan tree
- Reading & Interpreting Query Plans — top-level guide to plan anatomy and diagnostic strategy
- Cost Estimation Models — how PostgreSQL builds the unitless cost figure from statistics
- Parallel Query Execution — how worker counts and Gather nodes affect actual time readings
- Sort and Hash Node Analysis — diagnosing Batches and Sort Method spill indicators in detail