Merge Join vs Nested Loop: Execution Plan Diagnostics #
These two join algorithms handle row combination in fundamentally different ways — one streams sorted inputs in a single pass, the other probes an inner relation once per outer row — and the difference in I/O, memory, and latency between a correct and incorrect choice can span an order of magnitude.
When the optimizer chooses each algorithm #
PostgreSQL’s cost model drives the choice. Understanding the thresholds and eligibility constraints lets you predict plan decisions and intervene when estimates are wrong.
Nested loop eligibility and cost triggers #
A nested loop is eligible for any equijoin or non-equijoin. The optimizer calculates its total cost roughly as:
total_cost ≈ outer_scan_cost + (outer_rows × inner_lookup_cost)
The model favours a nested loop when:
- The outer relation is small after predicate filtering, so
outer_rowsis low. - The inner relation has a highly selective B-tree index on the join key, making
inner_lookup_costclose to O(log N) per probe. - The estimated result set is small enough that the repeated inner lookups stay in the shared buffer cache.
This is why the optimizer typically reaches for a nested loop on OLTP point-lookups: a WHERE orders.id = $1 filter reduces the outer set to one row, and the inner customer lookup hits a primary-key index.
Merge join eligibility and cost triggers #
A merge join requires both inputs to arrive in matching sort order on the join key. The optimizer’s cost model is:
total_cost ≈ sort_cost(outer) + sort_cost(inner) + merge_scan_cost(outer + inner)
When both join columns are covered by B-tree indexes, sort_cost collapses to zero — the index scan delivers rows in sorted order for free. The merge scan itself is O(N + M), so the algorithm dominates at high cardinalities where a nested loop’s O(N × M) cost becomes prohibitive.
The optimizer picks a merge join when:
- Both join column cardinalities are high and neither input is small after filtering.
- Pre-existing B-tree indexes cover the join keys, eliminating the sort step.
- The result set is expected to be large (many matching rows), making the linear scan cheaper than repeated inner lookups.
When statistics are stale or indexes are missing, the optimizer frequently falls back to a hash join instead of a merge join, because building a hash table beats an unindexed sort for moderate-sized inputs. For a deeper look at how the planner weights these options, see cost estimation models.
Annotated EXPLAIN node breakdown #
Nested loop plan #
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01';
Nested Loop (cost=0.43..142.50 rows=12 width=64)
(actual time=0.045..0.892 rows=12 loops=1)
Buffers: shared hit=38
-> Index Scan using idx_orders_created on orders o
(cost=0.29..85.20 rows=12 width=32)
(actual time=0.012..0.089 rows=12 loops=1)
Index Cond: (created_at > '2024-01-01'::date)
Buffers: shared hit=5
-> Index Scan using customers_pkey on customers c
(cost=0.14..4.76 rows=1 width=32)
(actual time=0.015..0.016 rows=1 loops=12)
Index Cond: (id = o.customer_id)
Buffers: shared hit=3
Key fields to inspect:
| Field | What it tells you |
|---|---|
loops=12 on inner node |
The inner index scan executed 12 times — once per outer row. Multiply actual time by loops for total inner cost. |
rows=12 at top level |
Estimated and actual row counts match; statistics are healthy. |
Buffers: shared hit=3 per inner loop |
All inner lookups served from shared buffer cache — no disk I/O. |
actual time=0.015..0.016 per loop |
Sub-millisecond inner lookup confirms the primary key index is effective. |
A loops count in the thousands combined with shared read (not hit) on the inner node is the primary signal that a nested loop is causing excessive random I/O.
Merge join plan #
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.id, t2.payload
FROM events t1
JOIN events t2 ON t1.session_id = t2.session_id
WHERE t1.event_type = 'page_view';
Merge Join (cost=0.87..58420.10 rows=824000 width=48)
(actual time=0.031..1240.55 rows=823891 loops=1)
Merge Cond: (t1.session_id = t2.session_id)
Buffers: shared hit=14820
-> Index Scan using idx_events_session on events t1
(cost=0.43..22000.00 rows=900000 width=24)
(actual time=0.015..310.22 rows=900000 loops=1)
Filter: (event_type = 'page_view'::text)
Rows Removed by Filter: 87430
-> Index Scan using idx_events_session on events t2
(cost=0.43..22000.00 rows=900000 width=24)
(actual time=0.012..305.11 rows=900000 loops=1)
Key fields to inspect:
| Field | What it tells you |
|---|---|
Merge Cond |
The equality predicate both inputs are sorted on. |
No Sort node |
Both inputs arrive pre-sorted from their B-tree index scans — no sort cost. |
loops=1 on both inputs |
Each input is scanned exactly once in a single streaming pass. |
Buffers: shared hit=14820 |
High hit count is expected for a large scan; absence of shared read confirms the data is cached. |
If you see a Sort node feeding into a Merge Join, the inputs were not pre-sorted — adding an index or increasing work_mem to keep the sort in memory are the two levers available.
Algorithm internals #
How nested loop executes #
for each row R in outer_relation:
for each row S in inner_relation WHERE join_condition(R, S):
emit (R, S)
With no inner index, this degrades to a full inner scan per outer row. With a B-tree or hash index on the inner join key, the inner probe becomes a single index lookup, and the algorithm is efficient as long as the outer relation stays small.
The role of caching: Repeated inner probes on a small inner table often stay in the shared buffer cache after the first few loops. Once cache warm, subsequent loops incur only CPU cost, not I/O — which explains why a nested loop with loops=500 and shared hit on every inner probe can still outperform a merge join that requires a large sort step.
How merge join executes #
sort outer_relation on join_key -- skipped when index provides order
sort inner_relation on join_key -- skipped when index provides order
pointer_outer = first(outer)
pointer_inner = first(inner)
while not end(outer) and not end(inner):
if outer.join_key == inner.join_key:
emit all matching combinations
advance both pointers
elif outer.join_key < inner.join_key:
advance outer pointer
else:
advance inner pointer
The algorithm handles duplicate join keys by backing up the inner pointer to the first match for the current outer key and re-emitting all inner matches — a “mark and restore” operation. For high-cardinality join keys with few duplicates, this is negligible. For very low-cardinality keys (e.g. a status column with three distinct values), the back-tracking amplifies cost and a hash join is typically cheaper.
Before/after plan comparison #
Adding a composite index on the join key converts a sort-driven merge join to an index-driven merge join:
-- BEFORE: explicit Sort node
Merge Join (actual time=980.12..3240.55 rows=823891 loops=1)
-> Sort (actual time=820.44..910.22 rows=900000 loops=1)
Sort Key: t1.session_id
Sort Method: external merge Disk: 42MB
-> Seq Scan on events t1 ...
-- AFTER: index provides sort order, Sort node eliminated
Merge Join (actual time=0.031..1240.55 rows=823891 loops=1)
-> Index Scan using idx_events_session on events t1 ...
-> Index Scan using idx_events_session on events t2 ...
Sort Method: external merge Disk is the clearest diagnostic signal that a sort spilled to disk — the fix is either a covering index or a higher work_mem setting.
Visual: algorithm data flow #
Memory, I/O, and resource behavior #
Nested loop resource profile #
A nested loop’s memory footprint is minimal — it holds only the current outer row and the current inner result. The I/O profile depends entirely on whether inner lookups hit the shared buffer cache:
- Cache-warm inner table: Each inner probe costs only a few CPU cycles. This is the best case and is common when the inner table is small relative to
shared_buffers. - Cache-cold inner table: Each inner probe generates random disk reads. At
loops=15000, 15,000 random 8 KB block reads at even 0.1 ms each is 1.5 seconds of I/O — a plan that looks cheap in estimates becomes catastrophically slow in production.
Monitor Buffers: shared read on the inner node. If shared read is non-zero and loops is high, the inner table is not fitting in cache and the nested loop is generating avoidable random I/O. This is the most common nested loop performance trap in sequential vs index scan investigations.
Merge join resource profile #
The merge join itself uses minimal memory — it only needs to buffer rows that share the same join key value for the mark-and-restore operation. The expensive resource is the sort step, if one is needed:
- Sort within
work_mem: The sort runs in memory and adds moderate latency.Sort Method: quicksortorSort Method: top-N heapsortin the plan confirms an in-memory sort. - Sort spills to disk: When the input exceeds
work_mem, PostgreSQL writes sort runs to temporary files on disk.Sort Method: external merge Disk: 42MBin the plan is the diagnostic signal. The fix is either adding an index that provides pre-sorted order (which eliminates the Sort node entirely) or raisingwork_memfor the session.
Check temporary file usage with:
SELECT query, temp_blks_written
FROM pg_stat_statements
WHERE query ILIKE '%your_table%'
ORDER BY temp_blks_written DESC
LIMIT 10;
High temp_blks_written confirms sort spills. Cross-reference with sort and hash node analysis for a full treatment of sort spill diagnostics.
Step-by-step tuning workflow #
Step 1: Capture a baseline with full instrumentation.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01';
Save the output. Note the join node type, loops on the inner node, and shared read vs shared hit at each level.
Step 2: Check actual vs estimated row counts on both join inputs.
If estimated rows are far below actual rows on either input, the optimizer received bad cardinality signals. Run:
SELECT tablename, attname, n_distinct, correlation, null_frac
FROM pg_stats
WHERE tablename IN ('orders', 'customers')
AND attname IN ('customer_id', 'created_at', 'id');
Stale or skewed statistics cause the optimizer to underestimate input sizes and prefer nested loops for inputs that are actually large.
Step 3: Update statistics and recheck.
ANALYZE orders;
ANALYZE customers;
Then re-run the EXPLAIN (ANALYZE, BUFFERS) query and compare join node choice and row estimates.
Step 4: Verify index coverage on join columns.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('orders', 'customers')
ORDER BY tablename, indexname;
Confirm that both join columns (customer_id on orders, id on customers) have usable B-tree indexes. For a merge join, also confirm the index provides the required sort direction.
Step 5: Test alternative join strategies with SET LOCAL.
BEGIN;
-- Force merge join and measure
SET LOCAL enable_nestloop = off;
SET LOCAL enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2024-01-01';
ROLLBACK;
Compare the actual times from the forced plan against your baseline. SET LOCAL scopes the change to the current transaction, which makes it safe to test in production read replicas.
Step 6: Raise work_mem if a sort node is spilling.
BEGIN;
SET LOCAL work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT /* your query */ ...;
ROLLBACK;
If the sort moves from external merge Disk to quicksort Memory, the sort was the bottleneck. Set work_mem at the session level for queries that consistently require large sorts, rather than raising it globally.
Step 7: Add indexes where the plan confirms they are missing.
If both join inputs are large, cardinality is high, and no B-tree index covers the join keys, add them:
CREATE INDEX CONCURRENTLY idx_orders_customer
ON orders (customer_id);
CREATE INDEX CONCURRENTLY idx_customers_id
ON customers (id);
CONCURRENTLY avoids a table lock in production. Re-run EXPLAIN (ANALYZE, BUFFERS) after the index builds to confirm the plan shifts.
Common pitfalls #
Implicit type cast silently disabling index use. When the join columns have mismatched types (e.g. orders.customer_id is integer and customers.id is bigint), PostgreSQL inserts an implicit cast that prevents index usage on one side. Diagnostic signal: the plan shows a Seq Scan on the inner node even though an index exists. Fix: align column types with ALTER TABLE ... ALTER COLUMN ... TYPE ... or add explicit casts that the index can satisfy.
Cardinality underestimation forcing nested loop on large inputs. If pg_stats is stale or the column has a non-uniform distribution, the optimizer sees a low row estimate and chooses a nested loop. Diagnostic signal: rows=50 estimated but rows=150000 actual on the outer node. Fix: ANALYZE the tables and raise default_statistics_target for skewed columns:
ALTER TABLE orders ALTER COLUMN customer_id
SET STATISTICS 500;
ANALYZE orders;
Sort step cost exceeding hash join cost. Forcing a merge join without verifying that indexes provide pre-sorted order can produce a plan with a large sort step that is slower than a hash join. Diagnostic signal: a Sort node appears between the merge join and its input, and Sort Method: external merge Disk confirms a spill. Fix: add a B-tree index on the join key or switch to allowing hash joins.
Nested loop inner table not in cache at scale. A nested loop that performs well in a development environment with a small dataset produces catastrophic random I/O in production when the inner table grows beyond shared_buffers. Diagnostic signal: Buffers: shared read is high per inner loop and actual time is much higher in production than in tests. Fix: add an index if missing, or allow the optimizer to choose a merge or hash join for large inputs by running ANALYZE.
ORM lazy loading simulating nested loops at the application layer. ORMs that issue one query per outer row replicate nested loop behaviour outside the database, where the optimizer cannot intervene. Diagnostic signal: pg_stat_statements shows hundreds of nearly identical single-row lookups per second. Fix: use explicit JOIN clauses or JOIN FETCH (JPA/Hibernate), select_related (Django), or includes (ActiveRecord) to move the join into the database.
Work_mem set globally instead of per-query. Raising work_mem globally to fix merge join sort spills can exhaust server memory under concurrent load, since each sort step in each connection gets the full allocation. Fix: use SET LOCAL work_mem = '...' in a transaction or target specific sessions, and monitor pg_stat_activity and OS memory usage before raising the global setting.
Frequently Asked Questions #
When should I prefer a nested loop over a merge join? #
Prefer a nested loop when the outer table is small after filtering and the inner table has a highly selective index on the join key. The algorithm is optimal for OLTP point-lookup patterns where the outer result set is small and every inner lookup hits a cached index page. If loops stays below a few hundred and shared read is zero on the inner node, the nested loop is performing well.
How do I prevent disk spills during a merge join sort step? #
Increase work_mem so the sort fits in memory, or add a B-tree index that provides the required sort order and eliminates the Sort node entirely. A composite index covering both the join predicate and any ORDER BY clause can remove the sort step completely, which is the more robust long-term fix.
Can I safely force a specific join algorithm in PostgreSQL? #
Use SET LOCAL enable_nestloop = off or SET LOCAL enable_mergejoin = off only for diagnostics or short-lived workarounds. Permanent overrides bypass the cost-based optimizer and will degrade performance as data distributions and table sizes evolve. Always investigate why the optimizer made the wrong choice — usually stale statistics or a missing index — and fix the root cause.
Why does PostgreSQL choose a nested loop even when both tables are large? #
Cardinality underestimation is the most common cause. When pg_stats is stale or default_statistics_target is too low, the optimizer sees artificially small row estimates and concludes a nested loop will be cheap. Running ANALYZE and raising statistics resolution for skewed columns usually shifts the plan to a merge or hash join.
Related #
- Hash Join Mechanics — when the optimizer builds a hash table instead of sorting or probing an index, and how to tune batch count and memory usage
- Sequential vs Index Scans — how scan type choice on each join input feeds into join algorithm selection
- Cost Estimation Models — the planner arithmetic behind join cost comparisons
- Sort and Hash Node Analysis — diagnosing sort spills and hash batch overflows that appear inside join plans
- Identifying Plan Bottlenecks — systematic approach to finding the hot node in any execution plan