When Does the Optimizer Choose a Hash Join? #
The query optimizer picks a join algorithm for every pair of relations it combines. The choice between a hash join, a nested loop, and a merge join is made by the cost model on each specific query — not by a fixed rule. This page isolates the cost-model thresholds, predicate eligibility constraints, and memory conditions that lead PostgreSQL to select a hash join. For the full build/probe lifecycle and spill mechanics, see Hash Join Mechanics.
The Cost-Model Condition That Triggers a Hash Join #
The optimizer selects a hash join when the estimated cost of building an in-memory hash table and probing it is lower than the alternatives:
Hash join vs nested loop. Building the hash table is O(N) work; probing it is O(1) per probe row. A nested loop without an inner index is O(N × M). Once the inner table is large enough that repeated random lookups exceed the hash table build cost, the planner switches to a hash join. Even when an inner index exists, a large filtered result set often makes index traversal more expensive than a single sequential scan and hash build — this is why the optimizer can choose a hash join even when indexes are present on both sides of the join key.
Hash join vs merge join. A merge join requires both inputs in sorted order. If neither input is already sorted on the join key, the planner must add an explicit sort step. When the cost of that sort plus the merge exceeds the cost of a hash build and probe, the optimizer chooses the hash join instead.
The cost model also depends on accurate statistics. When the planner underestimates row counts — due to stale pg_stats histograms or a data skew not captured by the default 100-bucket histogram — it may choose a nested loop for a join that would benefit from hashing, or vice versa.
Hard eligibility constraints. Two conditions must hold before the planner will even consider a hash join:
- The join condition must be an equality predicate (
=). Hash joins cannot evaluate range conditions such as<orBETWEEN. - Both sides of the join key must have compatible data types. An implicit cast in the
Hash Condprevents a hash join from being chosen.
Annotated EXPLAIN Evidence #
The plan below shows the typical fields to inspect when diagnosing a hash join selection:
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2023-01-01';
Hash Join (cost=1250.00..8900.00 rows=45000 width=128)
(actual time=45.2..120.5 rows=44890 loops=1)
Hash Cond: (o.customer_id = c.id) -- equality predicate: hash join eligible
-> Seq Scan on orders o (cost=0.00..1100.00 rows=45000 width=64)
(actual time=1.2..45.0 rows=44890 loops=1)
Filter: (created_at > '2023-01-01'::date)
-> Hash (cost=100.00..100.00 rows=5000 width=64)
(actual time=15.3..15.3 rows=5000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 312kB
-- ^^^^ ^^^^^^
-- no spill fits in work_mem
-> Seq Scan on customers c (cost=0.00..100.00 rows=5000 width=64)
(actual time=0.1..10.2 rows=5000 loops=1)
Key fields:
Hash Cond— must be an=predicate; any implicit cast here (e.g.o.customer_id::bigint = c.id) can block hash join selection.Batches: 1— the entire hash table fit inwork_mem. No disk spill occurred.Memory Usage: 312kB— actual hash table size; compare against yourwork_memsetting.rows=5000on theHashnode vsrows=45000on the probe side — the planner correctly chose the smaller relation as the build side.actual rowsvsrows— they match closely here, confirming statistics are accurate.
The SVG below illustrates how the build and probe phases map onto the plan tree:
Step-by-Step Resolution Workflow #
Use these steps whenever you see an unexpected hash join, a hash join that is spilling to disk, or a query where you suspect the optimizer chose the wrong join algorithm.
1. Confirm the plan and baseline cost.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2023-01-01';
Record actual time, Batches, and Memory Usage from the Hash node.
2. Refresh statistics on both tables.
Stale statistics are the most common cause of a wrong join algorithm selection. The planner’s cost estimates feed directly from pg_stats histograms.
ANALYZE orders;
ANALYZE customers;
Re-run EXPLAIN (ANALYZE, BUFFERS) and check whether estimated rows now matches actual rows.
3. Check for implicit casts in the Hash Cond.
Run the plan with EXPLAIN (ANALYZE, VERBOSE) and inspect the Hash Cond line. Any cast expression (e.g. (o.customer_id)::bigint = c.id) blocks the hash join. Fix the schema so the types match without casting.
4. If Batches > 1, increase work_mem and re-test.
SET LOCAL work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2023-01-01';
Watch Batches and Memory Usage in the Hash node. If Batches drops to 1, the spill is resolved.
5. Verify the build side is the smaller relation.
The Hash node should wrap the smaller relation. If statistics are wrong and the planner builds from the larger side, refreshing statistics (step 2) corrects this. You can also query pg_stats to inspect the planner’s view:
SELECT tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename IN ('orders', 'customers')
AND attname IN ('customer_id', 'id');
6. Compare alternatives by disabling hash join.
SET LOCAL enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2023-01-01';
Compare total actual time between both plans. If the hash join is still faster, the original plan was correct and the issue is elsewhere (statistics or memory pressure). If the alternative is faster, trace back to the root cause through statistics accuracy, data skew, or predicate eligibility.
Before/After Plan Comparison #
The most common fix — increasing work_mem to eliminate a spill — produces a measurable change in the Hash node output:
-- Before (spilling to disk):
Hash (actual time=95.0..95.0 rows=120000 loops=1)
Buckets: 65536 Batches: 4 Memory Usage: 4096kB
-- After (SET LOCAL work_mem = '64MB'):
Hash (actual time=38.2..38.2 rows=120000 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 18432kB
Batches dropped from 4 to 1, bucket count doubled (more granular distribution), and build time fell from 95 ms to 38 ms. The actual time on the Hash Join node will show a proportionally larger improvement because the probe phase no longer needs to re-read partitioned temp files.
Common Pitfalls #
Assuming indexes on the join key always prevent hash join selection. The planner evaluates the full cost of each path. A large, unfiltered join with index access still incurs per-row random I/O; a single sequential scan plus a hash build is often cheaper. Seeing a hash join despite indexes present is normal — investigate whether the index would actually be used selectively.
Attributing slow performance to hash join overhead when spill is the real cause. A hash join with Batches: 1 is typically very fast. When Batches > 1, the bottleneck is disk I/O from temp file writes, not the hash algorithm itself. Check Batches before tuning anything else.
Setting work_mem globally to fix one query. Increasing work_mem system-wide multiplies the memory allocation across all concurrent sessions. For heavy workloads, use SET LOCAL work_mem within a specific transaction, or set it at the role level for dedicated reporting connections.
Implicit type casts silently breaking hash join eligibility. If a join column was added later with a different type (e.g. integer vs bigint), the implicit cast appears in EXPLAIN VERBOSE output but is easy to overlook in a normal EXPLAIN. Always check Hash Cond in verbose mode when debugging unexpected join algorithm selection.
Frequently Asked Questions #
Why does the optimizer choose a hash join even when indexes exist on both join columns? #
Indexes do not guarantee faster joins. When the filtered result set is large — typically above 10–20% of the table — sequential scans combined with a hash join outperform thousands of random index lookups. The planner compares the full cost of each strategy, including per-row index I/O overhead, and selects the lowest-cost path.
How can I tell if a hash join is spilling to disk? #
Check EXPLAIN (ANALYZE) output for Batches: 2 or higher in the Hash node, alongside a Memory Usage value close to your work_mem setting. Each additional batch level means an additional disk pass and roughly doubles the work performed. You can also detect spill across multiple queries by checking temp_blks_written in pg_stat_statements for the relevant query.
Can I force the optimizer to avoid hash joins? #
Yes: SET LOCAL enable_hashjoin = off disables hash joins for the current transaction. Use this only for diagnostic comparison — it removes a highly effective join path from the planner’s cost model and is not a production solution. If a hash join is consistently suboptimal, the root cause is almost always stale statistics, a data type mismatch, or memory pressure from spill.
Related #
- Hash Join Mechanics — full build/probe lifecycle, spill behavior, and
work_memsizing - Merge Join vs Nested Loop — when the optimizer chooses the alternatives
- Cost Estimation Models — how the planner assigns unitless cost units to each plan node
- Execution Plan Fundamentals — the operator tree, join algorithm taxonomy, and plan-reading foundations