Hash Join Mechanics: How PostgreSQL Builds and Probes Hash Tables #

A hash join is PostgreSQL’s dominant algorithm for equi-joining two large relations when neither input carries a useful index on the join key — it builds an in-memory hash table from the smaller input and probes it with every row of the larger input, delivering O(N+M) throughput instead of the O(N×M) worst case of a naive scan.

When the Optimizer Chooses a Hash Join #

PostgreSQL’s cost model selects a hash join when three conditions hold simultaneously: the join condition is an equality predicate (hash joins do not support range predicates), no highly selective index is available for an inner-relation lookup, and the estimated cardinality of at least one input is large enough that repeated index lookups — the strategy favored by a nested loop or merge join — would be more expensive than materializing a hash table.

The planner compares hash_join_cost against the alternatives it considers for that pair of relations. Its estimate of hash_join_cost depends directly on the build-side row count from cost estimation models — accurate statistics are therefore load-bearing for this decision. Stale statistics that undercount the build side can produce an unexpectedly cheap hash join estimate that later spills to disk.

The optimizer will also prefer a hash join when:

Understanding the decision boundary is central to execution plan fundamentals — when you see an unexpected join strategy, the first check is always the optimizer’s cardinality estimate, not the join algorithm itself.

Annotated EXPLAIN Node Breakdown #

Run EXPLAIN (ANALYZE, BUFFERS) to see every diagnostic field the hash join exposes:

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON o.customer_id = c.id
WHERE  o.created_at > '2024-01-01';

A representative plan output with annotations:

Hash Join  (cost=1820.00..9340.00 rows=48000 width=36)
           (actual time=23.4..210.8 rows=51200 loops=1)
  -- ① cost= is a unitless planner metric; actual time= is wall-clock ms
  -- ② rows estimate (48000) is close to actual (51200) — good statistics
  Hash Cond: (o.customer_id = c.id)
  Buffers: shared hit=6240 read=1120
  -- ③ 6240 blocks from cache, 1120 from disk — ratio is 85 % hit
  ->  Seq Scan on orders o  (cost=0.00..5800.00 rows=240000 width=24)
                             (actual time=0.1..88.3 rows=241600 loops=1)
        Filter: (created_at > '2024-01-01'::date)
        Rows Removed by Filter: 38400
  ->  Hash  (cost=820.00..820.00 rows=40000 width=20)
             (actual time=18.2..18.2 rows=40000 loops=1)
        Buckets: 65536  Batches: 1  Memory Usage: 2816kB
        -- ④ Batches: 1 → hash table fit entirely in work_mem — no spill
        -- ⑤ Memory Usage: 2816 kB — compare against work_mem setting
        -- ⑥ Buckets: 65536 — chosen by planner from estimated row count
        ->  Seq Scan on customers c  (cost=0.00..820.00 rows=40000 width=20)
                                      (actual time=0.1..9.4 rows=40000 loops=1)

Key fields to inspect in every hash join plan:

Field Where it appears What it tells you
Batches Hash child node 1 = fully in-memory; > 1 = spilled to disk
Memory Usage Hash child node Actual hash table size in kB; compare to work_mem
Buckets Hash child node Number of hash buckets; sized from estimated build rows
rows (estimated) Hash Join node Planner’s output row count prediction
rows (actual) Hash Join node True output rows; large divergence = stale statistics
Buffers: read Any node Blocks read from disk; high values indicate I/O pressure

Algorithm Internals: Build and Probe Phases #

The hash join lifecycle has two phases. Understanding both is necessary for diagnosing anomalies in EXPLAIN ANALYZE output.

Hash Join Build and Probe Phases Diagram showing the two-phase hash join: the build phase scans the smaller relation and inserts rows into a hash table; the probe phase scans the larger relation and looks up matches in the hash table. Phase 1 — Build Smaller Relation (build input) hash(key) Hash Function Hash Table bucket 0: [row…] bucket 1: [row…] bucket 2: [row…] bounded by work_mem Phase 2 — Probe Larger Relation (probe input) hash(key) Lookup Match in table? yes Emit joined row no match → skip next row Result Set

Build phase — the optimizer designates the smaller input as the build side. PostgreSQL scans it once, computes hash(join_key) for every row, and inserts each row into the in-memory hash table at the corresponding bucket. This scan is done before the probe side is touched, which is why the Hash child node in the plan completes entirely before the Hash Join parent begins reading the probe input.

Probe phase — PostgreSQL scans the larger input row by row. For each row it computes the same hash function on the join key, identifies the target bucket, and walks the bucket’s chain looking for rows whose key values match exactly. Matching pairs are emitted immediately — no buffering. Because only one scan of each input is required, the algorithm is linear in the combined input size: O(N + M).

Eligibility constraint — the join condition must be a strict equality predicate. Range predicates (e.g., a.val BETWEEN b.lo AND b.hi) cannot be resolved by bucket lookup and disqualify the hash join. In those cases the optimizer falls back to a nested loop or merge join.

Before/After Plan Comparison: Spill Eliminated #

-- Before: stale statistics caused a 15 000-row estimate on a 1.2 M-row build side
Hash  (cost=520.00..520.00 rows=15000 width=20)
      (actual time=2840.1..2840.1 rows=1200000 loops=1)
  Buckets: 16384  Batches: 16  Memory Usage: 10024kB
-- After: ANALYZE corrected the estimate; work_mem raised to 256 MB
Hash  (cost=9800.00..9800.00 rows=1180000 width=20)
      (actual time=290.4..290.4 rows=1200000 loops=1)
  Buckets: 2097152  Batches: 1  Memory Usage: 84736kB

The transition from Batches: 16 to Batches: 1 eliminated 15 round-trips to temporary files on disk and reduced actual time from ~4 200 ms to ~320 ms.

Memory, I/O, and Resource Behavior #

Hash joins are the most memory-sensitive join algorithm in PostgreSQL. The optimizer sizes the hash table from the estimated build-side row count and average row width. When that estimate is accurate and work_mem is sufficient, the entire hash table lives in memory and the join completes in two passes over the data.

When the hash table overflows work_mem — PostgreSQL switches to a multi-batch strategy called “hybrid hash join”:

  1. Both inputs are partitioned into batches using a secondary hash function applied to the join key.
  2. Batch 0 is processed in-memory immediately.
  3. All remaining batch partitions are written to temporary files.
  4. PostgreSQL loops: for each remaining batch, it reads the build partition into memory, then streams the matching probe partition through it.

Each extra batch pair requires two sequential I/O passes: one write and one read per side. With Batches: 8, every row on both sides touches disk twice (once written, once re-read), multiplying I/O by roughly 4× versus a single-pass in-memory join.

work_mem is per-operation, not per-connection. A query with three hash join nodes may allocate up to 3 × work_mem. Be conservative when raising work_mem globally; prefer SET LOCAL work_mem = '...' scoped to the session or transaction.

Checking temporary file writes:

-- Confirm temporary file usage for the current backend during a long-running query
SELECT pid, query, temp_files, temp_bytes
FROM   pg_stat_activity
JOIN   pg_stat_user_tables USING (relid)  -- illustrative only
WHERE  state = 'active';

-- Or inspect pg_stat_statements after the fact (requires the extension)
SELECT query, calls, temp_blks_written, total_exec_time / calls AS avg_ms
FROM   pg_stat_statements
WHERE  temp_blks_written > 0
ORDER  BY temp_blks_written DESC
LIMIT  10;

Step-by-Step Tuning Workflow #

Follow these steps in order when you see Batches > 1 or unexpectedly high actual time on a hash join node.

Step 1 — Capture a baseline plan with full metrics

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON o.customer_id = c.id
WHERE  o.created_at > '2024-01-01';
-- Save this output; you will compare it against every subsequent run.

Step 2 — Inspect the Hash node for spill signals

Look at the Hash child node. Record Batches, Memory Usage, and compare rows (estimated) to rows (actual). A Batches > 1 value confirms a spill. A rows estimate that is 10× or more below the actual count points to stale statistics as the root cause.

Step 3 — Check column-level statistics

-- Review statistics for the build-side join column
SELECT attname,
       n_distinct,
       correlation,
       most_common_vals,
       most_common_freqs
FROM   pg_stats
WHERE  tablename = 'customers'
  AND  attname   = 'id';

If n_distinct is wrong or the histogram is absent, statistics need refreshing.

Step 4 — Refresh statistics on both input tables

ANALYZE customers;
ANALYZE orders;
-- Then re-run Step 1 and compare the new Hash node estimate to actual rows.

Step 5 — If the estimate is now accurate but spill persists, raise work_mem

-- Scope the change to this session only — do not set globally without testing
SET LOCAL work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.customer_name
FROM   orders o
JOIN   customers c ON o.customer_id = c.id
WHERE  o.created_at > '2024-01-01';
-- Confirm Batches: 1 and reduced actual time.

Step 6 — If the build input is still too large, reduce it with a tighter predicate

Push a filter onto the build side before the join to shrink the hash table footprint:

-- Wrap the build side in a subquery or CTE with a selective WHERE clause
SELECT o.order_id, c.customer_name
FROM   orders o
JOIN   (
  SELECT id, customer_name
  FROM   customers
  WHERE  region = 'EU'        -- reduces build input from 40 000 to ~8 000 rows
) c ON o.customer_id = c.id
WHERE  o.created_at > '2024-01-01';

Step 7 — Verify improved buffer hit ratios

After tuning, compare Buffers: shared hit and Buffers: read before and after. A successful in-memory hash join will show read=0 on the Hash node in re-runs because the OS page cache has warmed — or, more precisely, because no temporary file I/O is required.

Common Pitfalls #

Hash table spills to disk with no statistics change

The work_mem setting may have been reduced globally since the query last ran well. Check SHOW work_mem; inside the session. Also check for concurrent queries on the same connection pool slot that may have exhausted shared memory.

Severe data skew in the join key

One batch contains disproportionately many rows, so Batches stays high even after increasing work_mem. Inspect most_common_vals and most_common_freqs in pg_stats for the join column. Pre-aggregate or filter skewed keys before the join, or restructure the query to avoid joining on a low-cardinality column against a high-cardinality table.

Stale cardinality estimates after bulk loads

After large INSERT, COPY, or DELETE operations, autovacuum may not have run ANALYZE yet. The actual rows in the plan will diverge sharply from rows (estimated). Run ANALYZE manually after bulk data changes, or lower autovacuum_analyze_threshold for high-churn tables.

Implicit type cast on the join key

When both columns do not share the same data type, PostgreSQL inserts a runtime cast in the join condition. This is visible in EXPLAIN VERBOSE output as (o.customer_id)::bigint = c.id. Casts can prevent index use on the probe side and may invalidate bucket lookups if the cast is not injective. Fix the schema to align types rather than relying on runtime coercion.

Wrong join side chosen as build input

The optimizer should pick the smaller relation as the build side, but stale statistics can invert this: a large table is materialized into the hash table while a small table is used as probe input, wasting memory and triggering an unnecessary spill. After running ANALYZE, confirm the Hash node is on the smaller input.

work_mem raised globally and swamps shared memory

Setting work_mem = '1GB' globally on a server with 100 concurrent connections can exhaust OS RAM if several connections each run multi-hash-join queries. Prefer SET LOCAL in application code or connection pooler configuration for individual heavy queries rather than a blanket global change.

Frequently Asked Questions #

Why does the optimizer prefer a hash join over a nested loop? #

The optimizer selects a hash join when joining large, unsorted datasets where building an in-memory hash table is cheaper than repeated index lookups. It typically triggers when estimated row counts on both sides are large, an equi-join condition is present, and no highly selective index is available on the inner relation’s join key. Reviewing the cost estimation model that drives this decision helps you predict when the switch will occur and how to steer it.

How do I prevent a hash join from spilling to disk? #

Keep table statistics current with ANALYZE so the optimizer allocates the right amount of memory from the start. Increase session-level work_mem for the affected query using SET LOCAL, filter rows aggressively before the join to shrink the build input, or partition large tables to reduce the hash table footprint per partition.

Can I force a hash join in production queries? #

Planner toggles (SET LOCAL enable_nestloop = off) should only be used for diagnostic testing — for example, to compare actual runtimes when you suspect the optimizer is choosing suboptimally. Hardcoding join methods bypasses the optimizer’s ability to adapt to changing data distributions. Reserve this approach for confirmed optimizer bugs or very stable, well-understood query patterns where the data volume and index structure are frozen.

What does Batches > 1 mean in an EXPLAIN output? #

Batches > 1 means the hash table exceeded work_mem and PostgreSQL partitioned both inputs into temporary files on disk. Each additional batch pair requires the corresponding rows to be written and re-read from disk, adding significant I/O overhead — a 5–10× slowdown compared to a fully in-memory hash join is common. Eliminating batches is one of the highest-leverage tuning moves available for join-heavy queries. See the when does the optimizer choose a hash join page for more on the conditions that lead to this outcome.


Up: Execution Plan Fundamentals