Sort and Hash Node Analysis: Diagnosing Memory, Spills, and Execution Cost #
Sort and hash nodes are among the most reliable early-warning signals in a PostgreSQL execution plan — when they spill to disk, query latency can jump by an order of magnitude with no schema change required to trigger the regression.
When the Optimizer Chooses Sort and Hash Operations #
The planner introduces sort and hash operations based on the query’s semantic requirements and cost-model estimates. Understanding which conditions trigger each operator lets you intercept problems before they reach production.
For sort operations, the planner adds a Sort node whenever a query requires ordered output and no index already provides that ordering. Common triggers include explicit ORDER BY, GROUP BY without a supporting index, window function PARTITION BY / ORDER BY clauses, and merge join inputs when neither side arrives pre-sorted. The merge join vs nested loop article covers how pre-sorted inputs affect join algorithm selection.
For hash operations, the planner builds in-memory hash tables for hash join mechanics, HashAggregate, and hash-based DISTINCT. The cost model estimates whether the hash table fits in work_mem using pg_statistic row count estimates. When those estimates are accurate, the plan allocates sufficient memory. When they diverge from actuals — a condition you can detect during identifying plan bottlenecks — the operator spills to disk at runtime.
Both operators are eligible for parallelisation. In a parallel query execution plan, each worker runs its own sort or hash operation over its partition of rows, then a Gather Merge or Gather node at the top consolidates results.
Annotated EXPLAIN Node Breakdown #
Run EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) to expose the full set of runtime fields. The example below shows both a sort spill and a hash spill in the same plan:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT customer_id, SUM(order_total)
FROM orders
WHERE created_at > '2023-01-01'
GROUP BY customer_id
ORDER BY SUM(order_total) DESC;
Plan output (annotated):
Sort (cost=18420.35..18520.35 rows=40000 width=40)
(actual time=1823.4..2011.2 rows=40000 loops=1)
Sort Key: (sum(order_total)) DESC
Sort Method: external merge Disk: 42MB -- ← spill confirmed; need ≥42 MB more work_mem
Buffers: temp read=5376 written=5376 -- ← temp file I/O blocks
-> HashAggregate (cost=12000.00..14000.00 rows=40000 width=40)
(actual time=980.1..1240.5 rows=40000 loops=1)
Group Key: customer_id
Batches: 4 Memory Usage: 8192kB -- ← Batches>1 means hash agg also spilled
Buffers: shared hit=3200 read=800
-> Seq Scan on orders (cost=0.00..8000.00 rows=320000 width=16)
(actual time=0.1..310.2 rows=320000 loops=1)
Filter: (created_at > '2023-01-01'::date)
Rows Removed by Filter: 80000
Buffers: shared hit=3200 read=800
Fields to read in order:
| Field | Location | What it tells you |
|---|---|---|
Sort Method |
Sort node | quicksort = in-memory; top-N heapsort = LIMIT path; external merge = disk spill |
Disk: XkB |
Sort node | Bytes written to temp storage — minimum additional work_mem needed |
Batches: N |
Hash / HashAggregate node | 1 = single in-memory pass; >1 = partitioned spill, N×2 I/O cost |
Memory Usage: XkB |
Hash node | Actual hash table size; if near work_mem, next size increase triggers batching |
actual rows vs rows |
Any node | Large divergence explains why memory grant was under-sized |
temp read / written |
Buffers line | Confirms physical temp file activity |
Algorithm Internals #
How Sort Nodes Execute #
PostgreSQL’s sort executor selects among three algorithms at runtime:
- quicksort: The full input fits in
work_mem. All rows are loaded into memory and sorted with an in-place algorithm. No disk I/O; O(N log N) CPU cost. - top-N heapsort: Triggered when the query has a
LIMITclause. A fixed-size min-heap tracks only the top N rows, so memory use is proportional to the limit size, not the total row count. - external merge: The input exceeds
work_mem. Rows are sorted in memory in chunks (sorted runs), each run is written to a temporary file, then all runs are merged in a final pass. This is theDisk: XkBindicator.
The transition from quicksort to external merge is not gradual — it is triggered the moment a run would overflow work_mem. A single large row batch can push an otherwise in-memory sort into spill territory.
How Hash Nodes Execute #
Hash aggregation and hash joins share the same build/probe pattern:
- Build phase: The smaller input is scanned and each row is inserted into an in-memory hash table keyed on the join or group column. Memory consumption is tracked per
work_mem. - Probe phase: The larger input is scanned row by row. Each probe key is hashed and looked up in the table. Matching rows are emitted immediately, keeping streaming output latency low.
- Batching (spill): When the hash table grows beyond
work_mem, PostgreSQL repartitions the build input into N files on disk. The probe input is also partitioned to match. Each partition pair is then processed as a separate join or aggregation.Batches: 4means four complete passes over temporary files, quadrupling I/O compared toBatches: 1.
Before/after: adding work_mem resolves the spill
-- Before (work_mem = 4MB, default)
HashAggregate Batches: 4 Memory Usage: 4096kB
-- After (SET LOCAL work_mem = '64MB')
HashAggregate Batches: 1 Memory Usage: 38912kB
The batch count drops to 1 and memory usage reflects the true hash table size, eliminating all temporary file I/O.
Type Mismatches and Hash Distribution #
When join columns carry different data types — for example bigint on one side and integer on the other — the planner inserts an implicit cast into the hash condition. This forces per-row type conversion, increases CPU cost, and can reduce hash distribution quality. The cast appears in EXPLAIN VERBOSE output:
-- Problem: u.id is bigint, o.user_id is integer
SELECT * FROM users u
JOIN orders o ON u.id = o.user_id;
-- EXPLAIN VERBOSE shows: Hash Cond: (o.user_id::bigint = u.id)
The schema fix eliminates the cast entirely:
ALTER TABLE orders ALTER COLUMN user_id TYPE bigint;
Casting in the query (o.user_id::bigint) is a workaround, not a fix. It preserves the per-row cost and keeps the implicit conversion visible in the plan.
Memory, I/O, and Resource Behaviour #
work_mem is the single most impactful parameter for sort and hash performance. Crucially, work_mem is allocated per sort or hash operation per connection, not globally. A query with two sort nodes and one hash join can hold up to 3 × work_mem simultaneously. In a high-concurrency environment, global increases to work_mem can exhaust available RAM when many sessions execute complex queries concurrently.
Sizing strategy:
- Read the
Disk: XkBvalue from the Sort node or calculate required memory fromBatchesandMemory Usagein the Hash node. - The target
work_memfor a sort spill is approximately the spill disk size rounded up to the next power of two. - For hash spills with
Batches: N, the in-memory hash table isMemory Usage × Ntotal data — setwork_memto that value plus a 20% buffer. - Apply the increase with
SET LOCAL work_mem = 'XMB'at the session or transaction level first. Only promote toALTER ROLEorpostgresql.confafter validating the impact.
Temporary file activity appears in pg_stat_database under temp_files and temp_bytes. A non-zero and growing temp_bytes value in that view is a reliable signal that sort or hash spills are occurring in production workloads.
-- Check temp file accumulation since last stats reset
SELECT datname, temp_files, temp_bytes,
pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = current_database();
Step-by-Step Tuning Workflow #
-
Capture the baseline plan with full runtime detail:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) <your query here>; -
Locate Sort and Hash nodes in the output. Scan for
Sort Method:,Batches:, andDisk:fields. -
Confirm cardinality accuracy — compare
rows(estimate) toactual rows. A 5× or greater divergence on a Hash build input explains most spills caused by under-allocation:-- Check statistics currency for a table SELECT attname, n_distinct, correlation, most_common_vals, most_common_freqs FROM pg_stats WHERE tablename = 'orders' AND attname IN ('customer_id', 'created_at'); -
Update statistics if stale, then re-check the plan:
ANALYZE orders; -
Test a higher
work_memlocally to measure impact before changing configuration:BEGIN; SET LOCAL work_mem = '64MB'; EXPLAIN (ANALYZE, BUFFERS) <your query>; ROLLBACK;Confirm
Sort Methodchanges toquicksortorBatchesdrops to1. -
Evaluate index elimination — if the sort node is on a column sequence that could be indexed, create a covering index and re-run:
CREATE INDEX idx_orders_sort_cover ON orders (created_at, customer_id) INCLUDE (order_total);With this index, many query shapes read rows in
created_atorder directly, replacingSeq Scan → SortwithIndex Scan Backward, eliminating the sort node entirely. -
Apply the fix at the appropriate scope:
SET LOCALfor a single transaction,SETfor the session,ALTER ROLE name SET work_mem = 'XMB'for a specific application user, orpostgresql.conffor a server-wide default. For filter pushdown mechanics and predicate placement that can reduce rows before the sort/hash phase, verify thatWHEREpredicates apply as early in the plan as possible. -
Monitor post-deployment by tracking
temp_bytesinpg_stat_databaseand queryingpg_stat_statementsto confirm the target query’s total execution time has decreased:SELECT query, calls, total_exec_time, mean_exec_time, temp_blks_written FROM pg_stat_statements WHERE query ILIKE '%orders%' ORDER BY temp_blks_written DESC LIMIT 10;
Common Pitfalls #
Misreading “external merge” as a minor detail. The phrase appears inline in the plan text but represents a 10–100× latency increase compared to an in-memory sort. Always act on it.
Ignoring Batches on HashAggregate nodes. Batches appears on hash aggregate nodes in addition to hash join nodes. A plan can show Batches: 1 on the Hash join child while Batches: 4 on the HashAggregate above — both represent independent spills, each requiring its own work_mem share.
Raising work_mem globally without load testing. A server with 200 concurrent connections and work_mem = 64MB can allocate up to 12.8 GB on sort-heavy workloads before hitting OOM. Test with realistic concurrency before changing postgresql.conf.
Eliminating indexes based on sort column alone. An index on (created_at) eliminates the sort on ORDER BY created_at but does not cover a query that selects order_total. The scan must still fetch the heap unless INCLUDE (order_total) is added. Verify with EXPLAIN ANALYZE that Heap Fetches drops after index creation — see covering index design for the full pattern.
Using query-level casts as a substitute for schema fixes. Adding ::bigint to the join predicate in a query suppresses the visible cast in Hash Cond but still executes the conversion at runtime. Fix mismatched column types in the schema with ALTER TABLE.
Overlooking the interaction with parallel workers. In a parallel plan, each worker allocates its own work_mem for sort and hash operations. A Parallel Hash node with four workers can use 4 × work_mem simultaneously. Confirm with EXPLAIN ANALYZE that Workers Launched matches Workers Planned and that actual time across workers is balanced — large variance signals data skew rather than a memory problem.
Frequently Asked Questions #
How do I confirm a sort node is spilling to disk?
Look for Sort Method: external merge Disk: XkB in EXPLAIN ANALYZE output. Any non-zero disk value confirms a spill has occurred. The reported disk size is the amount of data written — use it as a lower bound for the additional work_mem needed to keep the sort in memory.
How do I confirm a hash node is spilling to disk?
Check the Hash or HashAggregate child node for Batches: N where N is greater than 1. Each doubling of batch count roughly doubles the temporary file I/O, because each partition must be written to disk and re-read during the probe pass. Memory Usage shows the in-memory fraction; multiply by Batches to estimate total data volume.
When should I prefer a merge join over a hash join?
Merge joins are preferable when both inputs arrive pre-sorted from index scans, because no in-memory hash table is required and sort cost is eliminated. Hash joins excel when joining large, unsorted datasets and memory is sufficient for the build side. Validate with EXPLAIN (ANALYZE, BUFFERS) comparing actual I/O block counts and execution time across both plan shapes — use SET LOCAL enable_hashjoin = off to force the merge join path for comparison.
Can I force the optimizer away from a hash or sort algorithm for diagnostic purposes?
Yes: SET LOCAL enable_hashjoin = off, SET LOCAL enable_mergejoin = off, or SET LOCAL enable_sort = off redirect the planner to alternative paths within that transaction. Use these only to compare plan shapes — overriding the planner bypasses cost-based decisions and must never be left in place in production configuration.
Related #
- Debugging Unexpected Sort Operations — isolate implicit sorts introduced by
DISTINCT, window functions, and merge join prerequisites - Identifying Plan Bottlenecks — systematic workflow for locating the highest-cost node in any plan
- Parallel Query Execution — how parallel workers distribute hash and sort workloads and how to diagnose worker-level skew
- Filter Pushdown Mechanics — reduce row counts before sort and hash phases by controlling predicate placement
- Covering Index Design — eliminate sort nodes entirely by aligning index column order with ORDER BY sequences