When work_mem Affects Batched Nested Loop Performance

Sometimes a query plan can look right at a high level, but the distributed execution details tell the real story.

In YugabyteDB, a query may use a YB Batched Nested Loop Join, which is often exactly what you want for repeated key lookups. But if the batches are not staying together efficiently, the query may issue more storage read requests than expected.

That can increase latency, raise CPU usage, and reduce throughput.

This pattern can show up even in workloads that were already running successfully. As data grows, tablet counts change, concurrency increases, or query shapes become more sophisticated, a previously healthy workload can become more sensitive to executor memory settings.

This tip shows how to recognize that pattern, how work_mem can influence batching behavior, and how to test a tuning change safely before applying it more broadly.

The Pattern

A query may use a plan shape like

				
					YB Batched Nested Loop Join
  -> Function Scan
  -> Index Scan using some_primary_key_or_index
       Index Cond: ROW(...) = ANY (...)
				
			

At first glance, that looks promising.

The plan is using a batched nested loop join, and the inner index lookup is using an ANY array of composite keys. That usually means YugabyteDB is trying to group multiple lookups together instead of issuing one lookup at a time.

But the most important clue may be lower in the EXPLAIN output.

For example, an efficient batched execution may show:

				
					Storage Table Read Requests: 1
				
			

A less efficient execution may show something like:

				
					Storage Table Read Requests: 99
				
			

That difference matters.

The query may still be using a batched nested loop join, but the batch may be getting flushed in smaller chunks. Instead of sending one efficient batched request to the storage layer, YugabyteDB may end up issuing many smaller reads.

That can increase latency, CPU usage, and overall system pressure.

Why Batching Efficiency Matters

Distributed databases care deeply about round trips.

A single batched storage request is usually much cheaper than dozens or hundreds of smaller requests.

Conceptually:

				
					Efficient batching
  ↓
Fewer storage requests
  ↓
Less RPC overhead
  ↓
Lower latency
  ↓
Lower CPU pressure
				
			

But when batching is less efficient:

				
					Smaller batches
  ↓
More storage requests
  ↓
More RPC overhead
  ↓
Higher latency
  ↓
Higher CPU pressure
				
			

This is why two plans that look similar at the top can behave very differently at runtime.

Where work_mem Comes In

work_mem is a YSQL setting inherited from PostgreSQL. It is a per-operation memory limit, not a fixed allocation per connection. A previous YugabyteDB Tip covered the broader sizing considerations in Tuning work_mem in YugabyteDB.

For this tip, the important point is how work_mem can affect batching behavior. For batched nested loop joins, available executor memory can influence how much work can be grouped together before batches are sent to the storage layer.

Note: Because work_mem is a per-operation limit, memory usage can add up quickly when many sessions are active or when a query has multiple memory-using operators.

A simplified way to think about it:

				
					More usable executor memory
  ↓
Larger lookup batches can stay together
  ↓
Fewer storage read requests
  ↓
Better performance
				
			

But:

				
					Too much work_mem
  ↓
Many concurrent sessions may consume more memory
  ↓
Possible memory pressure
  ↓
Potential OOM risk
				
			
So the right answer is not:
  • Set work_mem as high as possible.

The better approach is:

  • ● Test at the session level.
  • ● Compare the distributed plan details.
  • ● Then size the setting safely for concurrency.

What To Look For

Run the query with:

				
					EXPLAIN (ANALYZE, DIST)
<your_query>;
				
			

Then look for these signs:

Plan Detail What It Means
YB Batched Nested Loop Join YugabyteDB chose a batched lookup strategy.
ROW(...) = ANY (...) Multiple lookup keys are being grouped into a batched request.
Storage Table Read Requests: 1 The lookup batch was sent efficiently as a small number of storage requests.
Storage Table Read Requests: 99 The lookup batch may have been split into many smaller storage requests.
Similar plan shape, much slower runtime The issue may be execution-level batching behavior, not a completely different plan.

Example Symptom

Here is a simplified example of what you might see.

Less efficient execution:

				
					YB Batched Nested Loop Join
  Join Filter: (...)
  -> Function Scan on u
  -> Index Scan using source_attributes_store_pkey on source_attributes_store t
       Index Cond: ROW(...) = ANY (...)
       Storage Table Read Requests: 99
				
			

More efficient execution:

				
					YB Batched Nested Loop Join
  Join Filter: (...)
  -> Function Scan on u
  -> Index Scan using source_attributes_store_pkey on source_attributes_store t
       Index Cond: ROW(...) = ANY (...)
       Storage Table Read Requests: 1
				
			

The plan shape is similar. The difference is the number of storage table read requests.

That is the clue.

A Safe Session-Level Test

Before changing anything cluster-wide, test in a single session.

For example:

				
					SET work_mem = '32MB';

EXPLAIN (ANALYZE, DIST)
<your_query>;
				
			

Then compare the result to the original execution.

Look for:

  • ● Execution time
  • ● Storage Table Read Requests
  • ● Rows returned
  • ● Plan shape

You can test incrementally:

				
					SET work_mem = '16MB';
EXPLAIN (ANALYZE, DIST)
<your_query>;

SET work_mem = '32MB';
EXPLAIN (ANALYZE, DIST)
<your_query>;

SET work_mem = '64MB';
EXPLAIN (ANALYZE, DIST)
<your_query>;
				
			

For many workloads, moderate values such as 16MB, 32MB, or 64MB may be enough to restore efficient batching. For more advanced analytical or highly demanding workloads, higher values may be justified, but only after testing and memory sizing under realistic concurrency.

The goal is not to find the biggest value…

The goal is to find the smallest value that gives the workload the batching behavior and execution time it needs.

Why Not Just Set work_mem Very High?

Because work_mem can multiply quickly.

A rough safety model is:

				
					active physical connections
  × possible memory-using plan nodes per query
  × work_mem
				
			

For example:

				
					480 active physical connections × 32MB
				
			

is very different from:

				
					480 active physical connections × 256MB
				
			

And the real risk can be higher because a single query may use multiple memory-consuming operators over its lifetime. Sorts, hashes, aggregations, joins, and batched lookup operations can each contribute to memory usage.

This is why work_mem tuning is one of the more advanced areas of PostgreSQL and YSQL performance work. As query complexity and performance expectations increase, work_mem often becomes an important part of tuning.

Tip: Treat work_mem as a concurrency-sensitive setting. A value that is safe for one session may not be safe for hundreds of active physical connections. Larger values can be appropriate for some workloads, but they should be justified by testing and sized against available memory.

Applying the Change More Broadly

After session-level testing, you may decide to apply a new value more broadly.

In YugabyteDB, work_mem can be configured through YSQL PostgreSQL configuration settings.

For example, using ysql_pg_conf_csv:

				
					--ysql_pg_conf_csv=work_mem=32MB
				
			

In YugabyteDB Anywhere, apply the change through the appropriate universe configuration workflow.

Be sure to consider:

  • ● Node memory
  • ● TServer memory allocation
  • ● YSQL/Postgres memory allowance
  • ● Active physical connections
  • ● Connection Manager settings
  • ● Query complexity
  • ● Expected concurrency

If you are using Connection Manager, remember that logical application connections and physical PostgreSQL connections are not always the same thing. Size memory against the physical connections that can actively run work.

Helpful Follow-Up Diagnostics

If the issue is intermittent or workload-dependent, sampled plan logging can help.

One option is auto_explain.

Example:

				
					CREATE EXTENSION IF NOT EXISTS auto_explain;
				
			

Example settings:

				
					auto_explain.log_min_duration = '150ms'
auto_explain.sample_rate = 0.1
auto_explain.log_analyze = true
auto_explain.log_dist = true
				
			

This can help capture distributed execution details for a sample of slower queries.

You can also use duration-based sampling settings such as:

				
					log_min_duration_sample = '1000ms'
log_statement_sample_rate = 0.1
				
			

Use caution with logging volume. Start with sampling, especially on busy systems.

Troubleshooting Checklist

Step Action Why It Helps
1 Run EXPLAIN (ANALYZE, DIST). Shows distributed execution details.
2 Look for YB Batched Nested Loop Join. Confirms the query is using a batched lookup strategy.
3 Check Storage Table Read Requests. Reveals whether batching is efficient or fragmented.
4 Test a moderate session-level work_mem increase. Validates whether memory is limiting batching.
5 Compare execution time and storage reads. Confirms whether the change actually helped.
6 Calculate safe cluster-wide memory impact. Avoids creating memory pressure under concurrency.
7 Use sampled logging or auto_explain. Helps catch other slow query patterns safely.

What About Tablet Count?

Tablet count can also matter.

As tables grow, split, or become more distributed, a batched lookup may need to coordinate work across more tablets. That does not mean tablet splitting is bad. Tablet splitting is an important part of scaling distributed workloads.

But it does mean that queries with large batched lookups may become more sensitive to executor memory and request batching behavior as the physical distribution of data changes.

In other words, tablet growth, data growth, higher concurrency, or more complex query patterns may not be the root cause by themselves. They can be the trigger that exposes a workload’s sensitivity to work_mem.

That is another reason to focus on the distributed execution details instead of only the high-level plan shape.

Final Takeaway

A YB Batched Nested Loop Join is usually a good sign, but the join name alone does not tell the whole story.

Always check the distributed execution details.

If Storage Table Read Requests is much higher than expected, the query may be flushing batches too frequently. In that case, a carefully tested work_mem increase can help YugabyteDB keep lookup batches together, reduce storage round trips, and improve latency.

This can happen even with pre-existing workloads that were previously running well. As the workload evolves, the sensitive balance between batching, executor memory, tablet distribution, and concurrency can change.

Tune gradually, validate with EXPLAIN (ANALYZE, DIST), and size the setting for real concurrency.

Have Fun!

I love every spring when the baby robins return to our evergreen trees!