Force YugabyteDB to Honor Your SQL Join Order

Modern SQL optimizers are incredibly smart. Most of the time, they’re far better at choosing an efficient execution plan than any human.

Most of the time.

But if you’ve been around long enough, especially if you’ve spent time as an Oracle developer or DBA, you know there are moments when you want the optimizer to stand down and simply do exactly what you wrote.

Back in my Oracle days, I leaned heavily on the optimizer hint:

    /*+ ORDERED */

When I used it, I wasn’t guessing… I was taking control. I knew the data, the skew, the cardinality quirks, the edge cases the optimizer couldn’t always see. And sometimes, forcing join order was the difference between a query that finished in milliseconds and one that went completely off the rails.

🧠 The same problem still exists… even in distributed SQL

YugabyteDB’s YSQL layer inherits the PostgreSQL query planner, which aggressively reorders joins to find what it believes is the cheapest plan.

That’s usually a good thing.

But in a distributed SQL database, join order can have much bigger consequences than it ever did on a single-node system:

  • Remote vs local reads
  • Network round trips
  • Data fan-out across tablets
  • Early vs late filtering of large fact tables

In other words:

  • Join order isn’t just a planning detail anymore. It can materially affect latency and cost.

And sometimes… you already know the right order.

🛠️ How YugabyteDB lets you take control (without hints)

YugabyteDB doesn’t support SQL join hints like Oracle’s /*+ ORDERED */. But it does expose a PostgreSQL mechanism that gives you essentially the same power, safely scoped to a single transaction.

By temporarily disabling join reordering, you can force YugabyteDB to honor the join order exactly as written in your SQL.

Conceptually:

  • “Optimizer – don’t reorder anything. Execute these joins in the order I specified.”

And critically:

  • This applies only to the current transaction
  • It does not permanently affect the session
  • It’s ideal for debugging, benchmarking, and controlled experiments
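The mechanism behind this is the PostgreSQL planner setting join_collapse_limit. A minimal sketch of the transaction-scoped pattern (the table names here are placeholders, not part of the demo schema):

```sql
-- Sketch: pin the join order for one query only.
BEGIN;
SET LOCAL join_collapse_limit = 1;    -- keep explicit JOINs in the order written
-- SET LOCAL from_collapse_limit = 1; -- also relevant if you rely on FROM-list subqueries
SELECT count(*)
FROM a
JOIN b ON b.a_id = a.id
JOIN c ON c.b_id = b.b_id;            -- planned as (a join b) join c
COMMIT;                               -- the setting reverts here (or on ROLLBACK)
```

SET LOCAL is what keeps this safe: unlike a plain SET, it cannot leak past the end of the transaction.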

🧪 When would you actually do this?

You might reach for this technique when:

  • You’re reproducing a problematic plan
  • You’re benchmarking two join strategies
  • You know the data distribution better than the stats
  • Cardinality estimates are off due to skew
  • You want deterministic behavior while tuning

Or, to put it the old Oracle way:

  • Sometimes you know the data better than the optimizer.

🧭 What this tip will show

In the sections that follow, we’ll:

  1. Spin up a 3-node YugabyteDB cluster using yugabyted
  2. Create a schema where join order really matters
  3. Run a query where the optimizer reorders joins
  4. Force YugabyteDB to honor the SQL join order
  5. Compare execution plans side-by-side

You’ll see exactly how this behaves, and why it’s a powerful tool when used deliberately.

⚠️ A word of caution

Just like /*+ ORDERED */ in Oracle, this is a sharp tool.

It’s not something you sprinkle everywhere… but when you need it, it’s invaluable.

Use it:

  • locally
  • intentionally
  • and with a clear understanding of why you’re doing it

🚀 Demo: Forcing join order on a 3-node YugabyteDB cluster

Prerequisites

  • YugabyteDB installed locally
  • yugabyted and ysqlsh in your PATH
  • Enough resources to run three local nodes

Step 1️⃣: Start a 3-node YugabyteDB cluster

Node 1

    yugabyted start \
      --base_dir=/tmp/yb-node1 \
      --listen=127.0.0.1

Node 2

    yugabyted start \
      --base_dir=/tmp/yb-node2 \
      --listen=127.0.0.2 \
      --join=127.0.0.1

Node 3

    yugabyted start \
      --base_dir=/tmp/yb-node3 \
      --listen=127.0.0.3 \
      --join=127.0.0.1

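Before creating the schema, it’s worth confirming all three nodes actually joined the cluster. One way (assuming a recent yugabyted build with the status subcommand) is:

```shell
# Check each node; each should report itself as running in the same cluster.
yugabyted status --base_dir=/tmp/yb-node1
yugabyted status --base_dir=/tmp/yb-node2
yugabyted status --base_dir=/tmp/yb-node3
```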
Step 2️⃣: Create a schema where join order matters

We’ll use:

  • a large fact table (events)
  • a medium dimension (users)
  • a tiny lookup (countries)
  • selective predicates that encourage the optimizer to reorder joins

    DROP DATABASE IF EXISTS join_order_demo;
    CREATE DATABASE join_order_demo;
    \c join_order_demo

    CREATE TABLE events (
      event_id     BIGSERIAL PRIMARY KEY,
      user_id      BIGINT NOT NULL,
      event_type   TEXT NOT NULL,
      created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    CREATE TABLE users (
      user_id     BIGINT PRIMARY KEY,
      country_id  INT NOT NULL,
      is_active   BOOLEAN NOT NULL
    );

    CREATE TABLE countries (
      country_id  INT PRIMARY KEY,
      code        TEXT NOT NULL UNIQUE
    );

    CREATE INDEX events_user_id_idx ON events(user_id);
    CREATE INDEX events_type_idx    ON events(event_type);
    CREATE INDEX users_active_idx   ON users(is_active);
    CREATE INDEX users_country_idx  ON users(country_id);
    CREATE INDEX countries_code_idx ON countries(code);
Step 3️⃣: Load skewed data

    INSERT INTO countries(country_id, code)
    VALUES (1,'US'), (2,'CA'), (3,'GB'), (4,'OTHER');

    INSERT INTO users(user_id, country_id, is_active)
    SELECT i,
           CASE WHEN i % 100 = 0 THEN 3 ELSE 1 END,  -- ~1% GB
           (i % 10 = 0)                              -- ~10% active
    FROM generate_series(1,200000) i;

    INSERT INTO events(user_id, event_type, created_at)
    SELECT (random()*199999)::int + 1,
           CASE WHEN random() < 0.02 THEN 'purchase' ELSE 'view' END,
           now() - (random() * interval '30 days')
    FROM generate_series(1,2000000);

Analyze:

    ANALYZE;
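With the data loaded and analyzed, a quick sanity check confirms the skew the demo relies on (the events count is approximate because the load used random()):

```sql
SELECT count(*) FROM users WHERE is_active;                 -- 20000: every 10th user
SELECT count(*)
FROM users u JOIN countries c ON c.country_id = u.country_id
WHERE c.code = 'GB' AND u.is_active;                        -- 2000: every 100th user is GB,
                                                            -- and all of those are active
SELECT count(*) FROM events WHERE event_type = 'purchase';  -- roughly 40000 (~2%)
```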
Step 4️⃣: Run the query (optimizer free to reorder joins)

    EXPLAIN (ANALYZE, DIST)
    SELECT count(*)
    FROM events e
    JOIN users u     ON u.user_id = e.user_id
    JOIN countries c ON c.country_id = u.country_id
    WHERE c.code = 'GB'
      AND u.is_active = true
      AND e.event_type = 'purchase';

Example:

join_order_demo=# EXPLAIN (ANALYZE, DIST)
SELECT count(*)
FROM events e
JOIN users u     ON u.user_id = e.user_id
JOIN countries c ON c.country_id = u.country_id
WHERE c.code = 'GB'
  AND u.is_active = true
  AND e.event_type = 'purchase';
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15364.12..15364.13 rows=1 width=8) (actual time=327.204..327.208 rows=1 loops=1)
   ->  Hash Join  (cost=5214.21..15361.80 rows=927 width=0) (actual time=102.817..327.090 rows=407 loops=1)
         Hash Cond: (e.user_id = u.user_id)
         ->  Index Scan using events_type_idx on events e  (cost=72.81..10067.39 rows=38333 width=8) (actual time=7.096..221.129 rows=40143 loops=1)
               Index Cond: (event_type = 'purchase'::text)
               Storage Table Read Requests: 40
               Storage Table Read Execution Time: 185.631 ms
               Storage Table Read Ops: 40
               Storage Table Rows Scanned: 40143
               Storage Index Read Requests: 40
               Storage Index Read Execution Time: 4.315 ms
               Storage Index Read Ops: 40
               Storage Index Rows Scanned: 40143
         ->  Hash  (cost=5080.93..5080.93 rows=4837 width=8) (actual time=95.654..95.657 rows=2000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 143kB
               ->  Hash Join  (cost=110.16..5080.93 rows=4837 width=8) (actual time=6.813..94.909 rows=2000 loops=1)
                     Hash Cond: (u.country_id = c.country_id)
                     ->  Index Scan using users_active_idx on users u  (cost=68.91..4935.08 rows=19347 width=12) (actual time=5.889..90.038 rows=20000 loops=1)
                           Index Cond: (is_active = true)
                           Storage Table Read Requests: 20
                           Storage Table Read Execution Time: 72.024 ms
                           Storage Table Read Ops: 20
                           Storage Table Rows Scanned: 20000
                           Storage Index Read Requests: 20
                           Storage Index Read Execution Time: 1.622 ms
                           Storage Index Read Ops: 20
                           Storage Index Rows Scanned: 20000
                     ->  Hash  (cost=41.24..41.24 rows=1 width=4) (actual time=0.895..0.896 rows=1 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on countries c  (cost=20.00..41.24 rows=1 width=4) (actual time=0.877..0.880 rows=1 loops=1)
                                 Storage Filter: (code = 'GB'::text)
                                 Storage Table Read Requests: 1
                                 Storage Table Read Execution Time: 0.762 ms
                                 Storage Table Read Ops: 1
                                 Storage Table Rows Scanned: 4
 Planning Time: 0.397 ms
 Execution Time: 327.298 ms
 Storage Read Requests: 121
 Storage Read Execution Time: 264.354 ms
 Storage Read Ops: 121
 Storage Rows Scanned: 120290
 Catalog Read Requests: 0
 Catalog Read Ops: 0
 Catalog Write Requests: 0
 Storage Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 264.354 ms
 Peak Memory Usage: 1476 kB
(48 rows)

What the planner does

From the plan:

  • Builds the small side first:
    ○ countries filtered to GB → 1 row
    ○ users_active_idx → 20,000 rows
    ○ join → 2,000 active GB users
  • Then joins those users to events:
    ○ events_type_idx → 40,143 purchase events
    ○ final result → 407 rows

Join shape (logically):

    (users ⨝ countries) ⨝ events

Key metrics from this run:

  • Final count: 407
  • Execution Time: 327.298 ms
  • Peak Memory: ~1.4 MB

Step 5️⃣: Force YugabyteDB to honor your SQL join order

Now we take control, just like /*+ ORDERED */:

    BEGIN;
    SET LOCAL join_collapse_limit = 1;

    EXPLAIN (ANALYZE, DIST)
    SELECT count(*)
    FROM events e
    JOIN users u     ON u.user_id = e.user_id
    JOIN countries c ON c.country_id = u.country_id
    WHERE c.code = 'GB'
      AND u.is_active = true
      AND e.event_type = 'purchase';

    COMMIT;

Example:

join_order_demo=# BEGIN;
BEGIN

join_order_demo=*# SET LOCAL join_collapse_limit = 1;
SET

join_order_demo=*# EXPLAIN (ANALYZE, DIST)
join_order_demo-*# SELECT count(*)
join_order_demo-*# FROM events e
join_order_demo-*# JOIN users u     ON u.user_id = e.user_id
join_order_demo-*# JOIN countries c ON c.country_id = u.country_id
join_order_demo-*# WHERE c.code = 'GB'
join_order_demo-*#   AND u.is_active = true
join_order_demo-*#   AND e.event_type = 'purchase';
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15449.80..15449.81 rows=1 width=8) (actual time=326.649..326.655 rows=1 loops=1)
   ->  Hash Join  (cost=5290.98..15447.48 rows=927 width=0) (actual time=101.078..326.504 rows=407 loops=1)
         Hash Cond: (u.country_id = c.country_id)
         ->  Hash Join  (cost=5249.73..15386.18 rows=3708 width=4) (actual time=100.248..324.861 rows=3859 loops=1)
               Hash Cond: (e.user_id = u.user_id)
               ->  Index Scan using events_type_idx on events e  (cost=72.81..10067.39 rows=38333 width=8) (actual time=7.294..214.608 rows=40143 loops=1)
                     Index Cond: (event_type = 'purchase'::text)
                     Storage Table Read Requests: 40
                     Storage Table Read Execution Time: 174.615 ms
                     Storage Table Read Ops: 40
                     Storage Table Rows Scanned: 40143
                     Storage Index Read Requests: 40
                     Storage Index Read Execution Time: 4.525 ms
                     Storage Index Read Ops: 40
                     Storage Index Rows Scanned: 40143
               ->  Hash  (cost=4935.08..4935.08 rows=19347 width=12) (actual time=92.887..92.888 rows=20000 loops=1)
                     Buckets: 32768  Batches: 1  Memory Usage: 1116kB
                     ->  Index Scan using users_active_idx on users u  (cost=68.91..4935.08 rows=19347 width=12) (actual time=6.810..86.139 rows=20000 loops=1)
                           Index Cond: (is_active = true)
                           Storage Table Read Requests: 20
                           Storage Table Read Execution Time: 69.047 ms
                           Storage Table Read Ops: 20
                           Storage Table Rows Scanned: 20000
                           Storage Index Read Requests: 20
                           Storage Index Read Execution Time: 2.015 ms
                           Storage Index Read Ops: 20
                           Storage Index Rows Scanned: 20000
         ->  Hash  (cost=41.24..41.24 rows=1 width=4) (actual time=0.801..0.803 rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on countries c  (cost=20.00..41.24 rows=1 width=4) (actual time=0.788..0.792 rows=1 loops=1)
                     Storage Filter: (code = 'GB'::text)
                     Storage Table Read Requests: 1
                     Storage Table Read Execution Time: 0.657 ms
                     Storage Table Read Ops: 1
                     Storage Table Rows Scanned: 4
 Planning Time: 0.485 ms
 Execution Time: 326.774 ms
 Storage Read Requests: 121
 Storage Read Execution Time: 250.858 ms
 Storage Read Ops: 121
 Storage Rows Scanned: 120290
 Catalog Read Requests: 0
 Catalog Read Ops: 0
 Catalog Write Requests: 0
 Storage Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 250.858 ms
 Peak Memory Usage: 2547 kB
(48 rows)

join_order_demo=*# COMMIT;
COMMIT

What changed

With join reordering disabled:

  • YugabyteDB must follow the SQL order exactly
  • First join: events ⨝ users
    ○ produces 3,859 rows
  • Second join: result ⨝ countries (GB)
    ○ produces the same 407 rows

Join shape now is:

    (events ⨝ users) ⨝ countries

Key metrics from this run:

  • Final count: 407 (same result)
  • Execution Time: 326.774 ms
  • Peak Memory: ~2.5 MB
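One detail worth verifying for yourself: because the override used SET LOCAL, it vanishes at COMMIT, so the session is immediately back to normal planner behavior:

```sql
SHOW join_collapse_limit;   -- back to the pre-transaction value (8 by default)
```

If you had used a plain SET instead, the override would persist for the rest of the session, which is exactly what this technique avoids.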

🧠 Key takeaways

  • YugabyteDB normally reorders joins… and that’s usually the right choice
  • Sometimes you know the data better than the optimizer
  • SET LOCAL join_collapse_limit = 1:
    ○ prevents join reordering
    ○ forces YugabyteDB to honor your SQL join order
    ○ applies only to the current transaction
  • This is the modern equivalent of Oracle’s /*+ ORDERED */

The goal isn’t guaranteed speedups… it’s control, determinism, and predictability.

Used carefully, this is a powerful tool for debugging, benchmarking, and understanding query behavior in distributed SQL.

Have Fun!
