Modern SQL optimizers are incredibly smart. Most of the time, they’re far better at choosing an efficient execution plan than any human.
Most of the time.
But if you’ve been around long enough, especially if you’ve spent time as an Oracle developer or DBA, you know there are moments when you want the optimizer to stand down and simply do exactly what you wrote.
Back in my Oracle days, I leaned heavily on the optimizer hint:
/*+ ORDERED */
When I used it, I wasn’t guessing… I was taking control. I knew the data, the skew, the cardinality quirks, the edge cases the optimizer couldn’t always see. And sometimes, forcing join order was the difference between a query that finished in milliseconds and one that went completely off the rails.
🧠 The same problem still exists… even in distributed SQL
YugabyteDB’s YSQL layer inherits the PostgreSQL query planner, which aggressively reorders joins to find what it believes is the cheapest plan.
That’s usually a good thing.
But in a distributed SQL database, join order can have much bigger consequences than it ever did on a single-node system:
● Remote vs local reads
● Network round trips
● Data fan-out across tablets
● Early vs late filtering of large fact tables
In other words:
- Join order isn’t just a planning detail anymore. It can materially affect latency and cost.
And sometimes… you already know the right order.
🛠️ How YugabyteDB lets you take control (without hints)
YugabyteDB doesn’t recognize Oracle’s /*+ ORDERED */ hint. But it does expose a standard PostgreSQL mechanism that gives you essentially the same power, safely and transaction-scoped.
By temporarily disabling join reordering, you can force YugabyteDB to honor the join order exactly as written in your SQL.
Conceptually:
- “Optimizer – don’t reorder anything. Execute these joins in the order I specified.”
And critically:
● This applies only to the current transaction
● It does not permanently affect the session
● It’s ideal for debugging, benchmarking, and controlled experiments
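In practice, that means one extra statement inside a transaction. Here’s a minimal sketch of the pattern (the demo below uses exactly this); note that it only pins explicit JOIN ... ON syntax, not comma-separated FROM lists:
BEGIN;
SET LOCAL join_collapse_limit = 1;  -- optimizer: keep my written join order
-- run your multi-join query here (explicit JOIN ... ON syntax)
COMMIT;                             -- the setting reverts automatically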
🧪 When would you actually do this?
You might reach for this technique when:
● You’re reproducing a problematic plan
● You’re benchmarking two join strategies
● You know the data distribution better than the stats
● Cardinality estimates are off due to skew
● You want deterministic behavior while tuning
Or, to put it the old Oracle way:
- Sometimes you know the data better than the optimizer.
🧭 What this tip will show
In the sections that follow, we’ll:
1. Spin up a 3-node YugabyteDB cluster using yugabyted
2. Create a schema where join order really matters
3. Run a query where the optimizer reorders joins
4. Force YugabyteDB to honor the SQL join order
5. Compare execution plans side-by-side
You’ll see exactly how this behaves, and why it’s a powerful tool when used deliberately.
⚠️ A word of caution
Just like /*+ ORDERED */ in Oracle, this is a sharp tool.
It’s not something you sprinkle everywhere… but when you need it, it’s invaluable.
Use it:
● locally
● intentionally
● and with a clear understanding of why you’re doing it
🚀 Demo: Forcing join order on a 3-node YugabyteDB cluster
Prerequisites
● YugabyteDB installed locally
● yugabyted and ysqlsh in your PATH
● Enough resources to run three local nodes
Step 1️⃣: Start a 3-node YugabyteDB cluster
Node 1
yugabyted start \
--base_dir=/tmp/yb-node1 \
--listen=127.0.0.1
Node 2
yugabyted start \
--base_dir=/tmp/yb-node2 \
--listen=127.0.0.2 \
--join=127.0.0.1
Node 3
yugabyted start \
--base_dir=/tmp/yb-node3 \
--listen=127.0.0.3 \
--join=127.0.0.1
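Before moving on, you can confirm that all three nodes joined the same cluster by checking the status from any node’s base directory:
yugabyted status --base_dir=/tmp/yb-node1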
Step 2️⃣: Create a schema where join order matters
We’ll use:
● a large fact table (events)
● a medium dimension (users)
● a tiny lookup (countries)
● selective predicates that encourage the optimizer to reorder joins
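Connect to any node with ysqlsh (for example, against node 1) and run the DDL below:
ysqlsh -h 127.0.0.1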
DROP DATABASE IF EXISTS join_order_demo;
CREATE DATABASE join_order_demo;
\c join_order_demo
CREATE TABLE events (
event_id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
event_type TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE users (
user_id BIGINT PRIMARY KEY,
country_id INT NOT NULL,
is_active BOOLEAN NOT NULL
);
CREATE TABLE countries (
country_id INT PRIMARY KEY,
code TEXT NOT NULL UNIQUE
);
CREATE INDEX events_user_id_idx ON events(user_id);
CREATE INDEX events_type_idx ON events(event_type);
CREATE INDEX users_active_idx ON users(is_active);
CREATE INDEX users_country_idx ON users(country_id);
CREATE INDEX countries_code_idx ON countries(code);
Step 3️⃣: Load skewed data
INSERT INTO countries(country_id, code)
VALUES (1,'US'), (2,'CA'), (3,'GB'), (4,'OTHER');
INSERT INTO users(user_id, country_id, is_active)
SELECT i,
CASE WHEN i % 100 = 0 THEN 3 ELSE 1 END, -- ~1% GB
(i % 10 = 0) -- ~10% active
FROM generate_series(1,200000) i;
INSERT INTO events(user_id, event_type, created_at)
SELECT (random()*199999)::int + 1,
CASE WHEN random() < 0.02 THEN 'purchase' ELSE 'view' END,
now() - (random() * interval '30 days')
FROM generate_series(1,2000000);
Analyze:
ANALYZE;
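If you want to sanity-check the skew before querying, this quick optional aggregate should show roughly 18,000 active US users and 2,000 active GB users:
SELECT c.code, count(*) AS active_users
FROM users u
JOIN countries c ON c.country_id = u.country_id
WHERE u.is_active
GROUP BY c.code
ORDER BY active_users DESC;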
Step 4️⃣: Run the query (optimizer free to reorder joins)
EXPLAIN (ANALYZE, DIST)
SELECT count(*)
FROM events e
JOIN users u ON u.user_id = e.user_id
JOIN countries c ON c.country_id = u.country_id
WHERE c.code = 'GB'
AND u.is_active = true
AND e.event_type = 'purchase';
Example:
join_order_demo=# EXPLAIN (ANALYZE, DIST)
SELECT count(*)
FROM events e
JOIN users u ON u.user_id = e.user_id
JOIN countries c ON c.country_id = u.country_id
WHERE c.code = 'GB'
AND u.is_active = true
AND e.event_type = 'purchase';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=15364.12..15364.13 rows=1 width=8) (actual time=327.204..327.208 rows=1 loops=1)
-> Hash Join (cost=5214.21..15361.80 rows=927 width=0) (actual time=102.817..327.090 rows=407 loops=1)
Hash Cond: (e.user_id = u.user_id)
-> Index Scan using events_type_idx on events e (cost=72.81..10067.39 rows=38333 width=8) (actual time=7.096..221.129 rows=40143 loops=1)
Index Cond: (event_type = 'purchase'::text)
Storage Table Read Requests: 40
Storage Table Read Execution Time: 185.631 ms
Storage Table Read Ops: 40
Storage Table Rows Scanned: 40143
Storage Index Read Requests: 40
Storage Index Read Execution Time: 4.315 ms
Storage Index Read Ops: 40
Storage Index Rows Scanned: 40143
-> Hash (cost=5080.93..5080.93 rows=4837 width=8) (actual time=95.654..95.657 rows=2000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 143kB
-> Hash Join (cost=110.16..5080.93 rows=4837 width=8) (actual time=6.813..94.909 rows=2000 loops=1)
Hash Cond: (u.country_id = c.country_id)
-> Index Scan using users_active_idx on users u (cost=68.91..4935.08 rows=19347 width=12) (actual time=5.889..90.038 rows=20000 loops=1)
Index Cond: (is_active = true)
Storage Table Read Requests: 20
Storage Table Read Execution Time: 72.024 ms
Storage Table Read Ops: 20
Storage Table Rows Scanned: 20000
Storage Index Read Requests: 20
Storage Index Read Execution Time: 1.622 ms
Storage Index Read Ops: 20
Storage Index Rows Scanned: 20000
-> Hash (cost=41.24..41.24 rows=1 width=4) (actual time=0.895..0.896 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on countries c (cost=20.00..41.24 rows=1 width=4) (actual time=0.877..0.880 rows=1 loops=1)
Storage Filter: (code = 'GB'::text)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.762 ms
Storage Table Read Ops: 1
Storage Table Rows Scanned: 4
Planning Time: 0.397 ms
Execution Time: 327.298 ms
Storage Read Requests: 121
Storage Read Execution Time: 264.354 ms
Storage Read Ops: 121
Storage Rows Scanned: 120290
Catalog Read Requests: 0
Catalog Read Ops: 0
Catalog Write Requests: 0
Storage Write Requests: 0
Storage Flush Requests: 0
Storage Execution Time: 264.354 ms
Peak Memory Usage: 1476 kB
(48 rows)
What the planner does
From the plan:
● Builds the small side first:
○ countries filtered to GB → 1 row
○ users_active_idx → 20,000 rows
○ join → 2,000 active GB users
● Then joins those users to events:
○ events_type_idx → 40,143 purchase events
○ final result → 407 rows
Join shape (logically):
(users ⨝ countries) ⨝ events
Key metrics from this run:
● Final count: 407
● Execution Time: 327.298 ms
● Peak Memory: ~1.4 MB
Step 5️⃣: Force YugabyteDB to honor your SQL join order
Now we take control, just like /*+ ORDERED */:
BEGIN;
SET LOCAL join_collapse_limit = 1;
EXPLAIN (ANALYZE, DIST)
SELECT count(*)
FROM events e
JOIN users u ON u.user_id = e.user_id
JOIN countries c ON c.country_id = u.country_id
WHERE c.code = 'GB'
AND u.is_active = true
AND e.event_type = 'purchase';
COMMIT;
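Inside the transaction, you can verify the override is active before running the query:
SHOW join_collapse_limit;  -- reports 1 until COMMIT or ROLLBACK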
Example:
join_order_demo=# BEGIN;
BEGIN
join_order_demo=*# SET LOCAL join_collapse_limit = 1;
SET
join_order_demo=*# EXPLAIN (ANALYZE, DIST)
join_order_demo-*# SELECT count(*)
join_order_demo-*# FROM events e
join_order_demo-*# JOIN users u ON u.user_id = e.user_id
join_order_demo-*# JOIN countries c ON c.country_id = u.country_id
join_order_demo-*# WHERE c.code = 'GB'
join_order_demo-*# AND u.is_active = true
join_order_demo-*# AND e.event_type = 'purchase';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=15449.80..15449.81 rows=1 width=8) (actual time=326.649..326.655 rows=1 loops=1)
-> Hash Join (cost=5290.98..15447.48 rows=927 width=0) (actual time=101.078..326.504 rows=407 loops=1)
Hash Cond: (u.country_id = c.country_id)
-> Hash Join (cost=5249.73..15386.18 rows=3708 width=4) (actual time=100.248..324.861 rows=3859 loops=1)
Hash Cond: (e.user_id = u.user_id)
-> Index Scan using events_type_idx on events e (cost=72.81..10067.39 rows=38333 width=8) (actual time=7.294..214.608 rows=40143 loops=1)
Index Cond: (event_type = 'purchase'::text)
Storage Table Read Requests: 40
Storage Table Read Execution Time: 174.615 ms
Storage Table Read Ops: 40
Storage Table Rows Scanned: 40143
Storage Index Read Requests: 40
Storage Index Read Execution Time: 4.525 ms
Storage Index Read Ops: 40
Storage Index Rows Scanned: 40143
-> Hash (cost=4935.08..4935.08 rows=19347 width=12) (actual time=92.887..92.888 rows=20000 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 1116kB
-> Index Scan using users_active_idx on users u (cost=68.91..4935.08 rows=19347 width=12) (actual time=6.810..86.139 rows=20000 loops=1)
Index Cond: (is_active = true)
Storage Table Read Requests: 20
Storage Table Read Execution Time: 69.047 ms
Storage Table Read Ops: 20
Storage Table Rows Scanned: 20000
Storage Index Read Requests: 20
Storage Index Read Execution Time: 2.015 ms
Storage Index Read Ops: 20
Storage Index Rows Scanned: 20000
-> Hash (cost=41.24..41.24 rows=1 width=4) (actual time=0.801..0.803 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on countries c (cost=20.00..41.24 rows=1 width=4) (actual time=0.788..0.792 rows=1 loops=1)
Storage Filter: (code = 'GB'::text)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.657 ms
Storage Table Read Ops: 1
Storage Table Rows Scanned: 4
Planning Time: 0.485 ms
Execution Time: 326.774 ms
Storage Read Requests: 121
Storage Read Execution Time: 250.858 ms
Storage Read Ops: 121
Storage Rows Scanned: 120290
Catalog Read Requests: 0
Catalog Read Ops: 0
Catalog Write Requests: 0
Storage Write Requests: 0
Storage Flush Requests: 0
Storage Execution Time: 250.858 ms
Peak Memory Usage: 2547 kB
(48 rows)
join_order_demo=*# COMMIT;
COMMIT
What changed
With join reordering disabled:
● YugabyteDB must follow the SQL order exactly
● First join: events ⨝ users
○ produces 3,859 rows
● Second join: result ⨝ countries (GB)
○ produces the same 407 rows
Join shape now is:
(events ⨝ users) ⨝ countries
Key metrics from this run:
● Final count: 407 (same result)
● Execution Time: 326.774 ms
● Peak Memory: ~2.5 MB
🧠 Key takeaways
● YugabyteDB normally reorders joins… and that’s usually the right choice
● Sometimes you know the data better than the optimizer
● SET LOCAL join_collapse_limit = 1:
○ prevents join reordering
○ forces YugabyteDB to honor your SQL join order
○ applies only to the current transaction
● This is the modern equivalent of Oracle’s /*+ ORDERED */
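If you ever need the override beyond a single transaction, say for a whole tuning session, plain SET works too; a small sketch of the difference:
-- Transaction-scoped (recommended): reverts automatically at COMMIT/ROLLBACK
BEGIN;
SET LOCAL join_collapse_limit = 1;
COMMIT;

-- Session-scoped: persists until you RESET it or disconnect
SET join_collapse_limit = 1;
RESET join_collapse_limit;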
The goal isn’t guaranteed speedups… it’s control, determinism, and predictability.
Used carefully, this is a powerful tool for debugging, benchmarking, and understanding query behavior in distributed SQL.
Have Fun!
