Recreating Production Optimizer Behavior in YugabyteDB with cbo_stat_dump

TL;DR
When a query plan changes in production, teams often try to copy production data into another environment to reproduce the issue. That’s slow, risky, and usually unnecessary. The cbo_stat_dump utility lets you capture the optimizer’s inputs… schema, planner settings, and statistics… so you can recreate production optimizer behavior without copying customer data.

What This Tip Covers

This is a practical, field-ready walkthrough of how to:

  • Export schema + planner settings + statistics

  • Recreate optimizer cost model inputs

  • Reproduce join strategy decisions (Nested Loop vs. Hash Join)

  • Debug plan changes safely

  • Share reproducible cases with engineering

We are not reproducing exact runtime metrics or timing; we are reproducing the optimizer’s mental model.

That distinction matters.

Why Optimizer Behavior Changes

In YugabyteDB (like PostgreSQL), join strategies are chosen based on:

  • Table cardinality (pg_class.reltuples)

  • Column selectivity (pg_statistic)

  • Extended statistics

  • Planner GUCs

  • (In YB) relevant gFlags

If statistics change, the plan can flip… nothing about the SQL has to change.

That’s what makes debugging tricky.
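You can inspect these inputs directly with catalog queries. A quick sketch, using the demo tables built later in this tip (any schema-qualified tables work the same way):

```
-- The row count estimate the planner uses (not a live count)
SELECT relname, reltuples
FROM pg_class
WHERE relname IN ('customers', 'orders');

-- Per-column selectivity inputs, exposed through the pg_stats view
SELECT tablename, attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'demo';
```

If these numbers drift from reality (stale ANALYZE, bulk loads, data skew), the optimizer is reasoning from a stale model of your data.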

The Demo: Join Strategy Flip

We’ll build a scenario where:

  • With highly selective filtering → Nested Loop

  • With non-selective filtering → Hash Join

Then we’ll:

  1. Capture the optimizer context using cbo_stat_dump

  2. Recreate it in a clean environment

  3. Observe the same join strategy, even with empty tables

Step 0: Install cbo_stat_dump

Important
Even though this is for YugabyteDB, the Python driver (psycopg2-yugabytedb) compiles against PostgreSQL client libraries (libpq). You do not need a PostgreSQL server, but you do need the client development headers.
On a machine that can reach your DB

Install Required System Packages (RHEL / Alma / Rocky)

sudo dnf groupinstall -y "Development Tools"

sudo dnf install -y \
  python3-devel \
  postgresql15-devel \
  gcc \
  make

Adjust postgresql15-devel if you are using a different libpq version.

Clone and Set Up the Utility

git clone https://github.com/yugabyte/cbo_stat_dump.git
cd cbo_stat_dump
git checkout e8d78272d3d962f9ff4aeccdf29228da5e9d3139

Create a virtual environment:

python3 -m venv venv
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Exit the virtual environment:

deactivate
Why use a virtual environment?
The cbo_stat_dump utility depends on Python packages (including psycopg2-yugabytedb) that are installed locally inside the virtual environment. Activating the venv ensures the correct Python interpreter and dependencies are used when running the tool, avoids conflicts with system-wide packages, and keeps your OS clean.

Step 1: Create Schema

CREATE SCHEMA IF NOT EXISTS demo;

DROP TABLE IF EXISTS demo.orders;
DROP TABLE IF EXISTS demo.customers;

CREATE TABLE demo.customers (
    customer_id INT PRIMARY KEY,
    region      TEXT NOT NULL,
    status      TEXT NOT NULL
);

CREATE TABLE demo.orders (
    order_id     BIGSERIAL PRIMARY KEY,
    customer_id  INT NOT NULL,
    order_total  NUMERIC,
    created_at   TIMESTAMPTZ NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES demo.customers(customer_id)
);

CREATE INDEX orders_customer_id_idx ON demo.orders(customer_id);
CREATE INDEX customers_region_idx ON demo.customers(region);

Step 2: Insert Skewed Data

We create:

  • 100,000 customers

  • Only 100 VIP customers

  • 5,000,000 orders

INSERT INTO demo.customers
SELECT
    gs,
    CASE
        WHEN gs <= 100 THEN 'VIP'
        ELSE 'STANDARD'
    END,
    'ACTIVE'
FROM generate_series(1, 100000) gs;

INSERT INTO demo.orders (customer_id, order_total, created_at)
SELECT
    (random() * 99999 + 1)::int,
    random() * 1000,
    NOW() - ((gs % 86400) || ' seconds')::interval
FROM generate_series(1, 5000000) gs;

ANALYZE demo.customers;
ANALYZE demo.orders;

Step 3: The Query

SELECT o.order_id, o.order_total
FROM demo.orders o
JOIN demo.customers c
  ON o.customer_id = c.customer_id
WHERE c.region = 'VIP';

With accurate stats, the planner sees:

  • Only 100 VIP rows

  • A very selective predicate

Typical plan shape:

Nested Loop
  -> Index Scan on customers (region = 'VIP')
  -> Index Scan on orders using orders_customer_id_idx

Example:

EXPLAIN
SELECT o.order_id, o.order_total
FROM demo.orders o
JOIN demo.customers c
  ON o.customer_id = c.customer_id
WHERE c.region = 'VIP';

                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join  (cost=114.07..1464.64 rows=4500 width=19)
   Join Filter: (o.customer_id = c.customer_id)
   ->  Index Scan using customers_region_idx on customers c  (cost=42.82..86.88 rows=90 width=4)
         Index Cond: (region = 'VIP'::text)
   ->  Index Scan using orders_customer_id_idx on orders o  (cost=71.25..1372.71 rows=4591 width=23)
         Index Cond: (customer_id = ANY (ARRAY[c.customer_id, $1, $2, ..., $1023]))
Step 4: Change Only Statistics

Now make the filter non-selective:

UPDATE demo.customers
SET region = 'VIP';

ANALYZE demo.customers;

The planner now believes:

  • 100% of customers are VIP

  • The join result will be large

Plan flips to:

Hash Join
  -> Seq Scan on orders
  -> Hash
       -> Seq Scan on customers

Example:

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=9704.64..561936.70 rows=5000000 width=19)
   Hash Cond: (o.customer_id = c.customer_id)
   ->  Seq Scan on orders o  (cost=20.00..483502.06 rows=5000000 width=23)
   ->  Hash  (cost=8434.64..8434.64 rows=100000 width=4)
         ->  Seq Scan on customers c  (cost=20.00..8434.64 rows=100000 width=4)
               Storage Filter: (region = 'VIP'::text)
(6 rows)
Key Insight

Same schema. Same query. Same indexes.

Only the statistics changed, and the optimizer changed its strategy.
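You can confirm the changed input directly. After the UPDATE and ANALYZE, the planner's view of region looks roughly like this (exact output varies by environment):

```
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'demo'
  AND tablename = 'customers'
  AND attname = 'region';
```

Before the update this typically shows two values with 'STANDARD' dominating; afterward, 'VIP' covers essentially all rows, and the predicate is no longer selective.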

Step 5: Dump the Schema + Stats + Planner Context

Run cbo_stat_dump pointing at the database and the query file.

First, make sure you are in the project directory and activate the virtual environment:

cd cbo_stat_dump
source venv/bin/activate

You should now see (venv) in your shell prompt.

Now run the dump:

./cbo_stat_dump \
  --yb_mode \
  -h <yb_host> -p 5433 -d yugabyte -u yugabyte \
  -q ./query.sql \
  -o /tmp/cbo_demo_dump

This captures:

  • Schema (DDL)

  • Planner GUC overrides

  • Yugabyte gFlags

  • Table statistics

  • The query and captured plan

Note: The script’s CLI options and YB mode are documented in the README.

You now have a portable snapshot of the optimizer’s inputs stored in files like:

  • /tmp/cbo_demo_dump/ddl.sql

  • /tmp/cbo_demo_dump/import_statistics.sql

  • /tmp/cbo_demo_dump/query_plan.txt

  • etc.

You can exit the virtual environment using:

deactivate
Optional: Package the Dump for Sharing

Packaging the dump into a single .tar.gz file makes it easy to share with your friendly YugabyteDB SE or support engineer… without moving any customer data:

tar -czf cbo_demo_dump.tar.gz -C /tmp cbo_demo_dump

Step 6: Reproduce in a Clean Environment (No Customer Data)

Create a fresh database:

ysqlsh -h <host> -p 5433 -U yugabyte -c "CREATE DATABASE clean_db;"

Load schema only:

# 1. Create the schema
ysqlsh -h <host> -p 5433 -U yugabyte -d clean_db -c "CREATE SCHEMA IF NOT EXISTS demo;"

# 2. Create the objects
ysqlsh -h <host> -p 5433 -U yugabyte -d clean_db -f /tmp/cbo_demo_dump/ddl.sql

Tables are empty.

Now import statistics and extended statistics:

# 1. Import statistics
ysqlsh -h <host> -p 5433 -U yugabyte -d clean_db -f /tmp/cbo_demo_dump/import_statistics.sql

# 2. Import extended statistics
ysqlsh -h <host> -p 5433 -U yugabyte -d clean_db -f /tmp/cbo_demo_dump/import_statistics_ext.sql

Step 7: Observe Optimizer Behavior (No Data Required)

Run:

EXPLAIN
SELECT o.order_id, o.order_total
FROM demo.orders o
JOIN demo.customers c
  ON o.customer_id = c.customer_id
WHERE c.region = 'VIP';

Even though the tables contain zero rows:

  • pg_class.reltuples says they’re large

  • pg_statistic says region selectivity is low

The planner will choose the same join strategy as production.

You recreated optimizer behavior… not data.
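A quick sanity check makes this concrete: the actual row counts are zero, yet pg_class still carries the production-sized estimates imported from the dump:

```
-- Actual contents: empty
SELECT count(*) FROM demo.customers;
SELECT count(*) FROM demo.orders;

-- What the planner believes
SELECT relname, reltuples
FROM pg_class
WHERE relname IN ('customers', 'orders');
```

The count(*) queries return 0, while reltuples reflects the captured production values.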

Important Clarification

This approach recreates optimizer decisions and overall plan shape.

It does not guarantee identical cost numbers, row estimates, or runtime timing in every environment.

Why This Is Powerful

This technique allows you to:

  • Debug join plan changes after upgrades

  • Reproduce regressions without copying millions of rows

  • Share optimizer-state test cases with engineering

  • Compare cost model behavior across versions

  • Investigate plan instability safely

In distributed systems, moving data is expensive.

Moving statistics is lightweight.

Final Takeaway

The cost-based optimizer does not think in data… it thinks in statistics.

cbo_stat_dump lets you capture the optimizer’s mental model and replay it elsewhere.

When a join flips in production and no one knows why, this utility gives you a controlled way to recreate the decision process, without ever touching customer data.

And that’s the difference between speculation and systematic debugging.

Have Fun!

A few weeks late posting this pic, but Ragged Point never disappoints. Where Highway 1 starts to twist, the Pacific stretches endlessly blue, redwoods rise behind you, and every hairpin turn feels like stepping into a kaleidoscope of cliffs, mist, and light.