Use Query Diagnostics to Investigate Real-World Queries

YugabyteDB’s new Query Diagnostics feature, available in release 2025.1.1.1, lets you capture a rich diagnostic bundle for a particular query over a window of time. You get bind variable/constant logging, EXPLAIN (ANALYZE, DIST) plans, session history, schema context, and aggregated statistics … all exported for offline analysis.

This is especially helpful when a query intermittently degrades (e.g. due to parameter skew, data growth, or unexpected join patterns). Instead of guessing, you can record the truth of what was happening on the cluster.

Here’s how you might use it in a “national parks” domain.

Example: Diagnosing a slow “popular route” query in a Parks app

First, let’s create a single-node YugabyteDB cluster using yugabyted, setting the appropriate gFlags to enable the Query Diagnostics feature:

				
./yugabyted start --tserver_flags="allowed_preview_flags_csv={ysql_yb_enable_query_diagnostics},ysql_yb_enable_query_diagnostics=true" --master_flags="allowed_preview_flags_csv={ysql_yb_enable_query_diagnostics},ysql_yb_enable_query_diagnostics=true" > start1.log
				
			

Imagine you’re building a national parks analytics app. You have two tables:

				
					DROP TABLE IF EXISTS visitor_routes CASCADE;
DROP TABLE IF EXISTS parks CASCADE;

CREATE TABLE parks (
  park_id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  state TEXT NOT NULL,
  established_date DATE,
  area_sq_mi NUMERIC
);

CREATE TABLE visitor_routes (
  route_id SERIAL PRIMARY KEY,
  park_id INT NOT NULL REFERENCES parks(park_id),
  start_point TEXT NOT NULL,
  end_point TEXT NOT NULL,
  start_state TEXT NOT NULL,
  end_state TEXT NOT NULL,
  distance_mi NUMERIC,
  visitors_last_year INT,
  created_at TIMESTAMP DEFAULT now()
);

-- Let’s insert some National Parks across multiple states, and generate synthetic visitor routes for each.
INSERT INTO parks (name, state, established_date, area_sq_mi) VALUES
 ('Yellowstone', 'WY', '1872-03-01', 3471),
 ('Yosemite', 'CA', '1890-10-01', 1169),
 ('Grand Canyon', 'AZ', '1919-02-26', 1902),
 ('Rocky Mountain', 'CO', '1915-01-26', 415),
 ('Zion', 'UT', '1919-11-19', 229),
 ('Everglades', 'FL', '1934-05-30', 1508),
 ('Acadia', 'ME', '1919-02-26', 47),
 ('Arches', 'UT', '1929-04-12', 120),
 ('Glacier', 'MT', '1910-05-11', 1583),
 ('Olympic', 'WA', '1938-06-29', 1442);

-- Now populate visitor_routes with random data (50–150 routes per park):
DO $$
DECLARE
  park RECORD;
  route_count INT;
  start_state TEXT;
  end_state TEXT;
  i INT;
BEGIN
  FOR park IN SELECT * FROM parks LOOP
    route_count := (50 + random() * 100)::INT;

    FOR i IN 1..route_count LOOP
      IF random() < 0.8 THEN
        start_state := park.state;
        end_state := park.state;
      ELSE
        SELECT state INTO start_state FROM parks ORDER BY random() LIMIT 1;
        SELECT state INTO end_state FROM parks ORDER BY random() LIMIT 1;
      END IF;

      INSERT INTO visitor_routes (
        park_id, start_point, end_point,
        start_state, end_state,
        distance_mi, visitors_last_year
      )
      VALUES (
        park.park_id,
        'Trailhead_' || (1000 + random()*9000)::INT,
        'Lookout_' || (1000 + random()*9000)::INT,
        start_state, end_state,
        round((1 + random() * 30)::numeric, 2),
        (100 + random() * 5000)::INT
      );
    END LOOP;
  END LOOP;
END $$;
				
			

Here’s a query that sometimes runs fast, sometimes slow … perfect for yb_query_diagnostics.

It joins and filters based on state and route distance, so performance varies by data distribution.

				
					SELECT
  p.name AS park_name,
  vr.start_point,
  vr.end_point,
  vr.distance_mi,
  vr.visitors_last_year
FROM
  parks p
  JOIN visitor_routes vr ON vr.park_id = p.park_id
WHERE
  p.state = vr.start_state
  AND vr.distance_mi > 15
ORDER BY
  vr.visitors_last_year DESC
LIMIT 20;
				
			

Suppose occasionally this query runs slowly (e.g. 500ms), especially when parks have many routes.
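
Before triggering a capture, it can be useful to grab a one-off baseline plan by hand, using the same EXPLAIN options the diagnostics bundle records (default ysqlsh connection settings are assumed here):

```shell
# One-off baseline plan, captured with the same options the bundle uses
ysqlsh -c "EXPLAIN (ANALYZE, DIST)
SELECT p.name, vr.start_point, vr.end_point, vr.distance_mi, vr.visitors_last_year
FROM parks p
JOIN visitor_routes vr ON vr.park_id = p.park_id
WHERE p.state = vr.start_state AND vr.distance_mi > 15
ORDER BY vr.visitors_last_year DESC
LIMIT 20;"
```

Having a baseline makes it easier to spot plan changes later in the bundle’s captured plans.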

You find the query’s queryid via pg_stat_statements:

				
					SELECT
  queryid,
  calls,
  total_exec_time,
  max_exec_time,
  mean_exec_time,
  left(regexp_replace(query, '\s+', ' ', 'g'), 180) AS sample_query
FROM pg_stat_statements
WHERE query ILIKE '%distance_mi > 15%'        -- adjust the needle if needed
ORDER BY total_exec_time DESC
LIMIT 5;
				
			

Say you get queryid = -806893278141218758.

Now trigger diagnostics:

				
					SELECT yb_query_diagnostics(
  query_id => -806893278141218758,
  explain_sample_rate => 20,        -- capture EXPLAIN for ~20% of runs
  diagnostics_interval_sec => 120,  -- record for 2 minutes
  explain_analyze => true,
  explain_dist => true,
  bind_var_query_min_duration_ms => 50
);
				
			

While the diagnostics run, the bundle will collect:

  • Bind variable/constant logs (for runs >50 ms)

  • EXPLAIN ANALYZE + distributed plan snapshots

  • Active session history (to see wait events, I/O stalls)

  • Schema context (so you know which indexes/column stats were in effect)

  • Aggregated pg_stat_statements metrics for that time window

Once it completes, you’ll see a folder path in the result. Use yb_query_diagnostics_status to check progress:

				
					SELECT * FROM yb_query_diagnostics_status
 WHERE query_id = -806893278141218758;
				
			
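
Rather than sleeping for a fixed interval, you can poll that view until the capture finishes. This is a sketch: it assumes the view exposes query_id and status columns and that any terminal status no longer reads as in-progress; check your release’s docs for the exact status values.

```shell
#!/usr/bin/env bash
# Poll yb_query_diagnostics_status until the bundle for our query finishes.
QUERYID="${1:--806893278141218758}"   # pass your own queryid as the first argument

while :; do
  STATUS="$(ysqlsh -qAtX -c \
    "SELECT status FROM yb_query_diagnostics_status WHERE query_id = ${QUERYID};")"
  echo "status: ${STATUS:-<none yet>}"
  # Stop once a status is reported and it is no longer an in-progress value
  if [[ -n "$STATUS" && "${STATUS,,}" != *progress* ]]; then
    break
  fi
  sleep 5
done
echo "Capture finished with status: $STATUS"
```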
All-in-One Helper Script (Making our simple demo even simpler!)

Here’s a fully working Bash script that:

  1. Warms up the query

  2. Starts yb_query_diagnostics()

  3. Runs the query repeatedly during the capture window

  4. Waits for completion

  5. Packages the non-empty bundle automatically

Save as yb_run_query_diagnostics.sh:

				
					#!/usr/bin/env bash
# Usage: ./yb_run_query_diagnostics.sh [num_runs_during_capture] [duration_sec]
# Example: ./yb_run_query_diagnostics.sh 200 120
set -euo pipefail

NUM_RUNS="${1:-200}"
DURATION="${2:-120}"

QUERY=$'SELECT\n  p.name AS park_name,\n  vr.start_point,\n  vr.end_point,\n  vr.distance_mi,\n  vr.visitors_last_year\nFROM\n  parks p\n  JOIN visitor_routes vr ON vr.park_id = p.park_id\nWHERE\n  p.state = vr.start_state\n  AND vr.distance_mi > 15\nORDER BY\n  vr.visitors_last_year DESC\nLIMIT 20;'

say(){ printf "%s\n" "$*" >&2; }
need(){ command -v "$1" >/dev/null || { say "❌ Missing: $1"; exit 1; }; }
need ysqlsh; need tar

say "🧹 Resetting pg_stat_statements..."
ysqlsh -qAtX -c "SELECT pg_stat_statements_reset();" >/dev/null || true

say "🔥 Warmup: running query once..."
ysqlsh -qAtX -c "$QUERY" >/dev/null || true

say "🔍 Locating queryid..."
QUERYID="$(ysqlsh -qAtX <<'SQL'
SELECT queryid
FROM pg_stat_statements
WHERE query ILIKE '%visitor_routes%' AND query ILIKE '%parks%'
ORDER BY total_exec_time DESC
LIMIT 1;
SQL
)"
[[ -n "$QUERYID" ]] || { say "❌ No queryid found."; exit 1; }
say "✅ queryid: $QUERYID"

say "🚀 Launching yb_query_diagnostics() for ${DURATION}s..."
DIAG_RESULT="$(ysqlsh -qAtX <<SQL
SELECT yb_query_diagnostics(
  query_id => ${QUERYID},
  explain_sample_rate => 25,
  diagnostics_interval_sec => ${DURATION},
  explain_analyze => true,
  explain_dist => true,
  bind_var_query_min_duration_ms => 0
);
SQL
)"
say "ℹ️ yb_query_diagnostics() returned:"
say "$DIAG_RESULT"

FUNC_PATH="$(printf "%s\n" "$DIAG_RESULT" | awk '/\//{print; exit}')"

say "🏃 Running query ${NUM_RUNS} times during capture..."
for i in $(seq 1 "$NUM_RUNS"); do
  ysqlsh -qAtX -c "$QUERY" >/dev/null || true
done

say "⏳ Waiting ${DURATION}s for capture window to finish..."
sleep "$DURATION"

# Prefer the path yb_query_diagnostics() returned; fall back to the freshest
# non-empty folder under query-diagnostics
FOUND_PATH="${FUNC_PATH%/}"
if [[ ! -d "$FOUND_PATH" ]]; then
  BASE="/root/var/data/pg_data/query-diagnostics/${QUERYID}"
  FOUND_PATH="$(find "$BASE" -type f -printf '%T@ %p\n' 2>/dev/null | sort -nr | head -n1 | awk '{print $2}' | xargs -r dirname)"
fi

[[ -n "$FOUND_PATH" && -d "$FOUND_PATH" ]] || { say "❌ Could not find diagnostic output."; exit 1; }

say "📂 Found diagnostics at: $FOUND_PATH"
OUT_TGZ="query_diag_$(date +%Y%m%d_%H%M%S).tar.gz"
tar -czf "$OUT_TGZ" -C "$(dirname "$FOUND_PATH")" "$(basename "$FOUND_PATH")"
say "✅ Packaged: $OUT_TGZ"
				
			

Run it:

				
					chmod +x yb_run_query_diagnostics.sh
./yb_run_query_diagnostics.sh 200 120
				
			

Example:

				
					[root@localhost ~]# chmod +x yb_run_query_diagnostics.sh

[root@localhost ~]# ./yb_run_query_diagnostics.sh 200 120
🧹 Resetting pg_stat_statements...
🔥 Warmup: running query once...
🔍 Locating queryid...
✅ queryid: -806893278141218758
🚀 Launching yb_query_diagnostics() for 120s...
ℹ️ yb_query_diagnostics() returned:
/root/var/data/pg_data/query-diagnostics/-806893278141218758/3565569516/
🏃 Running query 200 times during capture...
⏳ Waiting 120s for capture window to finish...
📂 Found diagnostics at: /root/var/data/pg_data/query-diagnostics/-806893278141218758/3565569516
✅ Packaged: query_diag_20251008_134853.tar.gz

[root@localhost ~]# mkdir diag

[root@localhost ~]# tar -xzf query_diag_20251008_134853.tar.gz -C diag

[root@localhost ~]# find diag -maxdepth 3 -type f | head -n 30
diag/3565569516/explain_plan.txt
diag/3565569516/constants_and_bind_variables.csv
diag/3565569516/pg_stat_statements.csv
diag/3565569516/active_session_history.csv
diag/3565569516/schema_details.txt
				
			
What You Can Learn

1️⃣ Bind variable skew
constants_and_bind_variables.csv shows if one state (like ‘CA’) causes heavier load.

2️⃣ Plan changes
Compare multiple EXPLAIN ANALYZE plans to see if the optimizer picked different paths over time.

3️⃣ Distributed cost
Look at the distributed (DIST) metrics in the captured plans (explain_plan.txt in the bundle) … are certain nodes doing most of the work? That’s a data placement or join locality clue.

4️⃣ Wait events
Inspect active_session_history.csv for common waits (e.g., “read_requests”, “network_queue”). Hot tablets or I/O stalls will stand out.
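
The bind-variable skew check in point 1️⃣ can be sketched end-to-end. The column layout of constants_and_bind_variables.csv may vary by release, so the two-column timestamp,value format below is a made-up stand-in; the point is just the tally:

```shell
#!/usr/bin/env bash
# Sketch: tally bound values to spot parameter skew.
# Assumes a hypothetical two-column CSV: timestamp,value — adjust the awk
# column index to match the real constants_and_bind_variables.csv header.
set -euo pipefail

CSV="$(mktemp)"
cat > "$CSV" <<'EOF'
timestamp,value
2025-10-08T13:40:01,CA
2025-10-08T13:40:02,CA
2025-10-08T13:40:03,WY
2025-10-08T13:40:04,CA
2025-10-08T13:40:05,UT
EOF

# Skip the header, keep the value column, count occurrences, sort by frequency
TALLY="$(awk -F, 'NR>1{print $2}' "$CSV" | sort | uniq -c | sort -nr)"
printf '%s\n' "$TALLY"
rm -f "$CSV"
```

If one value dominates the tally, that parameter is a prime suspect for the slow runs.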

Some examples…

				
# count captured EXPLAIN samples (one "Execution Time" line per plan)
grep "Execution Time" diag/*/explain* | wc -l

# top wait events (adjust the column index to match the CSV header)
awk -F, 'NR>1{print $5}' diag/*/active_session_history.csv | sort | uniq -c | sort -nr | head
				
			
Summary
  • Use yb_query_diagnostics(…) to record a “run window” of real executions with full context.

  • Capture EXPLAIN ANALYZE + distributed plans, bind variable logs, and session history all in one go.

  • In practice, combine these diagnostics with domain knowledge (e.g. large vs small parks, skewed states) to spot parameter or data distribution issues you wouldn’t see from just one slow query trace.

  • After diagnosing, you might add hints, partitioning, index changes, co-location, or query rewrites — then repeat diagnostics if needed to validate.
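
As one hypothetical follow-up from that last point: if the bundle showed the distance_mi filter dominating on skewed states, you might trial an index and re-capture the plan. A sketch only; the index choice is illustrative, not a recommendation for your data:

```shell
# Hypothetical follow-up: trial an index on the filter columns, then
# re-check the plan with the same EXPLAIN options the bundle captures.
ysqlsh <<'SQL'
CREATE INDEX IF NOT EXISTS idx_routes_state_distance
  ON visitor_routes (start_state, distance_mi DESC);

EXPLAIN (ANALYZE, DIST)
SELECT p.name, vr.start_point, vr.end_point, vr.distance_mi, vr.visitors_last_year
FROM parks p
JOIN visitor_routes vr ON vr.park_id = p.park_id
WHERE p.state = vr.start_state AND vr.distance_mi > 15
ORDER BY vr.visitors_last_year DESC
LIMIT 20;
SQL
```

Re-running yb_query_diagnostics() afterward gives you a before/after bundle pair to validate the change.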

Have Fun!

On our recent trip to Badlands National Park in South Dakota, these rattlesnake warning signs were everywhere. Thankfully, we didn’t see any ... though I spent the whole hike looking down instead of up. Pretty sure I spotted every suspicious-looking stick in the park! 😅