How to Diagnose and Fix Stalled Index Backfills in YugabyteDB

Index backfill in YugabyteDB allows you to create new indexes on large tables without blocking reads or writes. However, backfill can stall if one or more Postgres backends are still operating with an older catalog version. When this happens, index creation may appear hung or eventually time out with errors indicating that certain backends have not “caught up.”

This tip explains:

  ● Why catalog version lag matters
  ● How to identify the backend blocking progress
  ● How to fix the issue instantly
  ● How to reproduce the scenario for training and testing

🧠 Why Catalog Versions Matter for Index Backfill

Modern YugabyteDB maintains per-database catalog versions, controlled by:

    --ysql_enable_db_catalog_version_mode=true   (default)

This means:

  ✔ Each database has its own catalog version, stored in:

        SELECT * FROM pg_yb_catalog_version;

  ✔ DDL such as CREATE INDEX or ALTER TABLE increments the catalog version only for that specific database, not for the whole cluster.
  ✔ Only backends connected to that database must refresh their local catalog cache to match the new version.
  ✔ Backfill errors correctly reference the database OID (datid). Example:

        DETAIL: 1 backends on database 13515 are still behind catalog version 2.

  ✔ Legacy: a cluster-wide catalog version still exists, reported via:

        yb-admin ysql_catalog_version

    …but on modern clusters this value is often 0 and is not used for backfill coordination.

🧩 Why Catalog Version Lag Stalls Index Backfill

During index backfill, the system must ensure catalog metadata consistency across all active backends in the database.

Backfill cannot proceed unless:

  • Every backend attached to the target database reaches the new catalog version.

If even one backend is still working with the old version, the backfill pauses, retries, and may time out.

Common causes:

  ● Idle-in-transaction sessions
  ● Long-running queries (analytics, reports, complex scans)
  ● Stuck or abandoned application connections
  ● Connection pools holding open transactions
  ● Connection Manager walsenders (should be ignored)
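
Most of these causes show up as transactions that have been open for a long time. Before looking at catalog versions at all, a quick way to spot them is to list sessions by transaction age. This is a sketch using standard pg_stat_activity columns (pid, state, xact_start, query):

```sql
-- Find sessions sitting idle inside an open transaction, oldest first
SELECT pid,
       state,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND datname = current_database()
ORDER BY xact_age DESC;
```

Any session near the top of this list is a likely suspect for holding an old catalog snapshot.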

🔍 Diagnostic Query

Use this to detect catalog-version lag:

    SELECT
        d.datname,
        a.pid,
        a.usename,
        a.backend_type,
        a.state,
        a.query,
        a.query_start,
        a.catalog_version
    FROM pg_stat_activity a
    JOIN pg_database d ON a.datid = d.oid
    WHERE a.backend_type NOT IN ('walsender', 'yb-conn-mgr walsender')
      AND d.datname = current_database()
    ORDER BY a.catalog_version, a.pid;

To filter only lagging backends:

      AND a.catalog_version < <target_version>

Get <target_version> directly from your backfill timeout error.

Example:

    ERROR:  timed out waiting for postgres backends to catch up
    DETAIL:  1 backends on database 13515 are still behind catalog version 2.

In this DETAIL line, 13515 is the datid (most likely that of the current database) and 2 is the target catalog version.
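
To confirm which database the OID in the error refers to, you can resolve it against pg_database (using the OID from the example above):

```sql
-- Map the datid from the backfill error to a database name
SELECT datname FROM pg_database WHERE oid = 13515;
```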

🎨 Example Output (from a real session)
     datname  |   pid   | usename  |  backend_type  |        state        |                  query                  |          query_start          | catalog_version
    ----------+---------+----------+----------------+---------------------+-----------------------------------------+-------------------------------+-----------------
     yugabyte | 1615032 | yugabyte | client backend | idle in transaction | SELECT txid_current();                  | 2025-12-08 00:45:30.48256+00  |               1
     yugabyte | 1615430 | yugabyte | client backend | active              | CREATE INDEX CONCURRENTLY ...           | 2025-12-08 00:46:26.257648+00 |               2
     yugabyte | 1615556 | yugabyte | client backend | active              | SELECT ... FROM pg_stat_activity ...    | 2025-12-08 00:47:20.967342+00 |               2

Interpretation:

  • PID 1615032: state = 'idle in transaction', catalog_version = 1 → this backend is blocking the index backfill.
  • PIDs 1615430 & 1615556: already updated to catalog_version = 2.

Fix immediately:

    SELECT pg_terminate_backend(1615032);

Backfill resumes automatically.

🧮 “How Far Behind” Query

This helps quantify the lag:

    WITH target AS (
        SELECT <target_version> AS v
    )
    SELECT
        d.datname,
        a.pid,
        a.usename,
        a.state,
        a.catalog_version,
        (t.v - a.catalog_version) AS versions_behind,
        a.query
    FROM pg_stat_activity a
    JOIN target t ON TRUE
    JOIN pg_database d ON a.datid = d.oid
    WHERE a.backend_type NOT IN ('walsender', 'yb-conn-mgr walsender')
      AND d.datname = current_database()
    ORDER BY versions_behind DESC, a.pid;

🧪 Hands-On: Reproduce the Issue (Great for Learning)

These examples let you reproduce catalog version lag and watch how backfill reacts.

Example 1️⃣ Idle-In-Transaction Blocking Backfill

Step 1: Create data

    CREATE TABLE demo_backfill (
      id    BIGINT PRIMARY KEY,
      value TEXT
    );

    INSERT INTO demo_backfill (id, value)
    SELECT g, 'value-' || g
    FROM generate_series(1, 500000) g;

Step 2 (Session A): Create the blocking backend

    BEGIN;
    SELECT txid_current();
    -- Leave this session open

The backend now holds the old catalog snapshot.

Step 3 (Session B): Trigger backfill

    CREATE INDEX CONCURRENTLY demo_backfill_idx_value
    ON demo_backfill (value);

This may eventually time out:

    ERROR: timed out waiting for postgres backends to catch up

Step 4 (Session C): Diagnose & fix

Run the diagnostic query, identify the idle-in-transaction backend, then resolve:

    SELECT pg_terminate_backend(<pid>);

Backfill resumes instantly.
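
To verify that the backfill finished and the index is usable, you can check pg_index; indisvalid and indisready should both be true once the index is live (demo_backfill_idx_value is the index created above):

```sql
-- Check the state of all indexes on the demo table
SELECT i.indexrelid::regclass AS index_name,
       i.indisvalid,
       i.indisready
FROM pg_index i
WHERE i.indrelid = 'demo_backfill'::regclass;
```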

Example 2️⃣ Long-Running Query Blocking Backfill

Step 1: Create a table and load data (run once in any session; skip this if demo_backfill already exists from Example 1)

    CREATE TABLE demo_backfill (
      id    BIGINT PRIMARY KEY,
      value TEXT
    );

    INSERT INTO demo_backfill (id, value)
    SELECT g, 'value-' || g
    FROM generate_series(1, 500000) g;

This gives enough data to produce a meaningful scan.

Step 2 (Session A): Start a slow-running query

    SELECT COUNT(*)
    FROM demo_backfill t,
         LATERAL (
            SELECT *
            FROM generate_series(1, 2000) AS g
         ) AS x;

Why this works:

  ● The generate_series(1, 2000) cross join produces ~1 billion row iterations (500,000 rows × 2,000).
  ● It will run long enough to reliably interfere with DDL.

This backend is now holding an old catalog snapshot and will not advance to a new catalog version until the query completes.

Step 3 (Session B): Start the index backfill

    CREATE INDEX CONCURRENTLY demo_backfill_idx_value2
    ON demo_backfill (value);

If the slow query is still running, this will eventually time out with an error like:

    ERROR: timed out waiting for postgres backends to catch up
    DETAIL: 1 backends on database <db_oid> are still behind catalog version <version>.

This mirrors real-world production issues.

Step 4 (Session C): Identify the blocking backend

    SELECT
        d.datname,
        a.pid,
        a.usename,
        a.backend_type,
        a.state,
        a.query_start,
        a.query,
        a.catalog_version
    FROM pg_stat_activity a
    JOIN pg_database d ON a.datid = d.oid
    WHERE a.backend_type NOT IN ('walsender', 'yb-conn-mgr walsender')
      AND d.datname = current_database()
    ORDER BY a.catalog_version, a.pid;

You should see:

  ● Session A’s long-running query at catalog_version = old_version
  ● The backfill backend and your diagnostic session at catalog_version = new_version

Step 5: Fix the stall

Terminate or cancel the blocking backend:

    SELECT pg_cancel_backend(<pid>);

If necessary:

    SELECT pg_terminate_backend(<pid>);

The moment Session A ends, the backend refreshes its catalog cache and backfill proceeds.
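
To reduce the chance of idle transactions causing this again, you can cap how long a session may sit idle inside a transaction. This is a sketch using the standard PostgreSQL idle_in_transaction_session_timeout setting (the 60s value and the yugabyte database name from the examples above are illustrative; choose a value that suits your workload and pooler behavior):

```sql
-- Per session: abort the session if it idles inside a transaction for > 60s
SET idle_in_transaction_session_timeout = '60s';

-- Or make it the default for the whole database
ALTER DATABASE yugabyte SET idle_in_transaction_session_timeout = '60s';
```

Note that this terminates the offending session rather than just rolling back its transaction, so make sure your application handles reconnects gracefully.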

Conclusion

Index backfill can stall when any backend in the target database is still operating with an outdated catalog version. This typically happens due to idle transactions or long-running queries holding old metadata snapshots. 

With the diagnostic queries shown in this tip, you can quickly identify the lagging backend, resolve the issue, and allow backfill to continue without restarting the index creation process.

Understanding catalog-version behavior, and how to troubleshoot it, turns backfill stalls from a confusing problem into a predictable and easily recoverable operational scenario.

Have Fun!

Upgrading a childhood memory 🚂✨ My brand-new Bachmann train set finally replaces the one I got as a kid (which has been broken for years). Still have the original box from Gee Bees, long gone, but the memories aren’t. And yes… the engines are Santa Fe again, just like when I was little.