Online index creation in YugabyteDB is powerful: you can add columns, create indexes, and keep the application running without blocking reads or writes.
But when a table is large (hundreds of GB or hundreds of millions of rows), index backfill can take a long time, and default settings are intentionally conservative to protect live workloads.
This tip summarizes why backfill can be slow, what makes it retry work, and, most importantly, how to safely speed it up.
🔍 What Is Index Backfill?
When you run CREATE INDEX on a table that already contains data:
YugabyteDB scans the existing rows in the base table.
It writes corresponding index entries into the new or rebuilt index.
Meanwhile, new writes to the table are also copied into the index to maintain consistency.
This is why index creation is online: queries continue to run normally.
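As a concrete example, a plain CREATE INDEX on a table that already holds data triggers exactly this online backfill (table, column, and index names below are illustrative, not from the original post):

```shell
# Online index creation: the backfill runs while the application keeps
# reading and writing the base table. Names are illustrative.
ysqlsh -h 127.0.0.1 -c "CREATE INDEX idx_orders_customer ON orders (customer_id);"
```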
🐢 Why Can Index Backfill Take a Long Time?
1️⃣ Large amount of data
Backfill must read and write every existing row. On a 143 GB table, this can mean hundreds of millions of rows, each of which must be scanned, transformed, and written into the index's distributed storage.
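To get a feel for the scale, here is a back-of-envelope sketch of how many write batches that implies (the 300M row count is an assumption for a table of roughly this size; substitute your own):

```shell
# Assumed: ~300M rows in the base table (hypothetical figure for a ~143 GB table).
ROWS=300000000
DEFAULT_BATCH=128      # default backfill_index_write_batch_size
TUNED_BATCH=2000       # a commonly used tuned value

DEFAULT_BATCHES=$(( ROWS / DEFAULT_BATCH ))
TUNED_BATCHES=$(( ROWS / TUNED_BATCH ))
echo "default batch size: ${DEFAULT_BATCHES} batches"
echo "tuned batch size:   ${TUNED_BATCHES} batches"
```

At the default batch size that is over two million batches, each a separate round of work that can individually slow down, time out, and retry.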
2️⃣ Conservative defaults
For safety, YugabyteDB limits backfill aggressiveness:
• backfill_index_write_batch_size: default 128 (rows written per backfill batch)
• num_concurrent_backfills_allowed: auto-tuned, ~8 per node on nodes with >16 cores (parallel backfill workers)
• several backfill RPC timeouts: relatively small defaults, to protect the cluster from long-running or stalled operations
Safe defaults prevent overload but slow backfill dramatically on large datasets.
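You can check what your cluster is actually running with by dumping a tserver's flags from its web endpoint (a sketch; it assumes the default web UI port 9000 and that the endpoint is reachable from your shell):

```shell
# Dump current gflag values from one tserver and filter for backfill settings.
# Replace localhost:9000 with a real tserver host and web UI port.
TSERVER=localhost:9000
curl -s "http://${TSERVER}/varz" | grep -E 'backfill_index|num_concurrent_backfills'
```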
3️⃣ Timeouts that trigger retries
A batch might run slowly due to:
• leader changes
• network hiccups
• saturated CPU/IO
• uneven tablet distribution
If a batch does not finish before its RPC deadline, the master retries it, often restarting the work from the beginning.
Large batches make retries more expensive, and repeated retries can add hours of unnecessary delay.
4️⃣ Distributed nature of index tables
Since index tablets may live on different nodes, the system must:
• read from one set of nodes,
• write to others,
• coordinate RPCs across the cluster.
This can become a bottleneck on:
• clusters under heavy load
• large-scale deployments
• environments with noisy neighbors or variable network latency
5️⃣ Transaction "safe time" delays
Before backfill starts, YugabyteDB ensures no long-running transactions can see a partial index. If transactions are stuck or slow to commit, backfill may stall waiting for a "safe time".
⚡ How to Safely Speed Up Backfill

1️⃣ Increase backfill_index_write_batch_size

Raising the batch size well above the default of 128 (for example, to 1000-4000) sharply reduces the number of backfill round trips. Clusters have seen a 2× improvement in index build time by increasing this value.
However, larger batches must be paired with larger RPC timeouts (next section). Otherwise, batches may time out and retry, negating the performance improvement.
2️⃣ Increase backfill RPC and grace-period timeouts
When increasing batch size above ~1000, bump timeouts so the system does not misinterpret long-running batches as failures.
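One way to apply a matched batch-size/timeout combination is yb-ts-cli's set_flag, run against every tserver (a sketch with the values discussed in this post; on some versions these flags are not runtime-settable and must instead be set in the tserver configuration followed by a rolling restart):

```shell
# Raise the batch size and the matching RPC timeouts together on one tserver.
# Repeat for each tserver in the cluster; 9100 is the default tserver RPC port.
TSERVER=127.0.0.1:9100
yb-ts-cli --server_address="${TSERVER}" set_flag backfill_index_write_batch_size 2000
yb-ts-cli --server_address="${TSERVER}" set_flag ysql_index_backfill_rpc_timeout_ms 300000
yb-ts-cli --server_address="${TSERVER}" set_flag backfill_index_timeout_grace_margin_ms 60000
yb-ts-cli --server_address="${TSERVER}" set_flag backfill_index_client_rpc_timeout_ms 300000
```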
3️⃣ Leave parallelism (num_concurrent_backfills_allowed) at its default
• A node with 16+ cores gets up to 8 parallel backfill workers
• A 3-node cluster can therefore run up to 24 parallel backfill tasks
Each task spawns a dedicated PG backend.
If the cluster is not resource-constrained, the default parallelism is generally sufficient.
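The arithmetic behind those numbers is simple; a sketch, assuming a 3-node cluster of 16+ core machines:

```shell
NODES=3
WORKERS_PER_NODE=8   # auto-tuned num_concurrent_backfills_allowed on >16-core nodes
TOTAL_TASKS=$(( NODES * WORKERS_PER_NODE ))
echo "${TOTAL_TASKS} parallel backfill tasks, each with its own PG backend"
```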
4️⃣ Optional: Rate-limit backfill
You can intentionally slow backfill (on a per-tablet basis) to avoid overwhelming the cluster via the backfill_index_rate_rows_per_sec flag.
Why this matters:
• Prevents premature timeouts
• Avoids reprocessing entire batches
• Stabilizes progress under load
The default is 0, which means no explicit rate limit is enforced by this flag.
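Because the limit applies per tablet, the aggregate ceiling scales with how many tablets are being backfilled concurrently. A rough sizing sketch (all numbers below are assumptions, not values from the original post):

```shell
RATE_PER_TABLET=5000    # hypothetical backfill_index_rate_rows_per_sec setting
CONCURRENT_TABLETS=24   # e.g., 8 workers per node on a 3-node cluster
ROWS=300000000          # assumed base-table row count

CEILING=$(( RATE_PER_TABLET * CONCURRENT_TABLETS ))
MIN_SECONDS=$(( ROWS / CEILING ))
echo "aggregate ceiling: ${CEILING} rows/sec"
echo "lower bound on backfill time: ${MIN_SECONDS}s"
```

If that lower bound is longer than your maintenance window, the rate limit is too aggressive for the table size.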
5️⃣ Avoid noisy or unstable cluster conditions
Backfill is sensitive to:
• leadership flapping
• high CPU (>85%)
• memory pressure
• slow storage
• network jitter
Best practice:
✅ Run backfills during low-traffic windows
✅ Stabilize leadership first
✅ Ensure nodes aren't heavily loaded
6️⃣ Troubleshooting common backfill stalls
If you see:
ERROR: timed out waiting for postgres backends to catch up
DETAIL: 1 backends on database 13515 are still behind catalog version 2.
You can find lagging PG backends with:
SELECT *
FROM pg_stat_activity
WHERE backend_type != 'walsender'
  AND catalog_version < 2
  AND datid = 13515;
Note: from the DETAIL line above, 13515 is the datid and 2 is the target catalog version; substitute the values from your own error message.
Slow catalog-version advancement usually means:
• An overloaded PG backend
• A timeout too small for the configured batch size
If this happens, reduce the batch size (e.g., from 4000 → 2000) or increase timeouts.
🧪 Recommended Starting Point for Large Table Backfills
For clusters with adequate CPU and stable workload:
backfill_index_write_batch_size = 2000 # or 4000 if cluster is very stable
ysql_index_backfill_rpc_timeout_ms = 300000
backfill_index_timeout_grace_margin_ms = 60000
backfill_index_client_rpc_timeout_ms = 300000
If you see retries or catalog lag:
• Reduce the batch size (2000 → 1000)
• Increase timeouts further
• Reduce system load
🏁 Final Guidance
If you're adding columns and recreating or backfilling indexes on a large table:
✅ Expect the operation to be CPU/IO intensive
✅ Plan a tuning pass before running it on production-sized data
✅ Start with higher batch sizes + tuned timeouts
✅ Monitor for backfill retries or catalog lag
✅ Validate that the cluster has enough headroom for parallel backfills
With correct tuning, even huge backfills can complete several times faster while still running online.
Have Fun!