Online index creation in YugabyteDB is powerful: you can add columns, create indexes, and keep the application running without blocking reads or writes.
But when a table is large (hundreds of GB or hundreds of millions of rows), index backfill can take a long time, and default settings are intentionally conservative to protect live workloads.
This tip summarizes why backfill can be slow, what makes it retry work, and, most importantly, how to safely speed it up.
🔍 What Is Index Backfill?
When you run CREATE INDEX on a table that already contains data:
YugabyteDB scans the existing rows in the base table.
It writes corresponding index entries into the new or rebuilt index.
Meanwhile, new writes to the table are also copied into the index to maintain consistency.
This is why index creation is online: queries continue to run normally.
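As a concrete example, a plain CREATE INDEX on a table that already holds data triggers exactly this online backfill (table, column, and index names below are illustrative, not from the original post):

```shell
# Online index creation: the backfill runs while the application keeps
# reading and writing the base table. Names are illustrative.
ysqlsh -h 127.0.0.1 -c "CREATE INDEX idx_orders_customer ON orders (customer_id);"
```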
🐢 Why Can Index Backfill Take a Long Time?
1️⃣ Large amount of data
Backfill must read and write every existing row. On a 143 GB table, this can mean hundreds of millions of rows, each of which must be scanned, transformed, and written into the index's distributed storage.
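To get a feel for the scale, here is a back-of-envelope sketch of how many write batches that implies (the 300M row count is an assumption for a table of roughly this size; substitute your own):

```shell
# Assumed: ~300M rows in the base table (hypothetical figure for a ~143 GB table).
ROWS=300000000
DEFAULT_BATCH=128      # default backfill_index_write_batch_size
TUNED_BATCH=2000       # a commonly used tuned value

DEFAULT_BATCHES=$(( ROWS / DEFAULT_BATCH ))
TUNED_BATCHES=$(( ROWS / TUNED_BATCH ))
echo "default batch size: ${DEFAULT_BATCHES} batches"
echo "tuned batch size:   ${TUNED_BATCHES} batches"
```

At the default batch size that is over two million batches, each a separate round of work that can individually slow down, time out, and retry.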
2️⃣ Conservative defaults
For safety, YugabyteDB limits backfill aggressiveness:
• backfill_index_write_batch_size: default 128 (rows written per backfill batch)
• num_concurrent_backfills_allowed: auto-tuned, ~8 per node on nodes with >16 cores (parallel backfill workers)
• several backfill RPC timeouts: relatively small defaults, to protect the cluster from long-running or stalled operations
Safe defaults prevent overload but slow backfill dramatically on large datasets.
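You can check what your cluster is actually running with by dumping a tserver's flags from its web endpoint (a sketch; it assumes the default web UI port 9000 and that the endpoint is reachable from your shell):

```shell
# Dump current gflag values from one tserver and filter for backfill settings.
# Replace localhost:9000 with a real tserver host and web UI port.
TSERVER=localhost:9000
curl -s "http://${TSERVER}/varz" | grep -E 'backfill_index|num_concurrent_backfills'
```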
3️⃣ Timeouts that trigger retries
A batch might run slowly due to:
• leader changes
• network hiccups
• saturated CPU/IO
• uneven tablet distribution
If a batch does not finish before its RPC deadline, the master retries it, often restarting the work from the beginning.
Large batches make retries more expensive, and repeated retries can add hours of unnecessary delay.
4️⃣ Distributed nature of index tables
Since index tablets may live on different nodes, the system must:
• read from one set of nodes,
• write to others,
• coordinate RPCs across the cluster.
This can become a bottleneck on:
• clusters under heavy load
• large-scale deployments
• environments with noisy neighbors or variable network latency
5️⃣ Transaction "safe time" delays
Before backfill starts, YugabyteDB ensures no long-running transactions can see a partial index. If transactions are stuck or slow to commit, backfill may stall waiting for a "safe time".
⚡ How to Safely Speed Up Backfill

1️⃣ Increase backfill_index_write_batch_size

Raising the batch size well above the default of 128 (for example, to 1000-4000) sharply reduces the number of backfill round trips. Clusters have seen a 2× improvement in index build time by increasing this value.
However, larger batches must be paired with larger RPC timeouts (next section). Otherwise, batches may time out and retry, negating the performance improvement.
2️⃣ Increase backfill RPC and grace-period timeouts
When increasing batch size above ~1000, bump timeouts so the system does not misinterpret long-running batches as failures.
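One way to apply a matched batch-size/timeout combination is yb-ts-cli's set_flag, run against every tserver (a sketch with the values discussed in this post; on some versions these flags are not runtime-settable and must instead be set in the tserver configuration followed by a rolling restart):

```shell
# Raise the batch size and the matching RPC timeouts together on one tserver.
# Repeat for each tserver in the cluster; 9100 is the default tserver RPC port.
TSERVER=127.0.0.1:9100
yb-ts-cli --server_address="${TSERVER}" set_flag backfill_index_write_batch_size 2000
yb-ts-cli --server_address="${TSERVER}" set_flag ysql_index_backfill_rpc_timeout_ms 300000
yb-ts-cli --server_address="${TSERVER}" set_flag backfill_index_timeout_grace_margin_ms 60000
yb-ts-cli --server_address="${TSERVER}" set_flag backfill_index_client_rpc_timeout_ms 300000
```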
3️⃣ Leave parallelism (num_concurrent_backfills_allowed) at its default
• A node with 16+ cores gets up to 8 parallel backfill workers
• A 3-node cluster can therefore run up to 24 parallel backfill tasks
Each task spawns a dedicated PG backend.
If the cluster is not resource-constrained, the default parallelism is generally sufficient.
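The arithmetic behind those numbers is simple; a sketch, assuming a 3-node cluster of 16+ core machines:

```shell
NODES=3
WORKERS_PER_NODE=8   # auto-tuned num_concurrent_backfills_allowed on >16-core nodes
TOTAL_TASKS=$(( NODES * WORKERS_PER_NODE ))
echo "${TOTAL_TASKS} parallel backfill tasks, each with its own PG backend"
```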
4️⃣ Optional: Rate-limit backfill
You can intentionally slow backfill (on a per-tablet basis) to avoid overwhelming the cluster via the backfill_index_rate_rows_per_sec flag.
Why this matters:
• Prevents premature timeouts
• Avoids reprocessing entire batches
• Stabilizes progress under load
The default is 0, which means no explicit rate limit is enforced by this flag.
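Because the limit applies per tablet, the aggregate ceiling scales with how many tablets are being backfilled concurrently. A rough sizing sketch (all numbers below are assumptions, not values from the original post):

```shell
RATE_PER_TABLET=5000    # hypothetical backfill_index_rate_rows_per_sec setting
CONCURRENT_TABLETS=24   # e.g., 8 workers per node on a 3-node cluster
ROWS=300000000          # assumed base-table row count

CEILING=$(( RATE_PER_TABLET * CONCURRENT_TABLETS ))
MIN_SECONDS=$(( ROWS / CEILING ))
echo "aggregate ceiling: ${CEILING} rows/sec"
echo "lower bound on backfill time: ${MIN_SECONDS}s"
```

If that lower bound is longer than your maintenance window, the rate limit is too aggressive for the table size.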
5️⃣ Avoid noisy or unstable cluster conditions
Backfill is sensitive to:
• leadership flapping
• high CPU (>85%)
• memory pressure
• slow storage
• network jitter
Best practice:
✅ Run backfills during low-traffic windows
✅ Stabilize leadership first
✅ Ensure nodes aren't heavily loaded
6️⃣ Troubleshooting common backfill stalls
If you see:
ERROR: timed out waiting for postgres backends to catch up
DETAIL: 1 backends on database 13515 are still behind catalog version 2.
You can find lagging PG backends with:
SELECT *
FROM pg_stat_activity
WHERE backend_type != 'walsender'
  AND catalog_version < 2
  AND datid = 13515;
Note: from the DETAIL line above, 13515 is the datid and 2 is the target catalog version; substitute the values from your own error message.
Slow catalog-version advancement usually means:
• An overloaded PG backend
• A timeout too small for the configured batch size
If this happens, reduce the batch size (e.g., from 4000 → 2000) or increase timeouts.
🧪 Recommended Starting Point for Large Table Backfills
For clusters with adequate CPU and stable workload:
backfill_index_write_batch_size = 2000 # or 4000 if cluster is very stable
ysql_index_backfill_rpc_timeout_ms = 300000
backfill_index_timeout_grace_margin_ms = 60000
backfill_index_client_rpc_timeout_ms = 300000
If you see retries or catalog lag:
• Reduce the batch size (2000 → 1000)
• Increase timeouts further
• Reduce system load
🏁 Final Guidance
If you're adding columns and recreating or backfilling indexes on a large table:
✅ Expect the operation to be CPU/IO intensive
✅ Plan a tuning pass before running it on production-sized data
✅ Start with higher batch sizes + tuned timeouts
✅ Monitor for backfill retries or catalog lag
✅ Validate that the cluster has enough headroom for parallel backfills
With correct tuning, even huge backfills can complete several times faster while still running online.
Have Fun!