Don’t Forget to Pre-Split Large Indexes

When designing for scale in YugabyteDB, we often think about pre-splitting large tables.

But indexes matter too.

In YugabyteDB, secondary indexes are distributed objects backed by their own tablets. That means a large index can become a bottleneck just like a large table if it starts with too few tablets.

This is especially important when:

  • ● Creating indexes on tables that already contain a large amount of data
  • ● Migrating large PostgreSQL tables into YugabyteDB
  • ● Backfilling indexes on high-volume tables
  • ● Loading billions of rows into tables with secondary indexes
  • ● Running ingestion-heavy workloads where a small number of tablets receive most of the writes
Tip: Pre-splitting is not only for base tables. Secondary indexes are also distributed across tablets, so large indexes may need their own explicit tablet count at creation time.

The Problem

Imagine a table with billions of rows.

The base table may already have a reasonable number of tablets, but the secondary indexes may not have been explicitly pre-split at creation time.

That means the index starts with a small number of tablets and relies on automatic tablet splitting later.

For a large ingestion or index backfill workload, this can create a temporary, but very real, hot shard problem.

The pattern often looks like this:

  • 1. Initial index writes concentrate on a small number of tablets.
  • 2. One or more nodes become much hotter than the rest of the cluster.
  • 3. Automatic tablet splitting eventually kicks in after size thresholds are reached.
  • 4. The split process takes time and can trigger additional compaction work.
  • 5. Until the index catches up, the workload remains skewed.

Automatic splitting helps, but it is not instant. It is a reactive mechanism, not a substitute for intentional sharding when you already know the object will be large.

Example: Tablet Distribution Tells the Story

In one schema review, most objects had a conservative tablet count, while several larger or faster-growing objects had noticeably higher counts.

Object Observed
Tablets
Notes
Most objects 18 Roughly 1 per node; conservative starting point
large_child_table_a 33 Growing, but still potentially underprovisioned
large_lookup_table_b 37 Requires review based on size and write rate
transaction_table_c 40 Large and ingestion-sensitive
catalog_table_d 91 Previously higher; tablet count changed after truncate/reload activity
primary_fact_table 120 Largest table, multi-billion-row scale

At the same time, Performance Advisor showed hot shard evidence, including strong CPU skew on one node relative to the rest of the cluster.

That is a classic signal that the workload is not being evenly distributed across tablets.

Why This Happens

When a large index is created without enough initial tablets, YugabyteDB has to write the index entries into the tablets that exist at that moment.

For an empty or small object, this may be fine.

For a very large existing table, it can be a problem because the index backfill has to process a large amount of data immediately.

In other words:

  • ● The index is logically new
  • ● The data it needs to index is not small
  • ● The initial tablet count may be too low
  • ● The backfill can hammer a small number of tablets
  • ● Auto-splitting only helps after thresholds are crossed

That gap between “index creation starts” and “tablet splitting catches up” is where hot shards can appear.

The Pre-Splitting Gap

In the environment reviewed, some large indexes had not been pre-split at creation time and instead relied entirely on automatic splitting after the fact.

For very large objects, that creates an important timing gap:

  • 1. Initial writes concentrate on a small number of tablets per index
  • 2. Auto-splitting begins only after the split threshold is exceeded
  • 3. The time between initial population and split completion can create sustained hot shards

That is why pre-splitting matters most when:

  • ● Creating an index on a table that already contains a large amount of data
  • ● Bulk-loading data into a table with one or more secondary indexes
  • ● Building indexes during a migration or post-migration optimization phase
  • ● Backfilling indexes on large, active tables
Watch out: Automatic tablet splitting is reactive. For very large existing tables, an index backfill can create a hot shard window before auto-splitting has time to catch up.

The Key Point

Pre-splitting is not just a table design decision.

It is also an index design decision.

If a secondary index will be built on a very large table, or if it will immediately receive a large volume of writes, consider pre-splitting the index at creation time.

For example:

				
					CREATE INDEX idx_large_table_lookup_key
ON large_table_a (lookup_key HASH, entity_id)
SPLIT INTO 120 TABLETS;
				
			

Or:

				
					CREATE INDEX idx_event_table_reference_id
ON event_table_b (reference_id HASH)
SPLIT INTO 120 TABLETS;
				
			

The exact number of tablets depends on cluster size, table size, workload shape, expected growth, CPU/core count, and how many other tablets already exist in the universe.

The goal is not to create as many tablets as possible.

The goal is to create enough tablets so the initial write, load, or backfill work can be distributed across the cluster from the beginning.

Important: When you create a secondary index in YugabyteDB, the index does not automatically inherit or match the tablet count of the base table. The index is its own distributed object with its own tablets. If the base table has many tablets, but the index is created without explicit pre-splitting, the index can still start under-sharded.

Why Auto-Splitting Alone May Not Be Enough

Automatic tablet splitting is valuable, and in many cases it is exactly what you want.

But for known large objects, especially during migration or index backfill, auto-splitting can be too late to prevent the initial bottleneck.

Automatic splitting happens only after a tablet reaches the configured split threshold. That means the system has to:

  • ● Detect the growth
  • ● Split the tablet
  • ● Redistribute future work over time
  • ● Let the cluster settle after the split

For steady growth, that may be perfectly acceptable.

For a massive initial load or index backfill, it can leave a painful hot shard window.

Practical Guidance

Use this rule of thumb:

If the table is already large, do not treat the index as small.

Before creating indexes on large populated tables, review:

  • ● Table size
  • ● Row count
  • ● Expected index entry count
  • ● Write and ingest rate
  • ● Existing tablet count
  • ● Number of nodes and cores
  • ● Whether the index key can distribute writes evenly
  • ● Whether the index needs hash sharding, range sharding, or a different key order

For large tables, it is often better to explicitly pre-split the index instead of waiting for automatic splitting to react.

Rule of thumb: If the table is already large, do not treat the index as small. Review and pre-split large secondary indexes before the load, migration, or backfill begins.

Review Both Tables and Indexes

When reviewing schema design, do not stop at the base tables.

Also review the tablet counts and design of the indexes.

Look for cases where:

  • ● The base table has many tablets
  • ● A large secondary index has very few tablets
  • ● A hot node owns leaders for the busiest tablets
  • ● Performance Advisor reports hot shards
  • ● Write-heavy indexes are concentrated on a small number of tablets

Do Not Forget Index Key Design

Pre-splitting helps distribute tablets, but it does not fix a poor sharding key.

For example, an index on a low-cardinality column such as a status, boolean flag, category type, or date bucket can still concentrate writes if most rows land on the same values.

Before pre-splitting, also ask:

  • ● Is the leading index column selective enough?
  • ● Are many rows using the same value?
  • ● Are many indexed values NULL?
  • ● Is this a monotonically increasing key?
  • ● Would a different index column order distribute writes better?
  • ● Should this be hash-sharded or range-sharded?

Pre-splitting gives YugabyteDB more tablets to work with.

A good sharding key gives YugabyteDB a better chance of using those tablets evenly.

Migration Tip

During PostgreSQL migrations, this is easy to miss.

PostgreSQL indexes live inside one database instance. YugabyteDB indexes are distributed across tablets.

So when moving a large PostgreSQL table into YugabyteDB, do not only review the table DDL…. review the index DDL too.

Large indexes may need explicit SPLIT INTO clauses before data loading or index backfill begins.

Final Takeaway

Automatic tablet splitting is a great safety net, but it should not be your only plan for known large tables or indexes.

If you are creating an index on a table that already has hundreds of millions or billions of rows, pre-split the index at creation time. Otherwise, the index backfill can concentrate writes on too few tablets, create hot shards, and leave the cluster waiting for automatic splitting to catch up.

Pre-split large indexes up front so the work is distributed from the beginning.

Have Fun!

"God has left Africa..." but he definitely blessed Hawaii. 🙌 Stoked to share this picture from our tour at Kualoa Ranch. My wife tolerated me geeking out over the filming locations for Tears of the Sun, which is easily one of my favorite movies of all time. Incredible history, even better views! 🚁🥾