Turning a Slow ORDER BY ... LIMIT Query into an Index-Only Scan

A common tuning pattern in YugabyteDB is finding a query that looks simple, has indexes available, and still performs inconsistently.

This often happens when the query filters on several columns, orders by a primary identifier, and fetches only a small batch of rows.

For example:

				
					SELECT r.id
FROM place_records r
WHERE r.processing_status IN ('PENDING')
  AND r.processed_entity_id IS NULL
  AND r.category IN ('Coffee Shop')
  AND r.locality = 'LargeCity'
  AND r.region = 'ExampleState'
  AND r.country_code = 'US'
ORDER BY r.id
LIMIT 100;

At first glance, this query looks easy to optimize. It has equality filters, a small LIMIT, and an ORDER BY.

But the index choice matters a lot.

The Starting Point

Assume the table has several single-column indexes:

				
					CREATE INDEX idx_place_records_locality
ON place_records (locality HASH);

CREATE INDEX idx_place_records_region
ON place_records (region HASH);

CREATE INDEX idx_place_records_country_code
ON place_records (country_code HASH);

CREATE INDEX idx_place_records_category
ON place_records (category HASH);

CREATE INDEX idx_place_records_processing_status
ON place_records (processing_status HASH);

CREATE INDEX idx_place_records_processed_entity_id
ON place_records (processed_entity_id HASH);

When the query runs, the planner may choose the locality index:

				
					Index Scan using idx_place_records_locality on place_records
  Index Cond: locality = 'LargeCity'
  Storage Filter:
    processed_entity_id IS NULL
    processing_status = 'PENDING'
    category = 'Coffee Shop'
    region = 'ExampleState'
    country_code = 'US'
Sort Key: id
Sort Method: quicksort

That plan is not necessarily wrong.

What is a Storage Filter? In YugabyteDB, a Storage Filter means the remaining predicates are being pushed down to DocDB, the distributed storage layer. The index is used to find the first set of candidate rows, and then DocDB applies the additional filters before sending results back to the YSQL query layer. This is much better than filtering everything only in the SQL layer, but it can still be expensive if the chosen index produces a large number of candidate rows.

The planner chose the index it believed would eliminate the most rows up front. A specific locality is often more selective than a country, region, or broad category.

But there is a catch.

The “Small Town vs. Big City” Problem

The same query can perform very differently depending on the value passed into the predicate:

● For a small locality, the index scan might only read a few dozen rows. The remaining filters are cheap, and sorting a tiny result set is trivial.
● For a large locality, the same plan may need to scan thousands or hundreds of thousands of rows before applying the remaining filters.

That is the problem.

The index is selective for some values, but not selective enough for others.

Tuning reminder: A plan can be technically reasonable and still be too slow for specific values. Data distribution matters. A predicate like locality = 'SmallTown' and locality = 'LargeCity' may use the same index, but the amount of work behind that index scan can be very different.

Why the Sort Still Happens

The query includes:

				
					ORDER BY r.id
LIMIT 100

But the index on locality is a hash index:

				
					CREATE INDEX idx_place_records_locality
ON place_records (locality HASH);

A hash index helps find matching locality values, but it does not return rows ordered by id.

So YugabyteDB has to:

1. Find all rows matching the hashed locality value.
2. Apply the remaining filters.
3. Sort the surviving rows by id.
4. Return the first 100 rows.

For small result sets, this is fine.

For large result sets, this can become expensive.

First Improvement: Use a Composite Index

A better index should match more of the query predicates.

For example:

				
					CREATE INDEX idx_place_records_geo_category
ON place_records (
    (country_code, region, locality, category) HASH
);

This lets YugabyteDB find rows using the combined filter values instead of starting from only one column.

That is a big improvement because the index lookup becomes more targeted.

Instead of finding every row in a large locality and then filtering by country, region, and category, the database can directly locate rows matching the combined key.

Better Improvement: Make It Covering

The query only needs id, but the filters also reference:

				
					processing_status
processed_entity_id

If these columns are not available in the index, YugabyteDB may need to read from the base table to evaluate the rest of the query.

A covering index can avoid that extra table lookup:

				
					CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_place_records_geo_category_cover
ON place_records (
    (country_code, region, locality, category) HASH
)
INCLUDE (
    processing_status,
    processed_entity_id,
    id
);

With this shape, the plan can become an index-only scan:

				
					Index Only Scan using idx_place_records_geo_category_cover on place_records
  Index Cond:
    country_code = 'US'
    region = 'ExampleState'
    locality = 'LargeCity'
    category = 'Coffee Shop'
  Storage Filter:
    processed_entity_id IS NULL
    processing_status = 'PENDING'
Heap Fetches: 0

That is a major win in a distributed database.

Avoiding extra base-table reads can reduce network round trips, storage reads, and latency.

Why this helps: An index-only scan means YugabyteDB can answer the query from the index itself. When Heap Fetches: 0, the database did not need to jump back to the base table to fetch additional columns.

But There Is Still One Missing Piece

The covering index above helps a lot, but it still does not fully satisfy:

				
					ORDER BY id
LIMIT 100

Why?

Because id is in the INCLUDE list.

Included columns are stored as payload in the index. They are available to the query, but they are not part of the index ordering.

That means YugabyteDB may still need to scan all matching index rows, sort them by id, and then apply the LIMIT.

For a few hundred matching rows, that is probably fine.

For tens of thousands of matching rows, that can still be expensive.

Best Version: Hash the Filters, Sort by the `ORDER BY` Column, Include the Rest

To optimize both filtering and ordering, put the equality predicates in the hash key, put the ORDER BY column in the index key as an ordered column, and include the remaining filter columns.

				
					CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_place_records_geo_category_id_cover
ON place_records (
    (country_code, region, locality, category) HASH,
    id ASC
)
INCLUDE (
    processing_status,
    processed_entity_id
);

This index does three important things:

Index Part	Purpose
`(country_code, region, locality, category) HASH`	Targets the exact combination of equality predicates.
`id ASC`	Stores matching rows in the same order requested by `ORDER BY id`.
`INCLUDE (processing_status, processed_entity_id)`	Allows the remaining filters to be evaluated from the index without fetching the base table row.

This gives the database a much better path:

1. Find the exact geo/category group using the compound hash key.
2. Read rows already ordered by id.
3. Apply the remaining filters from the index.
4. Stop as soon as 100 matching rows are found.

That last part is important.

When the index order matches the ORDER BY, the database has a chance to stop early instead of scanning and sorting every matching row.

Related YugabyteDB Tip: This tuning pattern builds on the same index design principles covered in Designing the Best Index for a Query in YugabyteDB. The best index is usually not the one that optimizes only filtering, sorting, or covering behavior in isolation. It is the one that supports the full access pattern of the query.

Why Not Just Put Everything in the Hash Key?

You might be tempted to do this:

				
					CREATE INDEX idx_place_records_all_filters
ON place_records (
    (country_code, region, locality, category, processing_status, processed_entity_id, id) HASH
);

That is usually not what you want.

Hashing id destroys the useful ordering for:

				
					ORDER BY id
LIMIT 100

For this query pattern, id should be an ordered clustering column, not part of the hash key.

Use the hash key for equality-based lookup.

Use ordered columns for sorting and range access.

Use included columns for extra filter or projection columns that do not need to control ordering.

What About a Partial Index?

If this query always looks for unprocessed rows, a partial index may be even smaller and more efficient:

				
					CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_place_records_pending_geo_category_id
ON place_records (
    (country_code, region, locality, category) HASH,
    id ASC
)
WHERE processed_entity_id IS NULL
  AND processing_status = 'PENDING';

This can be a great option for queue-like workloads where the application repeatedly asks:

Give me the next 100 pending records for this location and category.

The tradeoff is that the index only helps queries matching that predicate.

Indexing tradeoff: A partial index can be smaller and faster for a specific workload, but it is less general. A full covering index can help more query variants, but it costs more storage and adds more write overhead.

Full Index Comparison

Index Design	What It Helps	What It Misses
Single-column index on `locality`	Good for selective localities.	Can scan too many rows for large localities.
Compound hash index on geo/category columns	Targets the equality predicates more directly.	May still need base table reads or sorting.
Covering index with `INCLUDE`	Can enable an index-only scan.	Included columns do not provide ordering.
Compound hash key + `id ASC` + included filters	Supports targeted lookup, index-only filtering, ordered results, and early exit with `LIMIT`.	Adds write overhead and should be validated against workload frequency.

Final Recommended Index

For this query pattern, the best general-purpose index is:

				
					CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_place_records_geo_category_id_cover
ON place_records (
    (country_code, region, locality, category) HASH,
    id ASC
)
INCLUDE (
    processing_status,
    processed_entity_id
);

If the query always targets pending/unprocessed records, consider the smaller partial-index version:

				
					CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_place_records_pending_geo_category_id
ON place_records (
    (country_code, region, locality, category) HASH,
    id ASC
)
WHERE processed_entity_id IS NULL
  AND processing_status = 'PENDING';

Final Takeaway

When tuning a YugabyteDB query with multiple equality filters, ORDER BY, and LIMIT, do not only ask which single column is most selective.

Ask whether the index supports the whole access pattern.

For this query, the winning index design is:

				
					Equality filters -> HASH key
ORDER BY column  -> ASC/DESC clustering key
Extra filters    -> INCLUDE or partial-index predicate

That combination can turn a scan/filter/sort plan into a targeted index-only scan that reads fewer rows, avoids extra table fetches, and can stop early once the LIMIT is satisfied.

Have Fun!

While in Hawaii for a business retreat, I had the opportunity to take a helicopter tour with a few team members and see some truly spectacular views of the islands, including Maui and Molokai.

We flew past incredible coastlines, lush valleys, and some of the largest waterfalls in Hawaii. I’ll admit, my wife had to convince me to go… but I’m so glad she did.

A huge thank you to YugabyteDB for sponsoring such an unforgettable experience, and to the incredible pilots at Maverick Helicopters for making the flight safe, smooth, and memorable.

And fittingly, given the name Maverick, the music from Top Gun played through our headsets as we took off from the airport. Pretty intense. 🌺🚁🎶