Why Your Index Scan Still Reads Thousands of Rows

Common questions engineers search for:

  • Why is my indexed query still scanning thousands of rows?
  • Why does PostgreSQL / YugabyteDB scan 200k rows even with an index?
  • Why doesn’t LIMIT stop an index scan?

This tip explains exactly why that happens and how to design indexes that avoid it.

One of the most confusing moments in query tuning is when you see something like this in an execution plan:

    Index Scan
    Rows scanned: 200,000
    Rows returned: 100

Wait… what? You created the index specifically to make the query faster, yet the database still scans hundreds of thousands of rows.

This is one of the most common indexing misunderstandings.

An index does not guarantee that only a few rows are scanned.

It only determines where the database begins reading.

This difference is critical when designing effective indexes.

To see why, let’s reproduce the scenario with a simple dataset. We’ll create a table with one million rows, add a basic index on tenant_id, and run a query that retrieves recent activity for a tenant.

Even though the query uses the index, we’ll see that YugabyteDB still scans hundreds of thousands of rows before returning the first 100. Once we understand why this happens, we’ll design a better index in the next tip that allows the database to stop scanning almost immediately.

🔎 Key Insight

Indexes only control where scanning starts. If the index does not include enough filtering columns, the database may still need to scan very large ranges.
🧭 The Index Tuning Journey

This series shows how index tuning usually happens in stages. Each tip removes one layer of wasted work so YugabyteDB can answer the query more efficiently.

Tip | Focus                                                      | What Improves
----|------------------------------------------------------------|----------------------------------------------------------
1   | Choose the right column order in an index                  | The index matches the query shape better
2   | Why your index scan still reads thousands of rows          | Shows why index usage alone is not enough
3   | How to eliminate “Rows Removed by Filter”                  | More filtering happens in the index before table reads
4   | How IN() predicates create multiple index ranges           | Multiple predicate combinations create distinct index ranges
5   | How ORDER BY LIMIT can change the best index in YugabyteDB | Sort work can be reduced or eliminated
6   | How covering indexes eliminate table reads                 | The query may avoid table access entirely
7   | Designing the best index for a query                       | All tuning ideas come together into one practical design

Step 1: Create a Realistic Table

For this demo we’ll simulate a multi-tenant activity logging system.

This pattern appears in many applications:

  • SaaS platforms
  • observability pipelines
  • audit logging systems
  • analytics ingestion
    DROP SCHEMA IF EXISTS app CASCADE;

    CREATE SCHEMA app;

    CREATE TABLE app.activity_log (
        id UUID PRIMARY KEY,
        tenant_id UUID NOT NULL,
        category TEXT NOT NULL,
        status TEXT NOT NULL,
        created_at TIMESTAMPTZ NOT NULL,
        updated_at TIMESTAMPTZ,
        user_id UUID,
        source_id UUID
    );

Step 2: Create the “Obvious” Index

Most engineers begin with a simple index on the main filtering column.

    CREATE INDEX activity_tenant_idx
    ON app.activity_log (tenant_id);

This makes sense because most queries filter by tenant.

Step 3: Generate a Large Dataset

We will insert 1 million rows.

One tenant will represent a very active account, generating a large portion of the data.

This distribution helps demonstrate the scanning problem clearly.

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    INSERT INTO app.activity_log
    SELECT
      gen_random_uuid(),

      -- ~20% of rows go to one "hot" tenant
      CASE
        WHEN random() < 0.20
        THEN '11111111-1111-1111-1111-111111111111'
        ELSE gen_random_uuid()
      END,

      -- random category and status values
      (ARRAY['LOGIN', 'EXPORT', 'IMPORT', 'SYNC', 'ALERT'])[floor(random() * 5 + 1)],
      (ARRAY['PENDING', 'RUNNING', 'FAILED', 'COMPLETE'])[floor(random() * 4 + 1)],

      -- created_at spread over the last 30 days
      NOW() - interval '30 days' * random(),
      NOW(),

      gen_random_uuid(),
      gen_random_uuid()
    FROM generate_series(1, 1000000);

    ANALYZE app.activity_log;

After loading the data:

  • ~200k rows belong to one tenant
  • ~800k rows belong to other tenants
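To verify the skew after loading, you can run a quick sanity-check count (exact numbers vary per run, since tenants are assigned randomly):

```sql
-- How many rows landed in the hot tenant vs. the table overall?
SELECT
  count(*) FILTER (WHERE tenant_id = '11111111-1111-1111-1111-111111111111')
    AS hot_tenant_rows,
  count(*) AS total_rows
FROM app.activity_log;
```

With a 20% probability per row, hot_tenant_rows should land near 200,000.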

Step 4: Run a Typical Query

Imagine an application retrieving recent activity for a tenant.

    SELECT *
    FROM app.activity_log
    WHERE tenant_id = '11111111-1111-1111-1111-111111111111'
      AND category IN ('LOGIN', 'EXPORT', 'SYNC')
      AND status IN ('PENDING', 'RUNNING')
    ORDER BY created_at
    LIMIT 100;

This pattern is extremely common:

  • a tenant filter
  • plus category and status filters
  • plus ordered pagination (ORDER BY … LIMIT)

Step 5: Examine the Execution Plan

With the table statistics up to date, we can look at the real execution plan.

    EXPLAIN (ANALYZE, DIST)
    SELECT *
    FROM app.activity_log
    WHERE tenant_id = '11111111-1111-1111-1111-111111111111'
      AND category IN ('LOGIN', 'EXPORT', 'SYNC')
      AND status IN ('PENDING', 'RUNNING')
    ORDER BY created_at
    LIMIT 100;

Typical output will look similar to this:

 Limit  (cost=62590.81..62591.06 rows=100 width=93) (actual time=1365.248..1365.288 rows=100 loops=1)
   ->  Sort  (cost=62590.81..62741.59 rows=60313 width=93) (actual time=1365.246..1365.266 rows=100 loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort  Memory: 48kB
         ->  Index Scan using activity_tenant_idx on activity_log  (cost=86.88..60285.69 rows=60313 width=93) (actual time=11.728..1349.666 rows=60179 loops=1)
               Index Cond: (tenant_id = '11111111-1111-1111-1111-111111111111'::uuid)
               Storage Filter: ((status = ANY ('{PENDING,RUNNING}'::text[])) AND (category = ANY ('{LOGIN,EXPORT,SYNC}'::text[])))
               Storage Table Read Requests: 196
               Storage Table Read Execution Time: 1208.558 ms
               Storage Table Read Ops: 196
               Storage Table Rows Scanned: 200423
               Storage Index Read Requests: 196
               Storage Index Read Execution Time: 19.530 ms
               Storage Index Read Ops: 196
               Storage Index Rows Scanned: 200423
 Planning Time: 23.677 ms
 Execution Time: 1365.402 ms
 Storage Read Requests: 392
 Storage Read Execution Time: 1228.088 ms
 Storage Read Ops: 392
 Storage Rows Scanned: 400846
 Catalog Read Requests: 105
 Catalog Read Execution Time: 37.418 ms
 Catalog Read Ops: 105
 Catalog Write Requests: 0
 Storage Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 1265.506 ms
 Peak Memory Usage: 725 kB

At first glance, this may look fine because YugabyteDB is using the index:

    Index Scan using activity_tenant_idx

But the important lines are these:

    Storage Index Rows Scanned: 200423
    Storage Table Rows Scanned: 200423

To return just 100 rows, YugabyteDB scanned more than 200,000 index entries and 200,000 table rows.

That is the core problem.

The index we created is only on:

    (tenant_id)

So YugabyteDB can efficiently locate the range for one tenant, but that tenant still owns about 200,000 rows in this dataset. The index narrows the search from the full table down to the tenant’s slice, but it cannot narrow it any further.
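You can confirm both sizes directly with a count query (a sanity-check sketch; exact counts vary because the data is random, but roughly 3 of 5 categories times 2 of 4 statuses, about 30% of the tenant’s ~200k rows, matches the ~60k rows seen in the plan):

```sql
-- Count the tenant's slice and the subset matching the extra predicates.
SELECT
  count(*) AS tenant_rows,
  count(*) FILTER (
    WHERE category IN ('LOGIN', 'EXPORT', 'SYNC')
      AND status IN ('PENDING', 'RUNNING')
  ) AS matching_rows
FROM app.activity_log
WHERE tenant_id = '11111111-1111-1111-1111-111111111111';
```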

The additional predicates:

    category IN ('LOGIN', 'EXPORT', 'SYNC')
    status   IN ('PENDING', 'RUNNING')

are not part of the index key, so they appear here as a storage filter:

    Storage Filter: ((status = ANY (...)) AND (category = ANY (...)))

And that leads to the second problem:

    ->  Sort
          Sort Key: created_at

Because the rows are not already coming back in the right order, YugabyteDB has to sort those 60,179 matching rows before it can finally apply:

    LIMIT 100

So the actual workflow is:

  1. use the index to find the tenant’s range
  2. scan about 200k rows
  3. filter that down to about 60k rows
  4. sort those rows by created_at
  5. return the first 100

That is why the query still takes about 1.4 seconds, even though it is using an index.

🔎 Key Insight

The index helped YugabyteDB find the correct tenant. But it did not help it find the first 100 rows in created_at order.

Because of that, the database still performs a large scan, a large filter, and a sort before applying LIMIT.

Why This Query Is Still Slow

Even though the query uses an index, it still performs three expensive operations:

  • scanning a large range of rows for the tenant
  • filtering thousands of rows that do not match the query
  • sorting the remaining rows before applying LIMIT

In this example, YugabyteDB scanned more than 200,000 rows just to return 100 rows.

The index helped narrow the search to the correct tenant, but it did not help the database find the rows in the correct order.

As a result, the database still had to read a large range, apply filters, and perform a sort.

This is a very common indexing mistake in SQL databases. An index that supports filtering does not necessarily support efficient pagination queries.

Coming Next

In the next YugabyteDB Tip, we will redesign the index so that the database can:

  • return rows already ordered by created_at
  • stop scanning once it finds 100 rows
  • avoid the large scan and the sort

That change reduces the work from:

    200,000 rows scanned

to roughly:

    100 rows scanned

and drops query time from over a second to just a few milliseconds.
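As a preview, the redesigned index will likely need the query’s filtering columns followed by the sort column. One plausible shape (an assumption here; the exact design is worked out in the next tip):

```sql
-- Hypothetical composite index (assumed shape, refined in the next tip):
-- filtering columns first, then created_at so matching rows arrive
-- pre-sorted and the scan can stop after 100 matches.
CREATE INDEX activity_tenant_cat_status_created_idx
ON app.activity_log (tenant_id, category, status, created_at);
```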

Have Fun!

📍 Day 1 of Sales Kick Off at the Oceano Hotel & Spa, Half Moon Bay Harbor.