Index JSONB Fields Without STORED Generated Column Write Amplification

When working with JSONB in YugabyteDB, it is common to promote frequently queried JSON fields into STORED generated columns.

That pattern works, especially when the extracted value is part of your logical table model.

But if the only reason you are creating the generated column is so that you can filter or join on a JSONB field, there may be a lighter option:

Use an expression index directly on the JSONB path.

And for latency-sensitive read patterns, especially in multi-region deployments, consider making that expression index covering for the exact query shape.

The goal is not to make indexing free.

Secondary indexes still have write cost.

The goal is to avoid adding extra STORED generated columns to the base table when the index itself can own the precomputed JSONB expression.

The Scenario

Suppose you have a high-ingest orders table that receives raw JSONB payloads.

Each JSON document contains a customer_id field, and you need to join or filter on that value.

				
					DROP TABLE IF EXISTS orders_stream;
DROP TABLE IF EXISTS customers;

CREATE TABLE customers (
    customer_id  TEXT PRIMARY KEY,
    tier         TEXT NOT NULL,
    discount_pct INT NOT NULL
);

CREATE TABLE orders_stream (
    order_id UUID PRIMARY KEY,
    payload  JSONB NOT NULL
);

INSERT INTO customers VALUES
    ('CUST-101', 'PLATINUM', 15),
    ('CUST-102', 'GOLD', 10),
    ('CUST-103', 'SILVER', 5);

INSERT INTO orders_stream VALUES
    ('11111111-1111-1111-1111-111111111111',
     '{"customer_id": "CUST-101", "total": 250.00, "items": 5}'::jsonb),
    ('22222222-2222-2222-2222-222222222222',
     '{"customer_id": "CUST-102", "total": 125.00, "items": 2}'::jsonb),
    ('33333333-3333-3333-3333-333333333333',
     '{"customer_id": "CUST-103", "total": 75.00, "items": 1}'::jsonb);

The Common Approach: STORED Generated Column

A common way to make customer_id easier to query is to promote it into a STORED generated column.

				
					DROP TABLE IF EXISTS orders_stream_generated;

CREATE TABLE orders_stream_generated (
    order_id UUID PRIMARY KEY,
    payload  JSONB NOT NULL,

    customer_id TEXT GENERATED ALWAYS AS
        (payload ->> 'customer_id') STORED
);

CREATE INDEX orders_stream_generated_customer_id_idx
ON orders_stream_generated (customer_id);

This works.

The database can now index customer_id like a normal column.

But there is a tradeoff.

The generated value is physically stored in the base table. If you also index it, the value is stored again in the secondary index.

That means every insert has to:

1. Store the JSONB document.
2. Evaluate the generated expression.
3. Store the generated column value in the base table.
4. Maintain the secondary index.

And every update to the JSONB column has to re-evaluate the generated expression because the generated column depends on payload.

If you have several generated columns derived from the same JSONB document, they can all be recalculated when payload changes.

That is the generated-column write amplification problem.

Important: STORED generated columns are not bad. They are useful when the extracted value belongs in your logical schema. But if the generated column only exists to support filtering or joins on a JSONB field, an expression index may be a cleaner option.

Option 1: Expression Index on the JSONB Path

Instead of adding customer_id as a stored column, you can index the JSONB expression directly.

				
					CREATE INDEX orders_stream_customer_id_expr_idx
ON orders_stream ((payload ->> 'customer_id'));

Now YugabyteDB has a secondary index whose key is the extracted value of:

				
					payload ->> 'customer_id'

That means this query can use the expression index when the optimizer decides it is cheaper than scanning the table:

				
					EXPLAIN (ANALYZE, DIST)
SELECT order_id
FROM orders_stream
WHERE payload ->> 'customer_id' = 'CUST-101';

But there is an important demo detail.

With only three rows in the table, YugabyteDB may still choose a sequential scan.

Example:

				
					Seq Scan on orders_stream
  Storage Filter: ((payload ->> 'customer_id'::text) = 'CUST-101'::text)
  Storage Table Rows Scanned: 3

That does not mean the expression index is broken.

It just means the table is tiny. Scanning three rows is cheaper than using an index.

Demo Note: On very small tables, YugabyteDB may choose a sequential scan even when an expression index exists. That is expected. Load a more realistic number of rows and run ANALYZE before deciding whether the index is useful.

Load a More Realistic Data Set

To make the demo more realistic, load 100,000 rows and gather statistics.

				
					INSERT INTO orders_stream
SELECT
    gen_random_uuid(),
    jsonb_build_object(
        'customer_id',
        CASE
            WHEN gs % 1000 = 0 THEN 'CUST-101'
            WHEN gs % 3 = 0 THEN 'CUST-102'
            ELSE 'CUST-103'
        END,
        'total', round((random() * 500)::numeric, 2),
        'items', (1 + floor(random() * 10))::int
    )
FROM generate_series(1, 100000) gs;

ANALYZE orders_stream;
ANALYZE customers;

Now the expression index becomes useful because CUST-101 is selective.

Example:

				
					yugabyte=# EXPLAIN (ANALYZE, DIST)
yugabyte-# SELECT order_id
yugabyte-# FROM orders_stream
yugabyte-# WHERE payload ->> 'customer_id' = 'CUST-101';
                                                                       QUERY PLAN

------------------------------------------------------------------------------------------------------------------
---------------------------------------
 Index Scan using orders_stream_customer_id_expr_idx on orders_stream  (cost=88.18..114.68 rows=87 width=16) (actual time=1.435..1.504 rows=101 loops=1)
   Index Cond: ((payload ->> 'customer_id'::text) = 'CUST-101'::text)
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 0.713 ms
   Storage Table Read Ops: 1
   Storage Table Rows Scanned: 101
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 0.539 ms
   Storage Index Read Ops: 1
   Storage Index Rows Scanned: 101
 Planning Time: 7.632 ms
 Execution Time: 1.580 ms
 Storage Read Requests: 2
 Storage Read Execution Time: 1.252 ms
 Storage Read Ops: 2
 Storage Rows Scanned: 202
 Catalog Read Requests: 20
 Catalog Read Execution Time: 10.941 ms
 Catalog Read Ops: 20
 Catalog Write Requests: 0
 Storage Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 12.193 ms
 Peak Memory Usage: 64 kB
(24 rows)

Option 2: Covering Expression Index with `INCLUDE (order_id)`

If the query needs order_id, you can include it in the expression index.

				
					DROP INDEX IF EXISTS orders_stream_customer_id_expr_idx;

CREATE INDEX orders_stream_customer_id_expr_cover_idx
ON orders_stream ((payload ->> 'customer_id'))
INCLUDE (order_id);

Now run the join.

				
					EXPLAIN (ANALYZE, DIST)
SELECT
    o.order_id,
    c.tier,
    c.discount_pct
FROM orders_stream o
JOIN customers c
  ON (o.payload ->> 'customer_id') = c.customer_id
WHERE o.payload ->> 'customer_id' = 'CUST-101';

After loading 100,000 rows and running ANALYZE, the plan uses the JSONB expression index.

Example plan excerpt:

				
					Nested Loop
  ->  Index Scan using customers_pkey on customers c
        Index Cond: (customer_id = 'CUST-101'::text)

  ->  Index Scan using orders_stream_customer_id_expr_cover_idx on orders_stream o
        Index Cond: ((payload ->> 'customer_id'::text) = 'CUST-101'::text)
        Storage Table Rows Scanned: 101
        Storage Index Rows Scanned: 101

This is the important part:

				
					Index Scan using orders_stream_customer_id_expr_cover_idx

YugabyteDB is using the expression index on the JSONB path.

Instead of scanning all 100,000+ JSONB documents and evaluating payload ->> 'customer_id' for every row, it uses the precomputed expression value in the secondary index.

That is a big improvement.

But notice something important.

This is an Index Scan, not an Index Only Scan.

That means YugabyteDB used the index to find matching rows, but it still went back to the base table.

In the plan, you can see both index and table reads:

				
					Storage Table Read Requests: 1
Storage Index Read Requests: 1

So this index avoids a full table scan, but it does not fully avoid the base-table lookup.

Option 3: Fully Cover the Lookup with `INCLUDE (order_id, payload)`

To get an Index Only Scan for this lookup query, the index needs to contain everything the query needs.

In this case, that includes:

1. The indexed expression: payload ->> 'customer_id'
2. The projected column: order_id
3. The source JSONB column: payload

Create the index like this:

				
					DROP INDEX IF EXISTS orders_stream_customer_id_expr_cover_idx;

CREATE INDEX orders_stream_customer_id_expr_cover_idx
ON orders_stream ((payload ->> 'customer_id'))
INCLUDE (order_id, payload);

Now run the lookup query again:

				
					EXPLAIN (ANALYZE, DIST)
SELECT order_id
FROM orders_stream
WHERE payload ->> 'customer_id' = 'CUST-101';

Example plan:

				
					Index Only Scan using orders_stream_customer_id_expr_cover_idx on orders_stream
  Index Cond: (((payload ->> 'customer_id'::text)) = 'CUST-101'::text)
  Heap Fetches: 0
  Storage Index Read Requests: 1
  Storage Index Rows Scanned: 101

This is the cleanest read path in the demo. The query is satisfied from the index only. There are no base-table reads.

The key lines are:

				
					Index Only Scan using orders_stream_customer_id_expr_cover_idx
Heap Fetches: 0
Storage Index Read Requests: 1

Why Including `payload` Helped

At first, it may seem strange to include the full payload column when the query only returns order_id.

The query is:

				
					SELECT order_id
FROM orders_stream
WHERE payload ->> 'customer_id' = 'CUST-101';

But the predicate still references payload:

				
					payload ->> 'customer_id'

The index key stores the result of the expression, but YugabyteDB may still need the source column available to fully satisfy or recheck the expression without going back to the base table.

By including payload, the index has everything needed for this lookup query.

That is why the plan can become an Index Only Scan.

Multi-Region Angle

This matters even more in a multi-region deployment.

With the smaller covering index:

				
					DROP INDEX orders_stream_customer_id_expr_cover_idx;

CREATE INDEX orders_stream_customer_id_expr_cover_idx
ON orders_stream ((payload ->> 'customer_id'))
INCLUDE (order_id);

YugabyteDB can use the index, but the plan may still need to fetch the matching rows from the base table.

In a multi-region cluster, that base-table fetch may require an additional remote read if the base-table tablet leader is in another region.

With the larger covering index:

				
					CREATE INDEX orders_stream_customer_id_expr_cover_idx
ON orders_stream ((payload ->> 'customer_id'))
INCLUDE (order_id, payload);

the query can be answered from the index alone.

That can avoid an extra base-table lookup and reduce cross-region read latency, assuming the index access path is local or otherwise closer to the query execution region.

This can be a good pattern for read-heavy, latency-sensitive lookups.

Multi-region tip: In a multi-region deployment, avoiding the base-table lookup can be more valuable than it looks in a single-region demo. An Index Scan may still need to fetch rows from the main table, while an Index Only Scan can answer the query directly from the index. That can reduce cross-region reads for latency-sensitive queries.

But the Larger Covering Index Has a Cost

Including payload in the index is not free.

It means the JSONB document is stored twice:

1. Once in the base table.
2. Again in the secondary index.

That increases:

● Index size
● Write amplification
● Storage usage
● Compaction work
● Raft replication work
● Backfill time when the index is created
● Cost of updates when payload changes

So this is not automatically the best index for every high-ingest workload.

It is a read-optimized index.

Use it when the read benefit is worth the write and storage cost.

Important: Including the full payload column can turn the expression index into a true covering index for JSONB lookup queries, but it also copies the JSONB document into the index. This can be a good tradeoff for read-heavy, multi-region lookups with small payloads, but it may be too expensive for high-ingest workloads with large or frequently changing JSONB documents.

What the Demo Shows

Index Definition	Observed Plan	What It Means
`ON orders_stream ((payload ->> 'customer_id'))`	Tiny table still used `Seq Scan`	Expected for very small tables. The optimizer may decide a scan is cheaper.
`INCLUDE (order_id)`	`Index Scan`	Avoids the full table scan, but may still fetch rows from the base table.
`INCLUDE (order_id, payload)`	`Index Only Scan`	Can satisfy the lookup from the index only, with `Heap Fetches: 0`.

Write Path Reality Check

An expression index still has a write cost.

It is a secondary index, so YugabyteDB must maintain it on INSERT, UPDATE, and DELETE.

If the indexed JSONB expression changes, the index entry must change too.

For example:

				
					UPDATE orders_stream
SET payload = jsonb_set(payload, '{customer_id}', '"CUST-999"', true)
WHERE order_id = '11111111-1111-1111-1111-111111111111';

This changes the indexed expression:

				
					payload ->> 'customer_id'

That means YugabyteDB has to remove the old index entry and add a new one.

So the expression index does not eliminate secondary-index write cost.

It avoids the need to add an extra STORED generated column to the base table when the index itself can store the expression value.

Practical Rule of Thumb

Use a STORED generated column when:

● The extracted value is part of your logical schema.
● Applications need to query it as a normal column.
● The value is reused in many different ways.
● You are comfortable storing it in the base table.
● The JSONB document is not updated frequently, or the extra write cost is acceptable.

Use an expression index when:

● The extracted value mainly exists for filtering or joins.
● You do not need the value exposed as a regular column.
● You want to avoid widening the base table.
● You want the index to own the precomputed JSONB expression.

Use INCLUDE (order_id) when:

● You want to avoid scanning the full table.
● You need order_id available from the index.
● A base-table fetch is acceptable.
● The workload is still write-sensitive.

Use INCLUDE (order_id, payload) when:

● You want an Index Only Scan.
● The query is read-heavy and latency-sensitive.
● The JSONB payload is reasonably small.
● Avoiding the base-table lookup is worth the extra storage and write cost.
● The deployment is multi-region and remote base-table lookups are expensive.

Final Takeaway

When a JSONB field is only being promoted so it can be indexed, a STORED generated column may be unnecessary.

YugabyteDB can index the JSONB path directly:

				
					CREATE INDEX orders_stream_customer_id_expr_idx
ON orders_stream ((payload ->> 'customer_id'));

If you need to avoid scanning the base table, add the projected columns:

				
					CREATE INDEX orders_stream_customer_id_expr_cover_idx
ON orders_stream ((payload ->> 'customer_id'))
INCLUDE (order_id);

And if you need the lookup to be truly index-only, you may need to include the source JSONB column too:

				
					CREATE INDEX orders_stream_customer_id_expr_cover_idx
ON orders_stream ((payload ->> 'customer_id'))
INCLUDE (order_id, payload);

That last version can produce:

				
					Index Only Scan
Heap Fetches: 0
Storage Index Read Requests: 1

But it also duplicates the JSONB payload in the index.

So the best index depends on the workload.

For write-heavy ingest, keep the index as lean as possible.

For read-heavy, latency-sensitive, multi-region lookups, a larger covering expression index may be worth the extra write and storage cost.

Have Fun!

The YugabyteDB HQ has moved from Sunnyvale to… Sunnyvale. That’s right, same city, just a great new location:

100 Mathilda Place, Suite 250 Sunnyvale, CA 94086

I’m at the new office today, and I’m happy to report that the move included one very important piece of YugabyteDB history: the world map that adorned the old office for years.