Getting a row count in YugabyteDB seems simple:
```sql
SELECT COUNT(*) FROM transfers;
```
But in a distributed system, this is:

- ✅ A full-table scan across all tablets
- ✅ A cluster-wide operation
- ✅ Potentially slow and expensive for large tables
💡 Key Insight

In YugabyteDB, COUNT(*) is not just a local operation… it fans out across all nodes and tablets, making it significantly more expensive than in a single-node database.

⚡ The Classic Trick (Hash Sampling)

```sql
SELECT 1000 * count(*)
FROM transfers
WHERE yb_hash_code(id) < 65536 / 1000;
```
This gives you a fast estimate: only rows whose hash falls in the first 65536 / 1000 = 65 buckets (roughly 0.1% of the hash space) are counted, and the result is scaled back up by 1000.
🧠 Why This Works

YugabyteDB distributes rows across a hash space (0 → 65535).

- ✅ Each row is assigned a hash bucket via `yb_hash_code()`
- ✅ Data is spread across tablets based on that hash
- ✅ Sampling a small portion of the hash space gives a representative subset
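The arithmetic behind the trick can be sanity-checked off-cluster. In the sketch below, MD5 is a hypothetical stand-in for `yb_hash_code()` (YugabyteDB uses its own hash function); the point is only that counting rows in a small slice of a 16-bit hash space and scaling back up recovers the total within a few percent.

```python
import hashlib

def hash_bucket(row_id: int) -> int:
    """Map a row id to a bucket in a 16-bit hash space (0..65535).
    MD5 here is a stand-in for yb_hash_code(), not the real function."""
    return int.from_bytes(hashlib.md5(str(row_id).encode()).digest()[:2], "big")

N = 1_000_000                 # pretend the table has one million rows
DIVISOR = 1000
CUTOFF = 65536 // DIVISOR     # integer division, as in the SQL predicate -> 65

# Count only rows whose bucket falls below the cutoff, then scale back up
sample = sum(1 for row_id in range(N) if hash_bucket(row_id) < CUTOFF)
estimate = sample * DIVISOR

print(f"true={N} estimate={estimate} error={abs(estimate - N) / N:.2%}")
```

On a real cluster the win is that only the tablets owning those 65 buckets are read; here the simulation just shows the estimate lands within a few percent of the truth.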
🔍 What's Really Happening

Instead of scanning every tablet, YugabyteDB only touches the tablets responsible for the selected hash range, dramatically reducing I/O and network cost.

📌 Hash sampling is designed for large tables. On very small tables, a sample may legitimately return zero rows, especially with a divisor like 1000 or 10000. In those cases, falling back to an exact COUNT(*) is often the best choice.
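The zero-sample problem is easy to reproduce with the same off-cluster model (MD5 again as a hypothetical stand-in for `yb_hash_code()`): with 500 rows and a divisor of 1000, the expected sample is about half a row, so scaling it up is meaningless and an exact count is the only trustworthy answer.

```python
import hashlib

def hash_bucket(row_id: int) -> int:
    # MD5 as a stand-in for yb_hash_code(): bucket in 0..65535
    return int.from_bytes(hashlib.md5(str(row_id).encode()).digest()[:2], "big")

rows = list(range(500))       # a small table
DIVISOR = 1000
CUTOFF = 65536 // DIVISOR     # 65 buckets, ~0.1% of the hash space

sample = sum(1 for r in rows if hash_bucket(r) < CUTOFF)

# Expected sample size is len(rows) / DIVISOR ~ 0.5 rows: far too small.
# Mirror the article's advice: below a minimum sample, fall back to COUNT(*).
if sample < 10:
    estimate = len(rows)      # exact count
else:
    estimate = sample * DIVISOR

print(f"sample={sample} estimate={estimate}")
```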
⚠️ The Gotcha (Composite Hash Keys)

Most table primary keys have only one column as the hash key: `yb_hash_code(id)`
But consider this table:
```sql
CREATE TABLE test(
    c1 INT,
    c2 INT,
    c3 INT,
    PRIMARY KEY((c1, c2) HASH, c3 ASC)
);
```
🔍 The hash distribution is based on `(c1, c2)`, not just `c1`.
⚠️ Important

For composite hash keys, you must use the full HASH portion of the primary key: `yb_hash_code(c1, c2)`

Using only one column can lead to misleading estimates.
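The failure mode is easy to see off-cluster. In the sketch below, MD5 over the text of the key columns is a hypothetical stand-in for `yb_hash_code(...)`. With only five distinct `c1` values, hashing `c1` alone puts every row into one of five buckets, so the sample includes each `c1` group entirely or not at all; hashing the full `(c1, c2)` pair samples evenly.

```python
import hashlib

def fake_yb_hash(*cols) -> int:
    # Hypothetical stand-in for yb_hash_code(...): bucket in 0..65535
    key = "|".join(map(str, cols)).encode()
    return int.from_bytes(hashlib.md5(key).digest()[:2], "big")

DIVISOR = 100
CUTOFF = 65536 // DIVISOR     # 655

# 100,000 rows: only 5 distinct values of c1, many values of c2
rows = [(c1, c2) for c1 in range(5) for c2 in range(20_000)]

# Wrong: hash only c1 -> each c1 group is sampled entirely or not at all,
# so the "estimate" is always a multiple of 20,000 * DIVISOR
wrong = sum(1 for c1, _ in rows if fake_yb_hash(c1) < CUTOFF) * DIVISOR

# Right: hash the full HASH portion (c1, c2) -> a representative sample
right = sum(1 for c1, c2 in rows if fake_yb_hash(c1, c2) < CUTOFF) * DIVISOR

print(f"true={len(rows)} full-key={right} c1-only={wrong}")
```

The full-key estimate lands near 100,000, while the `c1`-only "estimate" can only ever be 0, 2,000,000, 4,000,000, and so on.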
🧬 The Smarter Approach

We can turn this into a reusable helper that:

- ✅ Uses `pg_class.reltuples` when available (fastest)
- ✅ Falls back to live hash sampling when needed
- ✅ Falls back to an exact row count when the sample is too small to trust
- ✅ Automatically detects composite hash keys
- ✅ Lets you tune sampling with a divisor
🧰 The Function (Auto-Detects Composite Hash Keys)

```sql
CREATE OR REPLACE FUNCTION get_table_estimate(
    target_table text,
    sample_divisor integer DEFAULT 1000,
    prefer_reltuples boolean DEFAULT true
)
RETURNS bigint
LANGUAGE plpgsql
AS $$
DECLARE
    stats_estimate real;
    sample_count bigint;
    indexdef text;
    hash_cols_text text;
    hash_expr text;
    exact_count bigint;
BEGIN
    IF sample_divisor <= 0 THEN
        RAISE EXCEPTION 'sample_divisor must be > 0';
    END IF;

    -- Try reltuples first (instant, but only as fresh as the last ANALYZE)
    IF prefer_reltuples THEN
        SELECT c.reltuples
          INTO stats_estimate
          FROM pg_class c
         WHERE c.oid = target_table::regclass;

        IF stats_estimate IS NOT NULL AND stats_estimate >= 0 THEN
            RETURN stats_estimate::bigint;
        END IF;
    END IF;

    -- Get the primary key definition
    SELECT pg_get_indexdef(i.indexrelid)
      INTO indexdef
      FROM pg_index i
     WHERE i.indrelid = target_table::regclass
       AND i.indisprimary;

    IF indexdef IS NULL THEN
        RAISE EXCEPTION 'Table % has no primary key', target_table;
    END IF;

    -- Extract the HASH columns. Composite keys render as ((c1, c2) HASH, ...),
    -- single-column keys as (id HASH, ...), so try both patterns.
    hash_cols_text := substring(indexdef FROM '\(\((.*?)\)\s+HASH');
    IF hash_cols_text IS NULL THEN
        hash_cols_text := substring(indexdef FROM '\(([^,()]+)\s+HASH');
    END IF;

    IF hash_cols_text IS NULL THEN
        RAISE EXCEPTION 'Could not extract HASH columns from: %', indexdef;
    END IF;

    hash_expr := format('yb_hash_code(%s)', hash_cols_text);

    EXECUTE format(
        'SELECT count(*) FROM %s WHERE %s < 65536 / %s',
        target_table::regclass,
        hash_expr,
        sample_divisor
    )
    INTO sample_count;

    -- If the sample is too small, an exact count is safer
    IF sample_count < 10 THEN
        EXECUTE format('SELECT count(*) FROM %s', target_table::regclass)
        INTO exact_count;
        RETURN exact_count;
    END IF;

    RETURN sample_count * sample_divisor;
END;
$$;
```
🧪 Examples

```sql
-- Default (fast + smart)
SELECT get_table_estimate('transfers');

-- More accurate (larger sample)
SELECT get_table_estimate('transfers', 100);

-- Force real-time sampling (ignore reltuples)
SELECT get_table_estimate('transfers', 1000, false);
```
💡 Why This Works So Well

This function adapts to your environment:

- ✅ Uses metadata when stats are fresh (near-zero cost)
- ✅ Falls back to live sampling when needed
- ✅ Correctly handles composite hash keys automatically
Accuracy vs Speed

| Method | Speed | Accuracy | Notes |
|---|---|---|---|
| `COUNT(*)` | Slow | 100% | Full scan across the table |
| `reltuples` | Instant | Depends on ANALYZE | May be stale |
| `yb_hash_code()` sampling | Fast | ~95–99% | Real-time estimate with low impact |
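The divisor is the accuracy knob, and the same off-cluster model (MD5 as a hypothetical stand-in for `yb_hash_code()`) shows one subtlety: because `65536 / divisor` uses integer division, very large divisors truncate the cutoff hard (65536 / 10000 = 6 buckets instead of 6.55), which adds a systematic undercount on top of the sampling noise.

```python
import hashlib

def hash_bucket(row_id: int) -> int:
    # MD5 as a stand-in for yb_hash_code(): bucket in 0..65535
    return int.from_bytes(hashlib.md5(str(row_id).encode()).digest()[:2], "big")

N = 500_000
buckets = [hash_bucket(r) for r in range(N)]   # hash every row once

errors = {}
for divisor in (10_000, 1_000, 100, 10):
    cutoff = 65536 // divisor                  # integer division, as in the SQL
    estimate = sum(1 for b in buckets if b < cutoff) * divisor
    errors[divisor] = abs(estimate - N) / N
    print(f"divisor={divisor:>6} cutoff={cutoff:>5} "
          f"estimate={estimate:>8} error={errors[divisor]:.2%}")
```

Smaller divisors read more buckets (more tablets, more I/O) but give tighter estimates; divisor 10 samples ~10% of the hash space and typically lands well under 1% error in this simulation.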
When to Use What
| Use Case | Best Option |
|---|---|
| Dashboard / monitoring | reltuples |
| Real-time estimate | yb_hash_code() sampling |
| Exact reporting | COUNT(*) |
🔗 Related YugabyteDB Tips
Want to go deeper on row-count strategies?
🌐 Big Picture

Row counting in YugabyteDB is a spectrum:

- ✅ Fastest: `reltuples` (metadata)
- ✅ Balanced: hash sampling (this tip)
- ✅ Exact + scalable: tablet-parallel techniques
🧾 Summary

- ✅ `COUNT(*)` is expensive in YugabyteDB because it scans all tablets across the cluster
- ✅ `yb_hash_code()` sampling provides a fast, low-impact way to estimate row counts in real time
- ✅ Composite hash keys must use the full HASH portion (for example, `yb_hash_code(c1, c2)`)
- ✅ `pg_class.reltuples` is the fastest option when statistics are fresh
- ✅ The helper function combines both approaches for a smart, adaptive solution

Have Fun!
