Instantly Estimate Row Counts (Smart, Fast, and Composite-Key Aware)

Getting a row count in YugabyteDB seems simple:

```sql
SELECT COUNT(*) FROM transfers;
```

But in a distributed system, this is:

• A full-table scan across all tablets
• A cluster-wide operation
• Potentially slow and expensive for large tables

💡 Key Insight

In YugabyteDB, COUNT(*) is not just a local operation… it fans out across all nodes and tablets, making it significantly more expensive than in a single-node database.

⚡ The Classic Trick (Hash Sampling)

```sql
SELECT 1000 * count(*)
FROM transfers
WHERE yb_hash_code(id) < 65536 / 1000;
```

This gives you a fast estimate by sampling a small portion of the data.

🧠 Why This Works

YugabyteDB distributes rows across a hash space (0 → 65535).

• Each row is assigned a hash bucket via yb_hash_code()
• Data is spread across tablets based on that hash
• Sampling a small portion of the hash space gives a representative subset

📘 What's Really Happening

Instead of scanning every tablet, YugabyteDB only touches the tablets responsible for the selected hash range, dramatically reducing I/O and network cost.

📌 Hash sampling is designed for large tables. On very small tables, a sample may legitimately return zero rows, especially with a divisor like 1000 or 10000. In those cases, falling back to an exact COUNT(*) is often the best choice.
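The scaling arithmetic is easy to sanity-check outside the database. Here is a small Python sketch (the sample count is hypothetical, and it assumes rows are spread uniformly across the hash space):

```python
HASH_SPACE = 65536  # yb_hash_code() values range from 0 to 65535

def estimate_rows(sample_count: int, divisor: int) -> int:
    """Scale a sampled count up to a whole-table estimate, mirroring
    SELECT <divisor> * count(*) ... WHERE yb_hash_code(...) < 65536 / <divisor>.
    """
    return sample_count * divisor

# With divisor 1000, SQL integer division keeps hash codes 0..64,
# i.e. 65 of the 65536 buckets -- slightly under 1/1000 of the space.
cutoff = HASH_SPACE // 1000
print(cutoff)               # -> 65
print(cutoff / HASH_SPACE)  # just under 1/1000

# Hypothetical: the sampled hash range held 10,000 rows
print(estimate_rows(10_000, 1000))  # -> 10000000
```

Because of the integer division, the sampled fraction is floor(65536 / divisor) / 65536 rather than exactly 1/divisor, so the estimate carries a small systematic undercount on top of the sampling noise.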

⚠️ The Gotcha (Composite Hash Keys)

Most table primary keys have only one column as the hash key:

```sql
yb_hash_code(id)
```

But consider this table:

```sql
CREATE TABLE test(
  c1 INT,
  c2 INT,
  c3 INT,
  PRIMARY KEY((c1, c2) HASH, c3 ASC)
);
```

👉 The hash distribution is based on (c1, c2), not just c1.

⚠️ Important
For composite hash keys, you must use the full HASH portion of the primary key:
yb_hash_code(c1, c2)
Using only one column can lead to misleading estimates.
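The reason is that a row's hash bucket is computed from both columns together. yb_hash_code() itself only runs inside YugabyteDB, but the idea can be sketched with a stand-in hash (zlib.crc32 here is purely illustrative, not Yugabyte's actual hash function):

```python
import zlib

def toy_hash_code(*cols) -> int:
    """Illustrative stand-in for yb_hash_code(): combine all of the
    key's HASH columns and map them into the 0..65535 hash space."""
    data = '|'.join(repr(c) for c in cols).encode()
    return zlib.crc32(data) % 65536

# For PRIMARY KEY((c1, c2) HASH, c3 ASC), the bucket depends on BOTH
# c1 and c2. Filtering on toy_hash_code(c1) alone compares against a
# value the storage layer never used to place the row.
print(toy_hash_code(1, 2))  # bucket for the full (c1, c2) hash key
print(toy_hash_code(1))     # a different computation entirely
```

Since the single-column predicate does not correspond to how tablets partition the data, it cannot be pushed down as a hash-range scan and the resulting sample is not grounded in the table's actual distribution.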

🧬 The Smarter Approach

We can turn this into a reusable helper that:

• Uses pg_class.reltuples when available (fastest)
• Falls back to live hash sampling when needed
• Falls back to an exact row count when the sample is too small to trust
• Automatically detects composite hash keys
• Lets you tune sampling with a divisor

🧰 The Function (Auto-Detects Composite Hash Keys)

```sql
CREATE OR REPLACE FUNCTION get_table_estimate(
    target_table text,
    sample_divisor integer DEFAULT 1000,
    prefer_reltuples boolean DEFAULT true
)
RETURNS bigint
LANGUAGE plpgsql
AS $$
DECLARE
    stats_estimate real;
    sample_count bigint;
    indexdef text;
    hash_cols_text text;
    hash_expr text;
    exact_count bigint;
BEGIN
    IF sample_divisor <= 0 THEN
        RAISE EXCEPTION 'sample_divisor must be > 0';
    END IF;

    -- Try reltuples first
    IF prefer_reltuples THEN
        SELECT c.reltuples
        INTO stats_estimate
        FROM pg_class c
        WHERE c.oid = target_table::regclass;

        -- reltuples is -1 (or 0) until ANALYZE has run,
        -- so only trust a strictly positive value
        IF stats_estimate IS NOT NULL AND stats_estimate > 0 THEN
            RETURN stats_estimate::bigint;
        END IF;
    END IF;

    -- Get the primary key definition
    SELECT pg_get_indexdef(i.indexrelid)
    INTO indexdef
    FROM pg_index i
    WHERE i.indrelid = target_table::regclass
      AND i.indisprimary;

    IF indexdef IS NULL THEN
        RAISE EXCEPTION 'Table % has no primary key', target_table;
    END IF;

    -- Extract HASH columns: composite keys look like "((c1, c2) HASH"
    hash_cols_text := substring(indexdef FROM '\(\((.*?)\)\s+HASH');

    -- Fall back to the single-column form, e.g. "(id HASH"
    IF hash_cols_text IS NULL THEN
        hash_cols_text := substring(indexdef FROM '\(([^(),\s]+)\s+HASH');
    END IF;

    IF hash_cols_text IS NULL THEN
        RAISE EXCEPTION
          'Could not extract HASH columns from: %',
          indexdef;
    END IF;

    hash_expr := format('yb_hash_code(%s)', hash_cols_text);

    EXECUTE format(
        'SELECT count(*) FROM %s WHERE %s < 65536 / %s',
        target_table::regclass,
        hash_expr,
        sample_divisor
    )
    INTO sample_count;

    -- If the sample is too small, an exact count is safer
    IF sample_count < 10 THEN
        EXECUTE format('SELECT count(*) FROM %s', target_table::regclass)
        INTO exact_count;

        RETURN exact_count;
    END IF;

    RETURN sample_count * sample_divisor;
END;
$$;
```
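The substring() call relies on the shape of pg_get_indexdef output. A quick way to sanity-check that pattern outside the database is to replay it in Python (the indexdef strings below are hypothetical examples of the composite and single-column forms):

```python
import re

# The same regex the function uses to pull out composite HASH columns
COMPOSITE_HASH = re.compile(r'\(\((.*?)\)\s+HASH')

# Hypothetical pg_get_indexdef output for the composite-key table above
composite = ('CREATE UNIQUE INDEX test_pkey ON public.test '
             'USING lsm ((c1, c2) HASH, c3 ASC)')

# Hypothetical output for a single-column hash key
single = ('CREATE UNIQUE INDEX transfers_pkey ON public.transfers '
          'USING lsm (id HASH)')

m = COMPOSITE_HASH.search(composite)
print(m.group(1))                      # -> c1, c2

# The composite pattern does not match the single-column "(id HASH" form,
# which is why that case needs its own pattern.
print(COMPOSITE_HASH.search(single))   # -> None
```

The non-greedy (.*?) stops at the first ") HASH", so trailing range-sharded columns like "c3 ASC" never leak into the captured column list.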
🧪 Examples

```sql
-- Default (fast + smart)
SELECT get_table_estimate('transfers');

-- More accurate (samples ~1% of the hash space)
SELECT get_table_estimate('transfers', 100);

-- Force real-time sampling (ignore reltuples)
SELECT get_table_estimate('transfers', 1000, false);
```
💡 Why This Works So Well

This function adapts to your environment:

• Uses metadata when statistics are fresh (near-zero cost)
• Falls back to live sampling when needed
• Correctly handles composite hash keys automatically

Accuracy vs Speed

| Method | Speed | Accuracy | Notes |
|---|---|---|---|
| COUNT(*) | Slow | 100% | Full scan across the table |
| reltuples | Instant | Depends on ANALYZE | May be stale |
| yb_hash_code() sampling | Fast | ~95–99% | Real-time estimate with low impact |

When to Use What

| Use Case | Best Option |
|---|---|
| Dashboard / monitoring | reltuples |
| Real-time estimate | yb_hash_code() sampling |
| Exact reporting | COUNT(*) |


📚 Big Picture

Row counting in YugabyteDB is a spectrum:

• Fastest: reltuples (metadata)
• Balanced: hash sampling (this tip)
• Exact + scalable: tablet-parallel techniques

🧾 Summary

• COUNT(*) is expensive in YugabyteDB because it scans all tablets across the cluster
• yb_hash_code() sampling provides a fast, low-impact way to estimate row counts in real time
• Composite hash keys must use the full HASH portion (for example, yb_hash_code(c1, c2))
• pg_class.reltuples is the fastest option when statistics are fresh
• The helper function combines both approaches for a smart, adaptive solution

Have Fun!