Simulating Row-Level TTL in YSQL

YugabyteDB supports row-level TTL in YCQL (its Cassandra-compatible API), but not yet in YSQL (its PostgreSQL-compatible API). The feature request is tracked in GitHub issue #3712.

If you’re writing SQL apps today in YSQL and want old rows to “expire,” you need a strategy that fits a distributed system where large DELETEs can be expensive.

Good news: there are several TTL-like patterns that work great in YSQL. Some hide expired rows; others purge them efficiently without hammering your YugabyteDB cluster.

TL;DR … what to use when
  1. Time-range partitioning + drop/detach → fastest & simplest at scale (ideal for time-series and logs).

  2. Hash-bucketed deletes with pg_cron → when you can’t partition (or for backfills), slice work with yb_hash_code(...).

  3. “Soft TTL” via RLS or a view → hide old rows instantly; purge later with #1 or #2.

  4. Batch deletes via ybctid → tiny, locality-friendly batches that play nicely with compaction.

Option 1: Time-Range Partitioning + Drop/Detach 🚀

Partition by time (daily/weekly/monthly). When data ages out, drop or detach the old partition… this is a near-instant metadata operation (no row-by-row deletes).

1A) DIY with native partitions + pg_cron (step-by-step)

(1) Create the partitioned table

Ensure the PRIMARY KEY includes the partition key (created_at)… PostgreSQL requires every primary key or unique constraint on a partitioned table to include the partition key.

				
					CREATE TABLE IF NOT EXISTS events (
  id         BIGSERIAL,
  created_at timestamptz NOT NULL,
  payload    jsonb       NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
				
			

(2) Create a monthly partition (manually, example)

				
					CREATE TABLE IF NOT EXISTS events_2025_09 PARTITION OF events
  FOR VALUES FROM ('2025-09-01') TO ('2025-10-01');
				
			

Repeat as needed for each month you want to keep “online.”
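Creating each month by hand gets tedious. As a sketch (assuming the events table above, and that `CREATE TABLE IF NOT EXISTS` is available), a DO block can stamp out several months at once with generate_series:

```sql
-- Pre-create partitions for the current month plus the next two (adjust to taste)
DO $$
DECLARE
  m date;
BEGIN
  FOR m IN
    SELECT generate_series(date_trunc('month', now()),
                           date_trunc('month', now()) + interval '2 months',
                           interval '1 month')::date
  LOOP
    EXECUTE format(
      'CREATE TABLE IF NOT EXISTS %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
      'events_' || to_char(m, 'YYYY_MM'), m, (m + interval '1 month')::date
    );
  END LOOP;
END $$;
```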

(3) Rotate out old data: archive (detach) or purge (drop)

				
					-- Archive (detach and move to archive schema)
ALTER TABLE events DETACH PARTITION events_2025_06;
ALTER TABLE events_2025_06 SET SCHEMA archive;

-- Purge (free disk immediately)
DROP TABLE IF EXISTS events_2025_06;

				
			

(4) Automate with pg_cron

With pg_cron, you can (a) pre-create next month’s partition and (b) detach/drop anything older than your retention.

Helper to pre-create a monthly partition when missing:

				
					CREATE OR REPLACE PROCEDURE ensure_events_month(p_month date)
LANGUAGE plpgsql AS $$
DECLARE
  v_from date := date_trunc('month', p_month)::date;
  v_to   date := (date_trunc('month', p_month) + interval '1 month')::date;
  v_part text := 'events_' || to_char(v_from, 'YYYY_MM');
  v_exists boolean;
BEGIN
  SELECT EXISTS (
    SELECT 1
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relname = v_part
      AND n.nspname = 'public'
  ) INTO v_exists;

  IF NOT v_exists THEN
    EXECUTE format(
      'CREATE TABLE %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
      v_part, v_from::timestamptz, v_to::timestamptz
    );
  END IF;
END $$;
				
			

Rotation procedure (calls the helper, then detaches/drops old partitions):

				
					CREATE OR REPLACE PROCEDURE rotate_events(retention_days int DEFAULT 90)
LANGUAGE plpgsql AS $$
DECLARE
  next_month date := (date_trunc('month', now()) + interval '1 month')::date;
  cut_off    timestamptz := now() - make_interval(days => retention_days);
  part       record;
  from_ts    timestamptz;
BEGIN
  -- Ensure next month’s partition exists
  CALL ensure_events_month(next_month);

  -- Retire any partition fully older than the cutoff
  FOR part IN
    SELECT c.relname, pg_get_expr(c.relpartbound, c.oid) AS bounds
      FROM pg_class p
      JOIN pg_inherits i ON i.inhparent = p.oid
      JOIN pg_class c ON c.oid = i.inhrelid
     WHERE p.relname = 'events'
       AND p.relkind = 'p'
  LOOP
    -- Extract the FROM('yyyy-mm-01') bound and determine its month
    from_ts := (regexp_match(part.bounds, $$FROM \('([^']+)'\)$$))[1]::timestamptz;

    IF from_ts + interval '1 month' < cut_off THEN
      -- Choose one:
      EXECUTE format('ALTER TABLE events DETACH PARTITION %I', part.relname);  -- archive
      -- EXECUTE format('DROP TABLE IF EXISTS %I', part.relname);              -- purge
    END IF;
  END LOOP;
END $$;
				
			

Schedule it (e.g., monthly on the 1st at 00:05):

				
					SELECT cron.schedule(
  'events_rotate_monthly',
  '5 0 1 * *',
  $$CALL rotate_events(90);$$
);
				
			

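Once scheduled, you can confirm the job is registered and watch its run history; pg_cron records both in its own tables (cron.job and cron.job_run_details):

```sql
-- List registered jobs
SELECT jobid, jobname, schedule, command FROM cron.job;

-- Recent run outcomes (status, timing) for all jobs
SELECT j.jobname, d.status, d.start_time, d.end_time
FROM cron.job_run_details d
JOIN cron.job j USING (jobid)
ORDER BY d.start_time DESC
LIMIT 10;
```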
(5) Inspect what partitions exist

				
					SELECT
  c.relname AS partition_name,
  pg_get_expr(c.relpartbound, c.oid) AS bounds
FROM pg_class p
JOIN pg_inherits i ON i.inhparent = p.oid
JOIN pg_class c ON c.oid = i.inhrelid
WHERE p.relname = 'events'
  AND p.relkind = 'p'
ORDER BY c.relname;
				
			
1B) Or use pg_partman for hands-off management

pg_partman can pre-create partitions and enforce retention automatically.

				
					-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_partman;

-- Parent table (same as above)
CREATE TABLE IF NOT EXISTS events (
  id         BIGSERIAL,
  created_at timestamptz NOT NULL,
  payload    jsonb       NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Hand over to pg_partman (monthly partitions on created_at)
SELECT partman.create_parent(
  p_parent_table := 'public.events',
  p_control      := 'created_at',
  p_type         := 'native',   -- pg_partman v4 syntax; check your pg_partman version
  p_interval     := 'monthly',
  p_premake      := 4,          -- pre-create 4 future partitions
  p_retention    := '3 months'  -- drop (or detach) older than 3 months
);

-- Run maintenance periodically to apply retention & premake
SELECT cron.schedule(
  'events_partman_maintenance',
  '0 * * * *',
  $$SELECT partman.run_maintenance();$$
);
-- If your deployment exposes run_maintenance() as a PROCEDURE instead of a FUNCTION:
-- $$CALL partman.run_maintenance();$$
				
			

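To verify that pg_partman has taken over the table, and see the retention it will enforce, you can query its configuration table (partman.part_config):

```sql
-- Confirm pg_partman is managing the table and what retention it will apply
SELECT parent_table, control, retention
FROM partman.part_config
WHERE parent_table = 'public.events';
```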
Tip: Prefer DETACH for a warm “archive” tier and DROP when you truly want to reclaim space immediately.

Option 2: Scheduled Deletes with pg_cron + yb_hash_code() ⏱️

When partitioning isn’t feasible, delete in tiny batches and only within a single hash slice per job. This localizes work to a subset of tablets and keeps compaction smooth.

This approach generalizes the Quicker table row counts trick: bucket work by yb_hash_code() and walk small ranges.

We’ll use one generic function that:

  • auto-detects the table’s HASH key columns from the PRIMARY KEY (e.g., PRIMARY KEY((c1,c2) HASH, c3) → hashes only c1,c2),

  • or lets you override the columns to hash (p_hash_cols),

  • deletes via ybctid batches (fast, storage-friendly), and

  • filters each run to a hash range (p_lo..p_hi).

				
					-- yb_ttl_delete_bucket: delete up to p_batch_limit expired rows
-- from a single yb_hash_code slice [p_lo, p_hi].
-- Uses only HASH-key columns by default; can be overridden with p_hash_cols.

CREATE OR REPLACE FUNCTION yb_ttl_delete_bucket(
  p_schema        text,           -- e.g. 'public'
  p_table         text,           -- e.g. 'events'
  p_ts_col        text,           -- TTL column, e.g. 'expires_at'
  p_retention     interval,       -- e.g. '0 days' if p_ts_col already encodes expiry
  p_lo            int,            -- inclusive hash lower bound (0..65535)
  p_hi            int,            -- inclusive hash upper bound
  p_batch_limit   int    DEFAULT 5000,
  p_hash_cols     text[] DEFAULT NULL  -- optional override: ARRAY['tenant_id','id']
) RETURNS integer
LANGUAGE plpgsql AS $$
DECLARE
  v_sql       text;
  v_deleted   int;
  v_hash_arg  text;   -- "col1,col2,...", quoted
  v_pkdef     text;   -- PRIMARY KEY definition
  v_csv       text;   -- HASH key CSV from PK def
BEGIN
  IF p_hash_cols IS NOT NULL AND array_length(p_hash_cols, 1) IS NOT NULL THEN
    SELECT string_agg(format('%I', col), ',')
      INTO v_hash_arg
    FROM unnest(p_hash_cols) AS col;

  ELSE
    SELECT pg_get_indexdef(i.indexrelid)
      INTO v_pkdef
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = p_schema
      AND c.relname  = p_table
      AND i.indisprimary;

    IF v_pkdef IS NULL THEN
      RAISE EXCEPTION 'Table %.% has no PRIMARY KEY', p_schema, p_table;
    END IF;

    -- Prefer explicit HASH key columns if present: PRIMARY KEY((... ) HASH, ...)
    SELECT (regexp_matches(v_pkdef, '\(\(([^)]*)\)\s*HASH', 'i'))[1]
      INTO v_csv;

    IF v_csv IS NOT NULL THEN
      SELECT string_agg(format('%I', btrim(s)), ',')
        INTO v_hash_arg
      FROM unnest(string_to_array(v_csv, ',')) AS s;
    ELSE
      -- Fallback: all PK cols in key order (ORDER BY is inside string_agg)
      SELECT string_agg(format('%I', a.attname), ',' ORDER BY k.ord)
        INTO v_hash_arg
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indrelid
      JOIN pg_namespace n ON n.oid = c.relnamespace
      JOIN unnest(i.indkey) WITH ORDINALITY k(attnum, ord) ON TRUE
      JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = k.attnum
      WHERE n.nspname = p_schema
        AND c.relname  = p_table
        AND i.indisprimary;
    END IF;
  END IF;

  IF v_hash_arg IS NULL OR v_hash_arg = '' THEN
    RAISE EXCEPTION
      'yb_ttl_delete_bucket: no hash columns resolved for %.%. Provide p_hash_cols or define a PRIMARY KEY.',
      p_schema, p_table;
  END IF;

  v_sql := format($f$
    WITH doomed AS (
      SELECT ybctid
      FROM %1$I.%2$I
      WHERE %3$I < now() - $1
        AND yb_hash_code(%4$s) BETWEEN $2 AND $3
      ORDER BY ybctid
      LIMIT $4
    )
    DELETE FROM %1$I.%2$I e
    USING doomed d
    WHERE e.ybctid = d.ybctid
  $f$, p_schema, p_table, p_ts_col, v_hash_arg);

  EXECUTE v_sql USING p_retention, p_lo, p_hi, p_batch_limit;
  GET DIAGNOSTICS v_deleted = ROW_COUNT;

  RETURN COALESCE(v_deleted, 0);
END $$;
				
			

Examples:

Auto-use the HASH key from the PK

				
					-- If table has PRIMARY KEY((c1,c2) HASH, c3), this hashes only (c1,c2).
SELECT yb_ttl_delete_bucket('public','events','created_at',
                            '90 days', 0, 4095, 5000);
				
			

Override columns to hash (multi-tenant):

				
					SELECT yb_ttl_delete_bucket('sales','orders','created_at',
                            '120 days', 4096, 8191, 4000,
                            ARRAY['tenant_id','id']);
				
			
Schedule 16 tiny jobs with pg_cron

yb_hash_code() returns 0..65535. Sixteen buckets → width 4096 each.

				
					-- Cover the buckets with runs staggered across a 4-minute cycle (tune as needed)

-- Buckets 0..3: each bucket runs every 4 minutes, offset by one minute
SELECT cron.schedule('ttl_ev_b0', '0-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days',     0,  4095, 5000);$$);

SELECT cron.schedule('ttl_ev_b1', '1-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days',  4096,  8191, 5000);$$);

SELECT cron.schedule('ttl_ev_b2', '2-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days',  8192, 12287, 5000);$$);

SELECT cron.schedule('ttl_ev_b3', '3-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days', 12288, 16383, 5000);$$);

-- Repeat similarly for buckets 4..15 (or emit via a small generator script)
				
			

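Rather than hand-writing all sixteen cron.schedule statements, they can be emitted by a small generator script. Here is a minimal sketch in Python; the 0..65535 hash space and the yb_ttl_delete_bucket call shape follow the examples above, the ttl_ev_bN job names are the same illustrative names, and the staggering scheme (one offset minute per bucket within a 4-minute cycle) is one reasonable choice, not the only one:

```python
# Generate cron.schedule statements for N equal yb_hash_code buckets.
# yb_hash_code() spans 0..65535; each bucket gets an equal inclusive slice.

def bucket_bounds(n_buckets: int, space: int = 65536):
    """Return (lo, hi) inclusive bounds for each of n_buckets equal slices."""
    width = space // n_buckets
    return [(i * width, (i + 1) * width - 1) for i in range(n_buckets)]

def emit_jobs(n_buckets: int = 16, stagger: int = 4, batch: int = 5000) -> str:
    """Emit one cron.schedule statement per bucket, staggered within a cycle."""
    stmts = []
    for i, (lo, hi) in enumerate(bucket_bounds(n_buckets)):
        # Offset each job within a `stagger`-minute cycle so runs spread out.
        sched = f"{i % stagger}-59/{stagger} * * * *"
        stmts.append(
            f"SELECT cron.schedule('ttl_ev_b{i}', '{sched}',\n"
            f"$$SELECT yb_ttl_delete_bucket('public','events','created_at',"
            f"'90 days', {lo}, {hi}, {batch});$$);"
        )
    return "\n\n".join(stmts)

if __name__ == "__main__":
    print(emit_jobs())
```

Pipe the output into ysqlsh (or psql) to register all the jobs in one go.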
Tuning tips

  • Keep each batch small (p_batch_limit 2k–10k).

  • Scale by more buckets and more frequent runs, not giant deletes.

  • Index the TTL column (created_at) so each batch can locate expired rows without scanning the whole table.
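As a sketch, a range (ASC) index on the TTL column keeps the "older than cutoff" scan confined to the oldest rows (the index name here is illustrative):

```sql
-- Range index so "created_at < cutoff" reads only the oldest entries
CREATE INDEX IF NOT EXISTS events_created_at_idx
  ON events (created_at ASC);
```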

Option 3: Row-Level Security “Soft TTL” 🔒

Sometimes you just need to hide expired rows instead of deleting them:

				
					ALTER TABLE events ENABLE ROW LEVEL SECURITY;

CREATE POLICY ttl_only_fresh ON events
USING (created_at >= now() - interval '90 days');
				
			

Or wrap a simple view around the table:

				
					CREATE VIEW current_events AS
SELECT *
FROM events
WHERE created_at >= now() - interval '90 days';
				
			

⚠️ Rows remain on disk until you clean them up later. Great for apps that only need to filter.
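One more caveat: by default, RLS policies do not apply to the table owner (and superusers always bypass them), so the owning role will still see "expired" rows. If that matters, force the policy onto the owner as well:

```sql
-- Apply RLS policies even to the table owner (superusers still bypass)
ALTER TABLE events FORCE ROW LEVEL SECURITY;
```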

Option 4: Batch Deletes via ybctid 🎯

ybctid is Yugabyte’s internal row identifier… perfect for deleting rows in small, ordered batches.

				
					WITH doomed AS (
  SELECT ybctid
  FROM events
  WHERE created_at < now() - interval '90 days'
  ORDER BY ybctid
  LIMIT 5000
)
DELETE FROM events e
USING doomed d
WHERE e.ybctid = d.ybctid;
				
			

You can also combine this with the hash slice idea (Option 2) by adding a yb_hash_code(...) BETWEEN ... filter inside the doomed CTE.
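Concretely, the combined form looks like this, assuming id is the table’s hash column and using the same illustrative bucket bounds and batch size as Option 2:

```sql
WITH doomed AS (
  SELECT ybctid
  FROM events
  WHERE created_at < now() - interval '90 days'
    AND yb_hash_code(id) BETWEEN 0 AND 4095   -- one 1/16th slice of the hash space
  ORDER BY ybctid
  LIMIT 5000
)
DELETE FROM events e
USING doomed d
WHERE e.ybctid = d.ybctid;
```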

Summary
  • Best performance: Time-range partitioning + scheduled drop/detach (pg_cron) or fully managed via pg_partman.

  • When partitioning isn’t possible (or for backfills): yb_hash_code-bucketed deletes with our yb_ttl_delete_bucket(...) function, run as tiny, frequent jobs.

  • Quick to adopt: RLS or views to hide old data immediately; purge later.

  • Low-level control: ybctid batches for smooth compaction and predictable IO.

⚔️✅ These patterns are battle-tested for distributed SQL. Until native YSQL TTL lands, they’ll keep your clusters lean, your queries fast, and your operations predictable.

Have Fun!

Chippy, our backyard chipmunk, is getting so comfortable that he now lounges on the back deck while my wife and I sit just a few feet away on the glider!