Simulating Row-Level TTL in YSQL

YugabyteDB supports row-level TTL in YCQL (its Cassandra-compatible API), but not yet in YSQL (its PostgreSQL-compatible API). The feature request is tracked in GitHub issue #3712.

If you’re writing SQL apps today in YSQL and want old rows to “expire,” you need a strategy that fits a distributed system where large DELETEs can be expensive.

Good news: there are several TTL-like patterns that work great in YSQL. Some hide expired rows; others purge them efficiently without hammering your YugabyteDB cluster.

TL;DR … what to use when
  1. Time-range partitioning + drop/detach → fastest & simplest at scale (ideal for time-series and logs).

  2. Hash-bucketed deletes with pg_cron → when you can’t partition (or for backfills), slice work with yb_hash_code(...).

  3. “Soft TTL” via RLS or a view → hide old rows instantly; purge later with #1 or #2.

  4. Batch deletes via ybctid → tiny, locality-friendly batches that play nicely with compaction.

Option 1: Time-Range Partitioning + Drop/Detach 🚀

Partition by time (daily/weekly/monthly). When data ages out, drop or detach the old partition… this is a near-instant metadata operation (no row-by-row deletes).

1A) DIY with native partitions + pg_cron (step-by-step)

(1) Create the partitioned table

Ensure the PRIMARY KEY includes the partition key (created_at)… PostgreSQL requires every primary key or unique constraint on a partitioned table to include the partition key.

				
					CREATE TABLE IF NOT EXISTS events (
  id         BIGSERIAL,
  created_at timestamptz NOT NULL,
  payload    jsonb       NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
				
			

(2) Create a monthly partition (manually, example)

				
					CREATE TABLE IF NOT EXISTS events_2025_09 PARTITION OF events
  FOR VALUES FROM ('2025-09-01') TO ('2025-10-01');
				
			

Repeat as needed for each month you want to keep “online.”
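Creating each month by hand gets tedious. As a sketch (assuming the events table above, and that `CREATE TABLE IF NOT EXISTS` is available), a DO block can stamp out several months at once with generate_series:

```sql
-- Pre-create partitions for the current month plus the next two (adjust to taste)
DO $$
DECLARE
  m date;
BEGIN
  FOR m IN
    SELECT generate_series(date_trunc('month', now()),
                           date_trunc('month', now()) + interval '2 months',
                           interval '1 month')::date
  LOOP
    EXECUTE format(
      'CREATE TABLE IF NOT EXISTS %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
      'events_' || to_char(m, 'YYYY_MM'), m, (m + interval '1 month')::date
    );
  END LOOP;
END $$;
```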

(3) Rotate out old data: archive (detach) or purge (drop)

				
					-- Archive (detach and move to archive schema)
ALTER TABLE events DETACH PARTITION events_2025_06;
ALTER TABLE events_2025_06 SET SCHEMA archive;

-- Purge (free disk immediately)
DROP TABLE IF EXISTS events_2025_06;

				
			

(4) Automate with pg_cron

With pg_cron, you can (a) pre-create next month’s partition and (b) detach/drop anything older than your retention.

Helper to pre-create a monthly partition when missing:

				
					CREATE OR REPLACE PROCEDURE ensure_events_month(p_month date)
LANGUAGE plpgsql AS $$
DECLARE
  v_from date := date_trunc('month', p_month)::date;
  v_to   date := (date_trunc('month', p_month) + interval '1 month')::date;
  v_part text := 'events_' || to_char(v_from, 'YYYY_MM');
  v_exists boolean;
BEGIN
  SELECT EXISTS (
    SELECT 1
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relname = v_part
      AND n.nspname = 'public'
  ) INTO v_exists;

  IF NOT v_exists THEN
    EXECUTE format(
      'CREATE TABLE %I PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
      v_part, v_from::timestamptz, v_to::timestamptz
    );
  END IF;
END $$;
				
			

Rotation procedure (calls the helper, then detaches/drops old partitions):

				
					CREATE OR REPLACE PROCEDURE rotate_events(retention_days int DEFAULT 90)
LANGUAGE plpgsql AS $$
DECLARE
  next_month date := (date_trunc('month', now()) + interval '1 month')::date;
  cut_off    timestamptz := now() - make_interval(days => retention_days);
  part       record;
  from_ts    timestamptz;
BEGIN
  -- Ensure next month’s partition exists
  CALL ensure_events_month(next_month);

  -- Retire any partition fully older than the cutoff
  FOR part IN
    SELECT c.relname, pg_get_expr(c.relpartbound, c.oid) AS bounds
      FROM pg_class p
      JOIN pg_inherits i ON i.inhparent = p.oid
      JOIN pg_class c ON c.oid = i.inhrelid
     WHERE p.relname = 'events'
       AND p.relkind = 'p'
  LOOP
    -- Extract the FROM('yyyy-mm-01') bound and determine its month
    from_ts := (regexp_match(part.bounds, $$FROM \('([^']+)'\)$$))[1]::timestamptz;

    IF from_ts + interval '1 month' < cut_off THEN
      -- Choose one:
      EXECUTE format('ALTER TABLE events DETACH PARTITION %I', part.relname);  -- archive
      -- EXECUTE format('DROP TABLE IF EXISTS %I', part.relname);              -- purge
    END IF;
  END LOOP;
END $$;
				
			

Schedule it (e.g., monthly on the 1st at 00:05):

				
					SELECT cron.schedule(
  'events_rotate_monthly',
  '5 0 1 * *',
  $$CALL rotate_events(90);$$
);
				
			

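Once scheduled, you can confirm the job is registered and watch its run history; pg_cron records both in its own tables (cron.job and cron.job_run_details):

```sql
-- List registered jobs
SELECT jobid, jobname, schedule, command FROM cron.job;

-- Recent run outcomes (status, timing) for all jobs
SELECT j.jobname, d.status, d.start_time, d.end_time
FROM cron.job_run_details d
JOIN cron.job j USING (jobid)
ORDER BY d.start_time DESC
LIMIT 10;
```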
(5) Inspect what partitions exist

				
					SELECT
  c.relname AS partition_name,
  pg_get_expr(c.relpartbound, c.oid) AS bounds
FROM pg_class p
JOIN pg_inherits i ON i.inhparent = p.oid
JOIN pg_class c ON c.oid = i.inhrelid
WHERE p.relname = 'events'
  AND p.relkind = 'p'
ORDER BY c.relname;
				
			
1B) Or use pg_partman for hands-off management

pg_partman can pre-create partitions and enforce retention automatically.

				
					-- Enable the extension
CREATE EXTENSION IF NOT EXISTS pg_partman;

-- Parent table (same as above)
CREATE TABLE IF NOT EXISTS events (
  id         BIGSERIAL,
  created_at timestamptz NOT NULL,
  payload    jsonb       NOT NULL,
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Hand over to pg_partman (monthly partitions on created_at)
SELECT partman.create_parent(
  p_parent_table := 'public.events',
  p_control      := 'created_at',
  p_type         := 'native',   -- pg_partman v4 syntax; check your pg_partman version
  p_interval     := 'monthly',
  p_premake      := 4,          -- pre-create 4 future partitions
  p_retention    := '3 months'  -- drop (or detach) older than 3 months
);

-- Run maintenance periodically to apply retention & premake
SELECT cron.schedule(
  'events_partman_maintenance',
  '0 * * * *',
  $$SELECT partman.run_maintenance();$$
);
-- If your deployment exposes run_maintenance() as a PROCEDURE instead of a FUNCTION:
-- $$CALL partman.run_maintenance();$$
				
			

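To verify that pg_partman has taken over the table, and see the retention it will enforce, you can query its configuration table (partman.part_config):

```sql
-- Confirm pg_partman is managing the table and what retention it will apply
SELECT parent_table, control, retention
FROM partman.part_config
WHERE parent_table = 'public.events';
```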
Tip: Prefer DETACH for a warm “archive” tier and DROP when you truly want to reclaim space immediately.

Option 2: Scheduled Deletes with pg_cron + yb_hash_code() ⏱️

When partitioning isn’t feasible, delete in tiny batches and only within a single hash slice per job. This localizes work to a subset of tablets and keeps compaction smooth.

This approach generalizes the Quicker table row counts trick: bucket work by yb_hash_code() and walk small ranges.

We’ll use one generic function that:

  • auto-detects the table’s HASH key columns from the PRIMARY KEY (e.g., PRIMARY KEY((c1,c2) HASH, c3) → hashes only c1,c2),

  • or lets you override the columns to hash (p_hash_cols),

  • deletes via ybctid batches (fast, storage-friendly), and

  • filters each run to a hash range (p_lo..p_hi).

				
					-- yb_ttl_delete_bucket: delete up to p_batch_limit expired rows
-- from a single yb_hash_code slice [p_lo, p_hi].
-- Uses only HASH-key columns by default; can be overridden with p_hash_cols.

CREATE OR REPLACE FUNCTION yb_ttl_delete_bucket(
  p_schema        text,           -- e.g. 'public'
  p_table         text,           -- e.g. 'events'
  p_ts_col        text,           -- TTL column, e.g. 'expires_at'
  p_retention     interval,       -- e.g. '0 days' if p_ts_col already encodes expiry
  p_lo            int,            -- inclusive hash lower bound (0..65535)
  p_hi            int,            -- inclusive hash upper bound
  p_batch_limit   int    DEFAULT 5000,
  p_hash_cols     text[] DEFAULT NULL  -- optional override: ARRAY['tenant_id','id']
) RETURNS integer
LANGUAGE plpgsql AS $$
DECLARE
  v_sql       text;
  v_deleted   int;
  v_hash_arg  text;   -- "col1,col2,...", quoted
  v_pkdef     text;   -- PRIMARY KEY definition
  v_csv       text;   -- HASH key CSV from PK def
BEGIN
  IF p_hash_cols IS NOT NULL AND array_length(p_hash_cols, 1) IS NOT NULL THEN
    SELECT string_agg(format('%I', col), ',')
      INTO v_hash_arg
    FROM unnest(p_hash_cols) AS col;

  ELSE
    SELECT pg_get_indexdef(i.indexrelid)
      INTO v_pkdef
    FROM pg_index i
    JOIN pg_class c ON c.oid = i.indrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = p_schema
      AND c.relname  = p_table
      AND i.indisprimary;

    IF v_pkdef IS NULL THEN
      RAISE EXCEPTION 'Table %.% has no PRIMARY KEY', p_schema, p_table;
    END IF;

    -- Prefer explicit HASH key columns if present: PRIMARY KEY((... ) HASH, ...)
    SELECT (regexp_matches(v_pkdef, '\(\(([^)]*)\)\s*HASH', 'i'))[1]
      INTO v_csv;

    IF v_csv IS NOT NULL THEN
      SELECT string_agg(format('%I', btrim(s)), ',')
        INTO v_hash_arg
      FROM unnest(string_to_array(v_csv, ',')) AS s;
    ELSE
      -- Fallback: all PK cols in key order (ORDER BY is inside string_agg)
      SELECT string_agg(format('%I', a.attname), ',' ORDER BY k.ord)
        INTO v_hash_arg
      FROM pg_index i
      JOIN pg_class c ON c.oid = i.indrelid
      JOIN pg_namespace n ON n.oid = c.relnamespace
      JOIN unnest(i.indkey) WITH ORDINALITY k(attnum, ord) ON TRUE
      JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = k.attnum
      WHERE n.nspname = p_schema
        AND c.relname  = p_table
        AND i.indisprimary;
    END IF;
  END IF;

  IF v_hash_arg IS NULL OR v_hash_arg = '' THEN
    RAISE EXCEPTION
      'yb_ttl_delete_bucket: no hash columns resolved for %.%. Provide p_hash_cols or define a PRIMARY KEY.',
      p_schema, p_table;
  END IF;

  v_sql := format($f$
    WITH doomed AS (
      SELECT ybctid
      FROM %1$I.%2$I
      WHERE %3$I < now() - $1
        AND yb_hash_code(%4$s) BETWEEN $2 AND $3
      ORDER BY ybctid
      LIMIT $4
    )
    DELETE FROM %1$I.%2$I e
    USING doomed d
    WHERE e.ybctid = d.ybctid
  $f$, p_schema, p_table, p_ts_col, v_hash_arg);

  EXECUTE v_sql USING p_retention, p_lo, p_hi, p_batch_limit;
  GET DIAGNOSTICS v_deleted = ROW_COUNT;

  RETURN COALESCE(v_deleted, 0);
END $$;
				
			

Examples:

Auto-use the HASH key from the PK

				
					-- If table has PRIMARY KEY((c1,c2) HASH, c3), this hashes only (c1,c2).
SELECT yb_ttl_delete_bucket('public','events','created_at',
                            '90 days', 0, 4095, 5000);
				
			

Override columns to hash (multi-tenant):

				
					SELECT yb_ttl_delete_bucket('sales','orders','created_at',
                            '120 days', 4096, 8191, 4000,
                            ARRAY['tenant_id','id']);
				
			
Schedule 16 tiny jobs with pg_cron

yb_hash_code() returns 0..65535. Sixteen buckets → width 4096 each.

				
					-- Cover the buckets with runs staggered across a 4-minute cycle (tune as needed)

-- Buckets 0..3: each bucket runs every 4 minutes, offset by one minute
SELECT cron.schedule('ttl_ev_b0', '0-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days',     0,  4095, 5000);$$);

SELECT cron.schedule('ttl_ev_b1', '1-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days',  4096,  8191, 5000);$$);

SELECT cron.schedule('ttl_ev_b2', '2-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days',  8192, 12287, 5000);$$);

SELECT cron.schedule('ttl_ev_b3', '3-59/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','events','created_at','90 days', 12288, 16383, 5000);$$);

-- Repeat similarly for buckets 4..15 (or emit via a small generator script)
				
			

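Rather than hand-writing all sixteen cron.schedule statements, they can be emitted by a small generator script. Here is a minimal sketch in Python; the 0..65535 hash space and the yb_ttl_delete_bucket call shape follow the examples above, the ttl_ev_bN job names are the same illustrative names, and the staggering scheme (one offset minute per bucket within a 4-minute cycle) is one reasonable choice, not the only one:

```python
# Generate cron.schedule statements for N equal yb_hash_code buckets.
# yb_hash_code() spans 0..65535; each bucket gets an equal inclusive slice.

def bucket_bounds(n_buckets: int, space: int = 65536):
    """Return (lo, hi) inclusive bounds for each of n_buckets equal slices."""
    width = space // n_buckets
    return [(i * width, (i + 1) * width - 1) for i in range(n_buckets)]

def emit_jobs(n_buckets: int = 16, stagger: int = 4, batch: int = 5000) -> str:
    """Emit one cron.schedule statement per bucket, staggered within a cycle."""
    stmts = []
    for i, (lo, hi) in enumerate(bucket_bounds(n_buckets)):
        # Offset each job within a `stagger`-minute cycle so runs spread out.
        sched = f"{i % stagger}-59/{stagger} * * * *"
        stmts.append(
            f"SELECT cron.schedule('ttl_ev_b{i}', '{sched}',\n"
            f"$$SELECT yb_ttl_delete_bucket('public','events','created_at',"
            f"'90 days', {lo}, {hi}, {batch});$$);"
        )
    return "\n\n".join(stmts)

if __name__ == "__main__":
    print(emit_jobs())
```

Pipe the output into ysqlsh (or psql) to register all the jobs in one go.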
Tuning tips

  • Keep each batch small (p_batch_limit 2k–10k).

  • Scale by more buckets and more frequent runs, not giant deletes.

  • Index the TTL column (created_at) so each batch can locate expired rows without scanning the whole table.
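As a sketch, a range (ASC) index on the TTL column keeps the "older than cutoff" scan confined to the oldest rows (the index name here is illustrative):

```sql
-- Range index so "created_at < cutoff" reads only the oldest entries
CREATE INDEX IF NOT EXISTS events_created_at_idx
  ON events (created_at ASC);
```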

Option 3: Row-Level Security “Soft TTL” 🔒

Sometimes you just need to hide expired rows instead of deleting them:

				
					ALTER TABLE events ENABLE ROW LEVEL SECURITY;

CREATE POLICY ttl_only_fresh ON events
USING (created_at >= now() - interval '90 days');
				
			

Or wrap a simple view around the table:

				
					CREATE VIEW current_events AS
SELECT *
FROM events
WHERE created_at >= now() - interval '90 days';
				
			

⚠️ Rows remain on disk until you clean them up later. Great for apps that only need to filter.
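One more caveat: by default, RLS policies do not apply to the table owner (and superusers always bypass them), so the owning role will still see "expired" rows. If that matters, force the policy onto the owner as well:

```sql
-- Apply RLS policies even to the table owner (superusers still bypass)
ALTER TABLE events FORCE ROW LEVEL SECURITY;
```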

Option 4: Batch Deletes via ybctid 🎯

ybctid is Yugabyte’s internal row identifier… perfect for deleting rows in small, ordered batches.

				
					WITH doomed AS (
  SELECT ybctid
  FROM events
  WHERE created_at < now() - interval '90 days'
  ORDER BY ybctid
  LIMIT 5000
)
DELETE FROM events e
USING doomed d
WHERE e.ybctid = d.ybctid;
				
			

You can also combine this with the hash slice idea (Option 2) by adding a yb_hash_code(...) BETWEEN ... filter inside the doomed CTE.
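Concretely, the combined form looks like this, assuming id is the table’s hash column and using the same illustrative bucket bounds and batch size as Option 2:

```sql
WITH doomed AS (
  SELECT ybctid
  FROM events
  WHERE created_at < now() - interval '90 days'
    AND yb_hash_code(id) BETWEEN 0 AND 4095   -- one 1/16th slice of the hash space
  ORDER BY ybctid
  LIMIT 5000
)
DELETE FROM events e
USING doomed d
WHERE e.ybctid = d.ybctid;
```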

Summary
  • Best performance: Time-range partitioning + scheduled drop/detach (pg_cron) or fully managed via pg_partman.

  • When partitioning isn’t possible (or for backfills): yb_hash_code-bucketed deletes with our yb_ttl_delete_bucket(...) function, run as tiny, frequent jobs.

  • Quick to adopt: RLS or views to hide old data immediately; purge later.

  • Low-level control: ybctid batches for smooth compaction and predictable IO.

⚔️✅ These patterns are battle-tested for distributed SQL. Until native YSQL TTL lands, they’ll keep your clusters lean, your queries fast, and your operations predictable.

Have Fun!

Chippy, our backyard chipmunk, is getting so comfortable that he now lounges on the back deck while my wife and I sit just a few feet away on the glider!