You can still emulate row-level TTL in YSQL by storing each row’s expiry somewhere and running small, hash-bucketed delete jobs.
Three practical patterns:
- Add an expires_at column (recommended)
- Derive expires_at from minimal metadata (e.g., created_at + per-row TTL)
- Use a sidecar TTL registry table (when you can’t change the base schema)
All three plug into the same cleanup engine: tiny, frequent deletes filtered by yb_hash_code(...) buckets.
Related reading: YCQL’s native row-level TTL is described in the YugabyteDB docs: https://docs.yugabyte.com/preview/develop/learn/ttl-data-expiration-ycql/#row-level-ttl
Companion tip: Simulating Row-Level TTL in YSQL … discusses broader TTL strategies in YSQL including partitions + hash-bucket deletes + ybctid.
Why do this?
● Distributed deletes are expensive; big “sweep” deletes create tombstones and compaction spikes.
● The trick is to delete in tiny batches, one hash slice at a time, so work is localized and smooth.
● If you can partition by time and drop/detach partitions, that’s still king. But when you want row-level TTL (per-row expiry) and your table has no timestamp, use one of the patterns below.
Option A (recommended): Add an expires_at column
Adds the most flexibility and mirrors YCQL’s per-row TTL semantics.
-- 1) Add expiry; keep it NULL-safe so you can roll out gradually
ALTER TABLE your_table
ADD COLUMN IF NOT EXISTS expires_at timestamptz;
-- 2) Default behavior: 90 days from INSERT (choose your policy)
CREATE OR REPLACE FUNCTION set_expires_on_insert()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
IF NEW.expires_at IS NULL THEN
NEW.expires_at := now() + interval '90 days';
END IF;
RETURN NEW;
END $$;
DROP TRIGGER IF EXISTS trg_set_expires_ins ON your_table;
CREATE TRIGGER trg_set_expires_ins
BEFORE INSERT ON your_table
FOR EACH ROW EXECUTE FUNCTION set_expires_on_insert();
-- (Optional) If you want updates to refresh TTL (YCQL-like):
CREATE OR REPLACE FUNCTION refresh_expires_on_update()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
-- Only refresh if row TTL is policy-driven, not explicitly set
IF NEW.expires_at IS NULL OR NEW.expires_at = OLD.expires_at THEN
NEW.expires_at := now() + interval '90 days';
END IF;
RETURN NEW;
END $$;
DROP TRIGGER IF EXISTS trg_refresh_expires_upd ON your_table;
-- Enable this only if you want "update resets TTL"
-- CREATE TRIGGER trg_refresh_expires_upd
-- BEFORE UPDATE ON your_table
-- FOR EACH ROW EXECUTE FUNCTION refresh_expires_on_update();
-- 3) (Optional) Index to accelerate filtering (useful on very large tables)
CREATE INDEX IF NOT EXISTS your_table_expires_at_idx ON your_table (expires_at);
Cleanup: Use the generic hash-bucket delete function below and pass 'expires_at' as the TTL column (retention '0 days', since the column already encodes the expiry).
Option B: Derive expires_at from minimal metadata
Keep writes minimal, but still get a true per-row expiry.
-- Add a creation timestamp, defaulted at insert
ALTER TABLE your_table
ADD COLUMN IF NOT EXISTS created_at timestamptz DEFAULT now();
-- Optional: a per-row integer TTL (seconds); adapt if you prefer days
ALTER TABLE your_table
ADD COLUMN IF NOT EXISTS ttl_seconds integer DEFAULT 90*24*60*60;
-- Maintain a stored expiry value via a trigger (portable)
ALTER TABLE your_table
ADD COLUMN IF NOT EXISTS expires_at timestamptz;
CREATE OR REPLACE FUNCTION compute_expires()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NEW.expires_at := NEW.created_at + make_interval(secs => COALESCE(NEW.ttl_seconds, 0));
RETURN NEW;
END $$;
DROP TRIGGER IF EXISTS trg_compute_expires_ins ON your_table;
CREATE TRIGGER trg_compute_expires_ins
BEFORE INSERT ON your_table
FOR EACH ROW EXECUTE FUNCTION compute_expires();
DROP TRIGGER IF EXISTS trg_compute_expires_upd ON your_table;
CREATE TRIGGER trg_compute_expires_upd
BEFORE UPDATE OF created_at, ttl_seconds ON your_table
FOR EACH ROW EXECUTE FUNCTION compute_expires();
From here, cleanup is identical to Option A.
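The derivation in compute_expires() is plain timestamp arithmetic; a minimal sketch of the same rule outside SQL (function and variable names here are illustrative, not part of any API):

```python
# Mirrors the compute_expires() trigger: expires_at = created_at + ttl_seconds,
# with a NULL ttl treated as 0 (COALESCE(NEW.ttl_seconds, 0) in the trigger).
from datetime import datetime, timedelta, timezone

DEFAULT_TTL_SECONDS = 90 * 24 * 60 * 60  # 90 days, as in the column default

def compute_expires(created_at, ttl_seconds):
    seconds = ttl_seconds if ttl_seconds is not None else 0
    return created_at + timedelta(seconds=seconds)

row_created = datetime(2025, 1, 1, tzinfo=timezone.utc)
expiry = compute_expires(row_created, DEFAULT_TTL_SECONDS)
# 90 days after 2025-01-01 is 2025-04-01 (31 + 28 + 31 days)
```

A NULL ttl_seconds yields expires_at = created_at, i.e., the row is immediately eligible for cleanup; if you prefer "NULL means never expire", adjust both the trigger and this rule accordingly.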
Option C: Sidecar TTL registry (when you can’t change the base schema)
If you must not alter the original table, track expiries in a companion table that shares the base table’s PRIMARY KEY.
-- Base table stays unchanged:
-- CREATE TABLE base_table (..., PRIMARY KEY (...));
CREATE TABLE base_table_ttl (
-- Mirror the base table's PK columns & types:
pk1 ... NOT NULL,
pk2 ... NOT NULL,
expires_at timestamptz NOT NULL,
PRIMARY KEY (pk1, pk2),
FOREIGN KEY (pk1, pk2) REFERENCES base_table(pk1, pk2) ON DELETE CASCADE
);
-- Keep registry updated at INSERT time (extend if you want updates to refresh TTL):
CREATE OR REPLACE FUNCTION ttl_sidecar_on_insert()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
INSERT INTO base_table_ttl(pk1, pk2, expires_at)
VALUES (NEW.pk1, NEW.pk2, now() + interval '90 days')
ON CONFLICT (pk1, pk2) DO UPDATE
SET expires_at = EXCLUDED.expires_at;
RETURN NEW;
END $$;
DROP TRIGGER IF EXISTS trg_sidecar_ins ON base_table;
CREATE TRIGGER trg_sidecar_ins
AFTER INSERT ON base_table
FOR EACH ROW EXECUTE FUNCTION ttl_sidecar_on_insert();
There’s a sidecar-aware cleanup function below as well.
Cleanup engine: tiny, hash-bucketed deletes (generic)
YugabyteDB’s yb_hash_code() returns a value in the 16-bit range 0..65535. We slice that range into buckets (e.g., 16 buckets of width 4096) and delete only one slice per job so work stays localized. We also delete via ybctid in small batches for storage-friendly scans.
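The slice arithmetic is simple enough to sketch in a few lines; this illustrative snippet (names are ours, not an API) computes the (lo, hi) bounds used throughout the examples below:

```python
# Sketch: equal-width slices of the 16-bit yb_hash_code() space (0..65535).
HASH_SPACE = 65536  # yb_hash_code() returns values in 0..65535

def hash_slices(n_buckets):
    """Return inclusive (lo, hi) bounds for each of n_buckets slices."""
    width = HASH_SPACE // n_buckets
    return [(b * width, b * width + width - 1) for b in range(n_buckets)]

def bucket_for(hash_code, n_buckets):
    """Which slice a given yb_hash_code() value falls into."""
    return hash_code // (HASH_SPACE // n_buckets)

slices = hash_slices(16)
# slices[0] == (0, 4095); slices[15] == (61440, 65535)
```

The same arithmetic appears in the one-row smoke test near the end, where bucket = hash_code / 4096 locates the slice to sweep.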
Cleanup function (auto-detects HASH key columns)
● Uses your explicit p_hash_cols if provided.
● Otherwise reads the PRIMARY KEY definition and extracts only the HASH-partitioned columns (e.g., PRIMARY KEY((c1,c2) HASH, c3) → hashes c1, c2).
● Falls back to all PK columns if the PK has no explicit HASH segment.
● Deletes up to p_batch_limit rows per call from one hash slice.
CREATE OR REPLACE FUNCTION yb_ttl_delete_bucket(
p_schema text, -- e.g. 'public'
p_table text, -- e.g. 'your_table'
p_ts_col text, -- 'expires_at' for Options A/B; for sidecar use the helper below
p_retention interval, -- '0 days' when p_ts_col already encodes expiry
p_lo int, -- inclusive hash lower bound (0..65535)
p_hi int, -- inclusive hash upper bound
p_batch_limit int DEFAULT 5000,
p_hash_cols text[] DEFAULT NULL -- optional override: ARRAY['tenant_id','id']
) RETURNS integer
LANGUAGE plpgsql AS $$
DECLARE
v_sql text;
v_deleted int;
v_hash_arg text; -- "col1,col2,..."
v_pkdef text;
v_csv text;
BEGIN
IF p_hash_cols IS NOT NULL AND array_length(p_hash_cols, 1) IS NOT NULL THEN
SELECT string_agg(format('%I', col), ',') INTO v_hash_arg
FROM unnest(p_hash_cols) AS col;
ELSE
SELECT pg_get_indexdef(i.indexrelid) INTO v_pkdef
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = p_schema
AND c.relname = p_table
AND i.indisprimary;
IF v_pkdef IS NULL THEN
RAISE EXCEPTION 'Table %.% has no PRIMARY KEY', p_schema, p_table;
END IF;
-- Extract columns inside the "((... ) HASH" section if present
SELECT (regexp_matches(v_pkdef, '\(\(([^)]*)\)\s*HASH', 'i'))[1]
INTO v_csv;
IF v_csv IS NOT NULL THEN
SELECT string_agg(format('%I', btrim(s)), ',')
INTO v_hash_arg
FROM unnest(string_to_array(v_csv, ',')) AS s;
ELSE
-- Fallback: all PK columns (ORDER BY inside string_agg keeps key order)
SELECT string_agg(format('%I', a.attname), ',' ORDER BY k.ord)
INTO v_hash_arg
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN unnest(i.indkey) WITH ORDINALITY k(attnum, ord) ON TRUE
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = k.attnum
WHERE n.nspname = p_schema
AND c.relname = p_table
AND i.indisprimary;
END IF;
END IF;
IF v_hash_arg IS NULL OR v_hash_arg = '' THEN
RAISE EXCEPTION
'yb_ttl_delete_bucket: no hash columns resolved for %.%. Provide p_hash_cols or define a PRIMARY KEY.',
p_schema, p_table;
END IF;
v_sql := format($f$
WITH doomed AS (
SELECT ybctid
FROM %1$I.%2$I
WHERE %3$I < now() - $1
AND yb_hash_code(%4$s) BETWEEN $2 AND $3
ORDER BY ybctid
LIMIT $4
)
DELETE FROM %1$I.%2$I e
USING doomed d
WHERE e.ybctid = d.ybctid
$f$, p_schema, p_table, p_ts_col, v_hash_arg);
EXECUTE v_sql USING p_retention, p_lo, p_hi, p_batch_limit;
-- EXECUTE ... INTO would capture only the first RETURNING row, so read
-- the actual number of deleted rows from the diagnostics instead
GET DIAGNOSTICS v_deleted = ROW_COUNT;
RETURN v_deleted;
END $$;
Example (Options A/B): delete one slice where expires_at is in the past.
SELECT yb_ttl_delete_bucket(
'public','your_table','expires_at',
'0 days', -- the column already encodes expiry
0, 4095, -- bucket slice (0..65535; width 4096 for 16 buckets)
5000
);
Sidecar variant (Option C)
This version joins your sidecar registry to find expired rows, then deletes from the base table in tiny hash slices.
CREATE OR REPLACE FUNCTION yb_ttl_delete_bucket_sidecar(
p_schema_base text,
p_table_base text,
p_schema_ttl text,
p_table_ttl text,
p_retention interval, -- usually '0 days'
p_lo int, -- inclusive 0..65535
p_hi int, -- inclusive
p_batch_limit int DEFAULT 5000,
p_hash_cols text[] DEFAULT NULL -- optional override of hash cols
) RETURNS integer
LANGUAGE plpgsql AS $$
DECLARE
v_sql text;
v_deleted int;
v_pkdef text;
v_csv text;
v_hasharg text; -- "b.col1,b.col2,..."
v_pklist text; -- "pk1,pk2,..."
BEGIN
-- Resolve hash columns: explicit override wins, else derive from PK (prefer HASH key)
IF p_hash_cols IS NOT NULL AND array_length(p_hash_cols,1) IS NOT NULL THEN
SELECT string_agg(format('b.%I', col), ',') INTO v_hasharg
FROM unnest(p_hash_cols) AS col;
ELSE
SELECT pg_get_indexdef(i.indexrelid) INTO v_pkdef
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = p_schema_base
AND c.relname = p_table_base
AND i.indisprimary;
IF v_pkdef IS NULL THEN
RAISE EXCEPTION '%.% has no PRIMARY KEY', p_schema_base, p_table_base;
END IF;
-- Try explicit HASH key first
SELECT (regexp_matches(v_pkdef, '\(\(([^)]*)\)\s*HASH', 'i'))[1] INTO v_csv;
IF v_csv IS NOT NULL THEN
SELECT string_agg(format('b.%I', btrim(s)), ',') INTO v_hasharg
FROM unnest(string_to_array(v_csv, ',')) AS s;
ELSE
-- Fallback: all PK cols in key order (ORDER BY inside string_agg)
SELECT string_agg(format('b.%I', a.attname), ',' ORDER BY k.ord) INTO v_hasharg
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN unnest(i.indkey) WITH ORDINALITY k(attnum, ord) ON TRUE
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = k.attnum
WHERE n.nspname = p_schema_base
AND c.relname = p_table_base
AND i.indisprimary;
END IF;
END IF;
IF v_hasharg IS NULL OR v_hasharg = '' THEN
RAISE EXCEPTION 'No hash columns resolved for %.%. Provide p_hash_cols or define a PRIMARY KEY.',
p_schema_base, p_table_base;
END IF;
-- Build USING(pk1,pk2,...) for the equality join (ORDER BY inside string_agg)
SELECT string_agg(format('%I', a.attname), ',' ORDER BY k.ord) INTO v_pklist
FROM pg_index i
JOIN pg_class c ON c.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
JOIN unnest(i.indkey) WITH ORDINALITY k(attnum, ord) ON TRUE
JOIN pg_attribute a ON a.attrelid = c.oid AND a.attnum = k.attnum
WHERE n.nspname = p_schema_base
AND c.relname = p_table_base
AND i.indisprimary;
v_sql := format($f$
WITH doomed AS (
SELECT b.ybctid
FROM %1$I.%2$I b
JOIN %3$I.%4$I t USING (%5$s) -- USING(pk1, pk2, ...)
WHERE t.expires_at < now() - $1
AND yb_hash_code(%6$s) BETWEEN $2 AND $3
ORDER BY b.ybctid
LIMIT $4
)
DELETE FROM %1$I.%2$I b
USING doomed d
WHERE b.ybctid = d.ybctid
$f$, p_schema_base, p_table_base, p_schema_ttl, p_table_ttl, v_pklist, v_hasharg);
EXECUTE v_sql USING p_retention, p_lo, p_hi, p_batch_limit;
-- Read the real deleted-row count from the diagnostics, not from RETURNING
GET DIAGNOSTICS v_deleted = ROW_COUNT;
RETURN v_deleted;
END $$;
Schedule tiny jobs with pg_cron
Cover the full hash space with small, staggered tasks. Sixteen buckets → width 4096.
-- Example: four of the sixteen; repeat pattern for b4..b15
SELECT cron.schedule('ttl_tbl_b0', '* * * * *',
$$SELECT yb_ttl_delete_bucket('public','your_table','expires_at','0 days', 0, 4095, 5000);$$);
SELECT cron.schedule('ttl_tbl_b1', '*/2 * * * *',
$$SELECT yb_ttl_delete_bucket('public','your_table','expires_at','0 days', 4096, 8191, 5000);$$);
SELECT cron.schedule('ttl_tbl_b2', '*/3 * * * *',
$$SELECT yb_ttl_delete_bucket('public','your_table','expires_at','0 days', 8192, 12287, 5000);$$);
SELECT cron.schedule('ttl_tbl_b3', '*/4 * * * *',
$$SELECT yb_ttl_delete_bucket('public','your_table','expires_at','0 days', 12288, 16383, 5000);$$);
Tuning tips: keep p_batch_limit small (2k–10k) and scale throughput by adding buckets or running jobs more frequently. If you can adopt time-range partitioning later, switch to drop/detach for the fastest expiry.
Soft TTL (hide first, purge later)
If you just need to stop serving expired rows:
-- With Options A/B:
CREATE VIEW your_table_visible AS
SELECT * FROM your_table WHERE expires_at >= now();
-- Or RLS:
ALTER TABLE your_table ENABLE ROW LEVEL SECURITY;
CREATE POLICY keep_fresh_rows ON your_table
USING (expires_at >= now());
With the sidecar, join the registry in the view/RLS predicate.
One-row smoke test (forces a delete in the right slice)
yugabyte=# \d your_table
Table "public.your_table"
Column | Type | Collation | Nullable | Default
------------+--------------------------+-----------+----------+---------
c1 | integer | | not null |
expires_at | timestamp with time zone | | |
Indexes:
"your_table_pkey" PRIMARY KEY, lsm (c1 HASH)
Triggers:
trg_set_expires_ins BEFORE INSERT ON your_table FOR EACH ROW EXECUTE FUNCTION set_expires_on_insert()
yugabyte=# -- 1) Insert an expired row
yugabyte=# INSERT INTO your_table VALUES (1, now() - interval '1 hour');
INSERT 0 1
yugabyte=# SELECT * FROM your_table;
c1 | expires_at
----+-------------------------------
1 | 2025-09-16 14:22:59.211378+00
(1 row)
yugabyte=# -- 2) Find its slice
yugabyte=# WITH h AS (
yugabyte(# SELECT yb_hash_code(c1) AS hc FROM your_table WHERE c1 = 1
yugabyte(# )
yugabyte-# SELECT (hc/4096)::int AS bucket,
yugabyte-# ( (hc/4096)::int * 4096 ) AS lo,
yugabyte-# ( (hc/4096)::int * 4096 + 4095 ) AS hi
yugabyte-# FROM h;
bucket | lo | hi
--------+------+------
1 | 4096 | 8191
(1 row)
-- 3) Run the delete for that slice (using the lo/hi from step 2)
yugabyte=# SELECT yb_ttl_delete_bucket('public','your_table','expires_at','0 days', 4096, 8191, 5000);
yb_ttl_delete_bucket
----------------------
1
(1 row)
yugabyte=# SELECT * FROM your_table;
c1 | expires_at
----+------------
(0 rows)
Wrap-up
● YCQL has native per-row TTL (see: https://docs.yugabyte.com/preview/develop/learn/ttl-data-expiration-ycql/#row-level-ttl).
● In YSQL, emulate it by storing an expiry (column or sidecar) and running tiny, hash-bucketed delete jobs (or go straight to time partitions and drop/detach).
● These approaches are production-ready and keep your cluster healthy while delivering YCQL-style behavior on the SQL side.
Have Fun!
