Primary keys are your best defense against duplicates, but sometimes you inherit a table without one, or you want to de-dupe by a non-key column. In YugabyteDB, the cleanup strategy depends on whether a primary key exists.
Let’s explore two scenarios: one with a table that includes a primary key, and another with a table that has no primary key at all.
CREATE TABLE test1 (c1 INT, c2 TEXT);
CREATE TABLE test2 (c1 INT PRIMARY KEY, c2 TEXT);
INSERT INTO test1 VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'A'), (5, 'D');
INSERT INTO test2 VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'A'), (5, 'D');
SELECT ybctid, * FROM test1;
SELECT ybctid, * FROM test2;
Removing duplicates when you have a primary key
When a table has a primary key (c1 in test2), you can use it directly to target rows for deletion. This is the cleanest approach.
Goal: keep one row per c2, delete extras.
WITH d AS (
SELECT c1,
ROW_NUMBER() OVER (PARTITION BY c2 ORDER BY c1) AS rn
FROM test2
)
DELETE FROM test2 t
USING d
WHERE t.c1 = d.c1
AND d.rn > 1;
yugabyte=# SELECT * FROM test2 ORDER BY c1;
c1 | c2
----+----
1 | A
2 | B
3 | C
4 | A
5 | D
(5 rows)
yugabyte=# WITH d AS (
yugabyte(# SELECT c1,
yugabyte(# ROW_NUMBER() OVER (PARTITION BY c2 ORDER BY c1) AS rn
yugabyte(# FROM test2
yugabyte(# )
yugabyte-# DELETE FROM test2 t
yugabyte-# USING d
yugabyte-# WHERE t.c1 = d.c1
yugabyte-# AND d.rn > 1;
DELETE 1
yugabyte=# SELECT * FROM test2 ORDER BY c1;
c1 | c2
----+----
1 | A
2 | B
3 | C
5 | D
(4 rows)
Removing duplicates when there is no primary key
For test1, there’s no PK—two identical rows can’t be distinguished by column values alone.
That’s where YugabyteDB’s hidden hero, ybctid, comes in.
ybctid is a unique internal row identifier. Even if two rows have identical column values, their ybctids are different.
View it:
yugabyte=# \d test1;
Table "public.test1"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
c1 | integer | | |
c2 | text | | |
yugabyte=# SELECT ybctid, * FROM test1 ORDER By c1;
ybctid | c1 | c2
----------------------------------------------------+----+----
\x47496653cacb55f1e4ab452495c123c412b6e16100002121 | 1 | A
\x478cb853e753424b7fed40b6b1533c1fe704e8b000002121 | 2 | B
\x47380d533dc7e5f68d844e2283032af167c3bc1a00002121 | 3 | C
\x477a6253f7edd9a93a154c90895e66f5c8502af800002121 | 4 | A
\x47129453670110ca1b1a4d2596490326d3a0977900002121 | 5 | D
(5 rows)
Cleanup using ybctid:
WITH d AS (
SELECT ybctid,
ROW_NUMBER() OVER (PARTITION BY c2 ORDER BY ybctid) AS rn
FROM test1
)
DELETE FROM test1 t
USING d
WHERE t.ybctid = d.ybctid
AND d.rn > 1;
yugabyte=# WITH d AS (
yugabyte(# SELECT ybctid,
yugabyte(# ROW_NUMBER() OVER (PARTITION BY c2 ORDER BY ybctid) AS rn
yugabyte(# FROM test1
yugabyte(# )
yugabyte-# DELETE FROM test1 t
yugabyte-# USING d
yugabyte-# WHERE t.ybctid = d.ybctid
yugabyte-# AND d.rn > 1;
DELETE 1
yugabyte=# SELECT * FROM test1 ORDER BY c1;
c1 | c2
----+----
1 | A
2 | B
3 | C
5 | D
(4 rows)
Variations you’ll actually use
• Keep the latest row instead of the first:
ROW_NUMBER() OVER (PARTITION BY c2 ORDER BY c1 DESC)
• Composite “duplicate” definition:
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY c1)
Practical Notes
• If you have a PK, use it… it’s cleaner and portable.
• If you don’t,
ybctidis a lifesaver for one-time cleanups.- •
ybctidis exposed to the SQL layer starting in Yugabyte 2025.1. • Don’t depend on
ybctidin application logic, it’s an internal physical identifier, not a business key.• For ongoing duplicate prevention, add a PRIMARY KEY or UNIQUE constraint.
Have Fun!
