The UPDATE
statement modifies the values of specified columns in all rows that meet certain conditions. When no conditions are specified in the WHERE
clause, all rows are updated. The UPDATE
operation also outputs the number of rows that have been updated.
Example:
[yugabyte@ip-10-36-1-66 ~]$ alias y
alias y='ysqlsh -h $(hostname -I)'
[yugabyte@ip-10-36-1-66 ~]$ y -c "\d some_table;"
Table "public.some_table"
Column | Type | Collation | Nullable | Default
--------+--------+-----------+----------+---------
c1 | bigint | | not null |
c2 | bigint | | not null |
c3 | text | | |
Indexes:
"some_table_pkey" PRIMARY KEY, lsm ((c1, c2) HASH)
[yugabyte@ip-10-36-1-66 ~]$ y -c "\timing on" -c "SELECT COUNT(*) FROM some_table;"
Timing is on.
count
---------
5000000
(1 row)
Time: 633.549 ms
C3
column for all rows in the table. How long would that take using a standard UPDATE
statement?
[yugabyte@ip-10-36-1-66 ~]$ y -c "\timing on" -c "UPDATE some_table SET c3 = 'ZZ';"
Timing is on.
UPDATE 5000000
Time: 163448.695 ms (02:43.449)
That took about 2.5 minutes.
UPDATE
into smaller parts using the YB_HASH_CODE
function to target specific partition keys, then execute these smaller UPDATE
s in parallel across the cluster. UPDATE
statements, let’s use a stored procedure.
CREATE OR REPLACE FUNCTION gen_updates(u TEXT, s INT, lb BOOLEAN DEFAULT FALSE, tf TEXT DEFAULT 'update_log.txt')
RETURNS TABLE(ddl text)
AS $$
DECLARE
table_name TEXT;
partition_key TEXT;
ips TEXT[];
BEGIN
table_name := TRIM(BOTH ' ' FROM SUBSTRING(LOWER(u) FROM 'update(.+)set'));
partition_key = (SELECT translate(substring(pg_catalog.pg_get_indexdef(i.indexrelid, 0, true) from 'lsm(.+)HASH'), '() ', '') FROM pg_catalog.pg_class c, pg_catalog.pg_class c2, pg_catalog.pg_index i LEFT JOIN pg_catalog.pg_constraint con ON (conrelid = i.indrelid AND conindid = i.indexrelid AND contype = 'p') WHERE c.oid = table_name::regclass::oid AND c.oid = i.indrelid AND i.indexrelid = c2.oid ORDER BY i.indisprimary DESC, i.indisunique DESC, c2.relname);
IF lb THEN
ips=ARRAY(SELECT host FROM yb_servers());
RETURN QUERY (WITH a AS (SELECT generate_series(1, s) g) SELECT 'ysqlsh -h ' || ips[mod(row_number() over(), array_length(ips, 1))+1] || ' -c "' || u || ' WHERE yb_hash_code(' || partition_key || ') >= ' || (65536 - (g*(65536 / s))) || ' AND ' || 'yb_hash_code(' || partition_key || ') < ' || ((65536) - (g*(65536 / s)) + (65536 / s)) || ';" | awk ''{$1= ""; print $0}'' >> ' || tf || ' &' AS sql FROM a ORDER BY g DESC);
ELSE
ips=ARRAY(SELECT SPLIT_PART(inet_server_addr()::TEXT, '/', 1));
RETURN QUERY (WITH a AS (SELECT generate_series(1, s) g) SELECT 'ysqlsh -h ' || ips[1] || ' -c "' || u || ' WHERE yb_hash_code(' || partition_key || ') >= ' || (65536 - (g*(65536 / s))) || ' AND ' || 'yb_hash_code(' || partition_key || ') < ' || ((65536) - (g*(65536 / s)) + (65536 / s)) || ';" | awk ''{$1= ""; print $0}' >> ' || tf || ' &'' AS sql FROM a ORDER BY g DESC);
END IF;
END;
$$ language plpgsql;
The stored procedure accepts four parameters:
u
: A TEXT value representing theUPDATE
statement. The procedure will extract the table name from thisUPDATE
and identify the partition key column(s) for the table.s
: An INT value representing the number of segments to split theUPDATE
statement into. Optimal values are powers of 2 (e.g., 2, 4, 8, 16, 32, etc.).lb
: A BOOLEAN flag that, when set toTRUE
, enables load balancing across the cluster; if set toFALSE
, only the connected local node is used. The default isFALSE
.tf
: A TEXT value specifying the log file for recording row update counts. The default is'update_log.txt'
.
Here’s an example of the output from the stored procedure, where an UPDATE
is split into two parts and load-balanced to run on separate nodes.
yugabyte=# SELECT gen_updates('UPDATE some_table SET c3 = ''Z''', 2, TRUE);
gen_updates
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ysqlsh -h 198.71.48.248 -c "UPDATE some_table SET c3 = 'Z' WHERE yb_hash_code(c1,c2) >= 0 AND yb_hash_code(c1,c2) < 32768;" | awk '{$1= ""; print $0}' >> update_log.txt &
ysqlsh -h 127.0.0.3 -c "UPDATE some_table SET c3 = 'Z' WHERE yb_hash_code(c1,c2) >= 32768 AND yb_hash_code(c1,c2) < 65536;" | awk '{$1= ""; print $0}' >> update_log.txt &
(2 rows)
In this test, we’ll use the stored procedure to generate 64 smaller updates and run them all in parallel in the background.
I am using the Linux time
function to measure how long it takes for the total row counts in the update_log.txt
file to match the row count of the table, indicating that all rows have been updated.
[yugabyte@ip-10-36-1-66 ~]$ echo '0' > update_log.txt
[yugabyte@ip-10-36-1-66 ~]$ cat update_log.txt
0
[yugabyte@ip-10-36-1-66 ~]$ y -Atc "SELECT gen_updates('UPDATE some_table SET c3 = ''Z''', 64, TRUE);" | bash
[yugabyte@ip-10-36-1-66 ~]$ time while [ $(awk '{ sum += $1 } END { print sum }' update_log.txt) -ne 5000000 ]; do :; done;
real 0m28.070s
user 0m12.431s
sys 0m13.246s
Wow! That’s an impressive boost in performance!
yugabyte=# SELECT ((163448.695 - 28070) / 163448.695 * 100)::NUMERIC(5, 2) AS "% Boost in Performance!";
% Boost in Performance!
-------------------------
82.83
(1 row)
Have Fun!