Boost Performance by Splitting and Distributing Massive UPDATE Operations

The UPDATE statement modifies the values of the specified columns in all rows that satisfy the WHERE clause. When no WHERE clause is given, every row is updated. The statement also reports the number of rows it modified.

Example:

[yugabyte@ip-10-36-1-66 ~]$ alias y
alias y='ysqlsh -h $(hostname -I)'

[yugabyte@ip-10-36-1-66 ~]$ y -c "\d some_table;"
            Table "public.some_table"
 Column |  Type  | Collation | Nullable | Default
--------+--------+-----------+----------+---------
 c1     | bigint |           | not null |
 c2     | bigint |           | not null |
 c3     | text   |           |          |
Indexes:
    "some_table_pkey" PRIMARY KEY, lsm ((c1, c2) HASH)

[yugabyte@ip-10-36-1-66 ~]$ y -c "\timing on" -c "SELECT COUNT(*) FROM some_table;"
Timing is on.
  count
---------
 5000000
(1 row)

Time: 633.549 ms
Suppose I want to update the c3 column for every row in the table. How long would that take using a single, standard UPDATE statement?
[yugabyte@ip-10-36-1-66 ~]$ y -c "\timing on" -c "UPDATE some_table SET c3 = 'ZZ';"
Timing is on.
UPDATE 5000000
Time: 163448.695 ms (02:43.449)

That took about 2 minutes and 43 seconds.

Let’s break down the UPDATE into smaller parts using the YB_HASH_CODE function to target specific partition keys, then execute these smaller UPDATEs in parallel across the cluster.
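yb_hash_code maps a row's partition-key columns to a 16-bit hash value in [0, 65536), so s non-overlapping half-open ranges cover every row exactly once, with no overlap and no gaps. A minimal Python sketch of that range arithmetic (hash_ranges is a hypothetical helper, not part of YugabyteDB):

```python
def hash_ranges(s: int, total: int = 65536):
    """Split the yb_hash_code space [0, total) into s half-open ranges.

    s should evenly divide 65536, which is why powers of 2 work best.
    """
    width = total // s
    return [(i * width, i * width + width) for i in range(s)]

# Each (lo, hi) pair becomes: WHERE yb_hash_code(pk) >= lo AND yb_hash_code(pk) < hi
print(hash_ranges(2))  # [(0, 32768), (32768, 65536)]
```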
To simplify generating the smaller UPDATE statements, let’s use a helper function.
CREATE OR REPLACE FUNCTION gen_updates(u TEXT, s INT, lb BOOLEAN DEFAULT FALSE, tf TEXT DEFAULT 'update_log.txt')
  RETURNS TABLE(ddl text)
AS $$
DECLARE
  table_name TEXT;
  partition_key TEXT;
  ips TEXT[];
BEGIN
  table_name := TRIM(BOTH ' ' FROM SUBSTRING(LOWER(u) FROM 'update(.+)set'));
  partition_key := (SELECT translate(substring(pg_catalog.pg_get_indexdef(i.indexrelid, 0, true) from 'lsm(.+)HASH'), '() ', '')
                      FROM pg_catalog.pg_class c, pg_catalog.pg_class c2, pg_catalog.pg_index i
                      LEFT JOIN pg_catalog.pg_constraint con ON (conrelid = i.indrelid AND conindid = i.indexrelid AND contype = 'p')
                     WHERE c.oid = table_name::regclass::oid AND c.oid = i.indrelid AND i.indexrelid = c2.oid
                     ORDER BY i.indisprimary DESC, i.indisunique DESC, c2.relname);
  IF lb THEN
    ips := ARRAY(SELECT host FROM yb_servers());
    RETURN QUERY (WITH a AS (SELECT generate_series(1, s) g)
      SELECT 'ysqlsh -h ' || ips[mod(row_number() over(), array_length(ips, 1)) + 1]
          || ' -c "' || u || ' WHERE yb_hash_code(' || partition_key || ') >= ' || (65536 - (g * (65536 / s)))
          || ' AND yb_hash_code(' || partition_key || ') < ' || (65536 - (g * (65536 / s)) + (65536 / s))
          || ';" | awk ''{$1= ""; print $0}'' >> ' || tf || ' &' AS sql
        FROM a ORDER BY g DESC);
  ELSE
    ips := ARRAY(SELECT SPLIT_PART(inet_server_addr()::TEXT, '/', 1));
    RETURN QUERY (WITH a AS (SELECT generate_series(1, s) g)
      SELECT 'ysqlsh -h ' || ips[1]
          || ' -c "' || u || ' WHERE yb_hash_code(' || partition_key || ') >= ' || (65536 - (g * (65536 / s)))
          || ' AND yb_hash_code(' || partition_key || ') < ' || (65536 - (g * (65536 / s)) + (65536 / s))
          || ';" | awk ''{$1= ""; print $0}'' >> ' || tf || ' &' AS sql
        FROM a ORDER BY g DESC);
  END IF;
END;
$$ language plpgsql;

The function accepts four parameters:

  1. u: A TEXT value representing the UPDATE statement. The procedure will extract the table name from this UPDATE and identify the partition key column(s) for the table.
  2. s: An INT value representing the number of segments to split the UPDATE statement into. Values should be powers of 2 (2, 4, 8, 16, 32, and so on) so that they divide the 65536-value hash space evenly.
  3. lb: A BOOLEAN flag that, when set to TRUE, enables load balancing across the cluster; if set to FALSE, only the connected local node is used. The default is FALSE.
  4. tf: A TEXT value specifying the log file for recording row update counts. The default is 'update_log.txt'.
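For readers who don't want to trace the string concatenation in the SQL, here is the same command-generation logic as a Python sketch. The hosts parameter and the hardcoded 'c1,c2' partition key are illustrative assumptions; the real function discovers both from the statement and the catalog.

```python
def gen_updates(u: str, s: int, hosts: list[str], tf: str = "update_log.txt"):
    """Emit one background ysqlsh command per hash range, round-robin across hosts."""
    pk = "c1,c2"  # illustrative; the SQL version extracts this from the table's primary key
    width = 65536 // s
    cmds = []
    for i in range(s):
        lo, hi = i * width, i * width + width
        host = hosts[i % len(hosts)]  # simple round-robin load balancing
        cmds.append(
            f"ysqlsh -h {host} -c \"{u} WHERE yb_hash_code({pk}) >= {lo} "
            f"AND yb_hash_code({pk}) < {hi};\" | awk '{{$1= \"\"; print $0}}' >> {tf} &"
        )
    return cmds
```

Each generated line runs in the background (the trailing &), and the awk filter strips the leading "UPDATE" word so only the row count lands in the log file.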

Here’s an example of the output from the function, where an UPDATE is split into two parts and load-balanced across two nodes.

yugabyte=# SELECT gen_updates('UPDATE some_table SET c3 = ''Z''', 2, TRUE);
                                                                                 gen_updates
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ysqlsh -h 198.71.48.248 -c "UPDATE some_table SET c3 = 'Z' WHERE yb_hash_code(c1,c2) >= 0 AND yb_hash_code(c1,c2) < 32768;" | awk '{$1= ""; print $0}' >> update_log.txt &
 ysqlsh -h 127.0.0.3 -c "UPDATE some_table SET c3 = 'Z' WHERE yb_hash_code(c1,c2) >= 32768 AND yb_hash_code(c1,c2) < 65536;" | awk '{$1= ""; print $0}' >> update_log.txt &
(2 rows)

In this test, we’ll use the function to generate 64 smaller UPDATEs and run them all in parallel in the background.

I’m using the Linux time command to measure how long it takes for the total row count in the update_log.txt file to reach the table’s row count, indicating that all rows have been updated.
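The completion check works because each finished segment appends its "UPDATE n" row count to update_log.txt, so summing the first field of every line gives the total rows updated so far. A minimal Python equivalent of that summing step (updated_rows is a hypothetical helper):

```python
def updated_rows(log_text: str) -> int:
    """Sum the per-segment row counts logged in update_log.txt.

    Each completed segment logs its 'UPDATE n' result as a bare count,
    so the first whitespace-separated field of each line is a row count.
    """
    total = 0
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            total += int(fields[0])
    return total

# Two finished segments of 78125 rows each (plus the seeded '0' line):
print(updated_rows("0\n 78125\n 78125\n"))  # 156250
```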

[yugabyte@ip-10-36-1-66 ~]$ echo '0' > update_log.txt

[yugabyte@ip-10-36-1-66 ~]$ cat update_log.txt
0

[yugabyte@ip-10-36-1-66 ~]$ y -Atc "SELECT gen_updates('UPDATE some_table SET c3 = ''Z''', 64, TRUE);" | bash
[yugabyte@ip-10-36-1-66 ~]$ time while [ $(awk '{ sum += $1 } END { print sum }' update_log.txt) -ne 5000000 ]; do :; done;

real	0m28.070s
user	0m12.431s
sys	0m13.246s

Wow! That’s an impressive boost in performance!

yugabyte=# SELECT ((163448.695 - 28070) / 163448.695 * 100)::NUMERIC(5, 2) AS "% Boost in Performance!";
 % Boost in Performance!
-------------------------
                   82.83
(1 row)

Have Fun!

Our new Japanese Maple tree!