Hash Function Result Depends on Data Type

yb_hash_code is a function that returns the hash of a set of given input values using the hash function DocDB uses to shard its data.

In effect, it provides direct access to the hash value of any given row of a YSQL table, allowing one to infer a row’s physical location.

This enables an application to specify queries based on the physical location of a row or set of rows.

The hashed value returned from yb_hash_code is influenced by the data type of the value passed to it.

Example:

				
					yugabyte=# \x
Expanded display is on.

yugabyte=# SELECT YB_HASH_CODE(12345::INT)             "12345::INT",
yugabyte-#        YB_HASH_CODE(12345::VARCHAR(10))     "12345::VARCHAR(10)",
yugabyte-#        YB_HASH_CODE(12345::VARCHAR(100))    "12345::VARCHAR(100)",
yugabyte-#        YB_HASH_CODE(12345::NUMERIC(100,0))  "12345::NUMERIC(100,0)",
yugabyte-#        YB_HASH_CODE(12345::NUMERIC(100,5))  "12345::NUMERIC(100,5)",
yugabyte-#        YB_HASH_CODE(12345::NUMERIC(100,10)) "12345::NUMERIC(100,10)",
yugabyte-#        YB_HASH_CODE(12345::FLOAT)           "12345::FLOAT";
-[ RECORD 1 ]----------+------
12345::INT             | 42669
12345::VARCHAR(10)     | 22093
12345::VARCHAR(100)    | 22093
12345::NUMERIC(100,0)  | 11640
12345::NUMERIC(100,5)  | 11640
12345::NUMERIC(100,10) | 11640
12345::FLOAT           | 9431

When joining two tables a column that is part of the partition key, it’s important that the columns have the same data type.

Otherwise there is a chance the the data from one table might reside on a different node than the other table.

Example:

				
					yugabyte=# CREATE TABLE a (c INT PRIMARY KEY);
CREATE TABLE

yugabyte=# INSERT INTO a SELECT 1;
INSERT 0 1

yugabyte=# CREATE TABLE b ( c NUMERIC(10, 0) PRIMARY KEY);
CREATE TABLE

yugabyte=# INSERT INTO b SELECT 1;
INSERT 0 1

yugabyte=# SELECT COUNT(*) FROM a JOIN b ON b.c = a.c; -- This JOIN is okay...
 count
-------
     1
(1 row)

yugabyte=# SELECT COUNT(*) FROM a JOIN b ON YB_HASH_CODE(b.c) = YB_HASH_CODE(a.c); -- But the data is on different nodes!
 count
-------
     0
(1 row)

This scenario could result in higher query latency dependning on the network between the nodes.

It’s best practice to join tables on columns that have the same data type – especially in a distributed SQL database like YugabyteDB!

Have Fun!