Hash-based sharding is a method of distributing data across multiple database shards by applying a hash function to a specific key. The hash function generates a hash value, which is then used to determine the shard where the data should be stored. This approach ensures a more uniform distribution of data compared to direct key sharding.
Hash-based sharding still uses key ranges, just like direct key sharding. The difference is that your key is first transformed by a hash function (e.g., MURMUR or CITY), and the resulting hash value is then routed using key ranges defined over the hash space.
For example, if you are sharding based on user IDs, the hash function will take a user ID as input and produce a hash value. This hash value is then mapped to one of the available shards using key ranges. The same hash function is used consistently to ensure that the same key always maps to the same shard.
Hash-based sharding helps to evenly distribute the data and load across all shards, reducing the risk of hotspots and ensuring better performance and scalability. However, it can make range queries more complex, as the data for a given range of keys may be spread across multiple shards.
SPQR also supports composite (multi-column) sharding keys with hash functions, which is particularly useful when you have a low-cardinality first column combined with a high-cardinality second column.
How to use it?
To use a hash function as your distribution(sharding) key, you need to specify it in the administrative router or coordination console using the HASH FUNCTION keyword:
ALTER DISTRIBUTION ds1 ATTACH RELATION r4 DISTRIBUTION KEY col1 HASH FUNCTION CITY;
attach table
------------------------
relation name -> r4
distribution id -> ds1
(2 rows)
Then, you create key ranges over the hash space. For hash functions like MURMUR, the output is a 32-bit unsigned integer, so the hash space is 0 to 4294967295:
-- Split hash space between two shards
-- First half: 0 to 2147483647 -> shard01
-- Second half: 2147483648 to 4294967295 -> shard02
CREATE KEY RANGE krid1 FROM 0 ROUTE TO shard01 FOR DISTRIBUTION ds1;
add key range
---------------
bound -> 0
(1 row)
CREATE KEY RANGE krid2 FROM 2147483648 ROUTE TO shard02 FOR DISTRIBUTION ds1;
add key range
---------------
bound -> 2147483648
(1 row)
When creating multiple key ranges, create them in descending order by lower bound (highest first). This ensures proper range boundary handling.
For more examples, see this test.