Locality Sensitive Hashing

Mon 15 May 2023

Co-Founder

4 min read

Locality Sensitive Hashing (LSH) is a computer science technique for designing hash functions for high-dimensional data. LSH aims to maximise the probability that similar items map to the same hash bucket. The principle is to amplify the difference between similar and dissimilar items, keeping similar items closer together while pushing dissimilar items farther apart. This gives a probabilistic approach to nearest neighbour search, a problem that appears in machine learning, data mining, and information retrieval.

Background and Usage of LSH

LSH was introduced by Piotr Indyk and Rajeev Motwani in 1998 to address the "curse of dimensionality," where traditional computational algorithms become significantly less efficient as the dimensionality of a problem increases. This makes LSH useful in contexts dealing with large and complex data sets, including text analysis, recommendation systems, image recognition, and genome sequencing.

The general idea is to use hashing to map high-dimensional data into a lower-dimensional hash space. The hashing scheme is designed so that collisions are much more likely for objects close to each other in the original space than for objects far apart. This supports sub-linear time complexity for nearest neighbour searches, which matters when working with large data sets.

LSH Variant for Load Balancing

LSH is also used in load balancing for distributed systems. A variant of LSH, known as Consistent Hashing, is often used for this purpose. In consistent hashing, the output range of a hash function is treated as a fixed circular space or "ring" (think of this as a circular hash table). Each node or server in the system is assigned a random value within this space, and each data item or request is assigned to the node closest to it in the hash space.

The practical advantage of consistent hashing is that when a node is added or removed, only an average of 1/n keys need to be remapped, where n is the number of servers. Conventional hash functions would require almost all keys to be remapped. This property makes consistent hashing useful in dynamic environments where the server set changes often, such as in a CDN (Content Delivery Network).

LSH in CDNs for Load Splitting and Caching

In a CDN, data is distributed across multiple servers in different geographical regions. The objective is to let users access the data they need as quickly as possible. This is achieved by routing each user's request to the server closest to them that has the data they want.

Consistent hashing, a variant of LSH, fits this scenario. By hashing both data and user requests, we can assign each request to the server closest in the hash space. This approach gives an even distribution of load across all servers, improving overall system performance.

LSH can also make CDN caching more efficient. When a request for a specific piece of content comes in, the CDN can use the hash function to find the server holding that content. The content can then be served quickly, improving the user experience.

LSH and Server Pool Rebalancing

LSH also helps with rebalancing server pools. In dynamic environments, workload changes constantly, and servers may be added or removed frequently. This can create an imbalance in load distribution across servers.

Consistent hashing mitigates this issue. As mentioned earlier, when a server is added or removed, consistent hashing only requires a minimal reassignment of keys. This keeps load redistribution to a minimum, so the system can quickly adapt while maintaining a balanced load across all servers.

Maglev and LSH

Maglev is a network load balancer developed by Google that uses a distinctive approach to connection distribution. It employs a variant of consistent hashing to balance load across a pool of servers. Using a large lookup table and a consistent hashing-like algorithm, Maglev can evenly distribute traffic across servers, even when servers are added or removed.

Maglev's algorithm starts by assigning a pseudorandom permutation of backend servers for each entry in the lookup table. These entries are then filled by iterating over the permutation and assigning each entry to a server. If a server becomes unavailable, its entries are reassigned to other servers using the same permutation. If a server becomes available, it is gradually reintroduced into the permutation, minimising disruption. This is a practical demonstration of LSH's principles applied to server load, high availability, and performance.

Conclusion

Locality Sensitive Hashing has applications from high-dimensional data analysis to load balancing in distributed systems. Its property of bringing similar items closer in the hash space and pushing dissimilar items farther apart makes it useful for handling large and complex datasets. In CDNs and server pool management, LSH, and its variant Consistent Hashing, support balanced load distribution and efficient caching, improving system performance and user experience. Its application in Google's Maglev shows how these principles can be used in production-scale load balancing.

#SEO