Motivation

  • Even after Min-Hash compression, comparing every pair of signatures is still too expensive: n items yield n(n − 1)/2 pairs.
  • LSH efficiently identifies likely-similar (candidate) pairs without comparing all pairs.

Locality-Sensitive Hashing

  • Goal: Find pairs with similarity ≥ s.
  • Method:
    1. Divide the signature matrix into b bands of r rows each.
    2. For each band, hash each column's r-row segment into buckets.
    3. Pairs sharing a bucket in at least one band → candidates.
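
The banding steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name lsh_candidate_pairs and the list-of-lists signature layout are assumptions made for the example, and each band's tuple of values is used directly as the bucket key (so two items share a bucket only when the band matches exactly).

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, b, r):
    """Return candidate pairs (i, j) with i < j found by banding.

    signatures: one signature (list of b * r values) per item.
    This layout is a hypothetical choice for the sketch.
    """
    for sig in signatures:
        if len(sig) != b * r:
            raise ValueError("each signature must have exactly b * r values")

    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # fresh set of buckets for every band
        for item, sig in enumerate(signatures):
            # The band's r values form the bucket key (dict hashes the tuple).
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(item)
        # Every pair of items in the same bucket becomes a candidate.
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


signatures = [
    [1, 2, 3, 4, 5, 6],  # item 0
    [1, 2, 3, 9, 9, 9],  # item 1: agrees with item 0 in band 0 only
    [7, 7, 7, 8, 8, 8],  # item 2: agrees with no other item
]
print(lsh_candidate_pairs(signatures, b=2, r=3))
```

Candidates still have to be checked against the actual similarity threshold afterwards; banding only narrows down which pairs are worth checking.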

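Probability Formulation

  • Two columns with similarity s agree on all r rows of a given band with probability s^r.
  • No band agrees on all r rows with probability (1 − s^r)^b.
  • The pair becomes a candidate with probability 1 − (1 − s^r)^b.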

Example

  • Parameters: b = 20, r = 5, threshold s = 0.75.
  • 80%-similar pairs → become candidates with ~99.96% probability (~0.04% false negatives).
  • 30%-similar pairs → become candidates with only ~4.7% probability (false positives).
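
The figures in this example can be checked numerically. The sketch below assumes the standard banding formula, P(candidate) = 1 − (1 − s^r)^b, for a pair whose signatures agree in a fraction s of rows; the function name candidate_probability is chosen for illustration.

```python
def candidate_probability(s, b, r):
    """Probability that a pair with similarity s agrees on all r rows
    of at least one of the b bands (assumed standard banding formula)."""
    return 1.0 - (1.0 - s ** r) ** b


# Parameters from the example: b = 20 bands of r = 5 rows.
for s in (0.8, 0.3):
    p = candidate_probability(s, b=20, r=5)
    print(f"similarity {s:.0%}: candidate with probability {p:.2%}")
```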

Trade-off

Parameter     Effect
Increase r    Fewer false positives
Increase b    Fewer false negatives
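
One way to see the trade-off is to compute both error rates for a few (b, r) settings. The sketch below again assumes the banding formula 1 − (1 − s^r)^b; the reference similarities 0.3 (a pair that should not match) and 0.8 (a pair that should match) and the helper name error_rates are illustrative choices.

```python
def candidate_probability(s, b, r):
    """Assumed banding formula: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b


def error_rates(b, r, low=0.3, high=0.8):
    """Return (false-positive rate at similarity `low`,
    false-negative rate at similarity `high`)."""
    false_positive = candidate_probability(low, b, r)
    false_negative = 1.0 - candidate_probability(high, b, r)
    return false_positive, false_negative


for b, r in [(20, 5), (20, 6), (30, 5)]:
    fp, fn = error_rates(b, r)
    print(f"b={b:2d} r={r}: false positives {fp:.4%}, false negatives {fn:.4%}")
```

In this sketch, raising r lowers the false-positive rate but raises the false-negative rate, and raising b does the opposite, so the two parameters are usually tuned together.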

Distance Measures

Euclidean Distance (L₂ norm)

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

Manhattan Distance (L₁ norm)

d(x, y) = Σᵢ |xᵢ − yᵢ|

Jaccard Distance

d(A, B) = 1 − |A ∩ B| / |A ∪ B| for sets A and B.

Cosine Distance

Angle between two vectors: d(x, y) = arccos( (x · y) / (‖x‖ ‖y‖) ).
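
The vector and set distances named in this section can be sketched in plain Python using their standard definitions. The function names are illustrative. Cosine distance is taken here to be the angle between the vectors, in radians; some texts use 1 − cosine similarity instead, so treat that choice as an assumption.

```python
import math


def euclidean(x, y):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity |A n B| / |A u B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def cosine_distance(x, y):
    """Angle between x and y in radians (assumed definition, see lead-in)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))
print(cosine_distance([1, 0], [0, 1]))
```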

Edit Distance

Minimum number of single-character insertions and deletions needed to transform one string into another.
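
For this insertion/deletion-only variant, the distance can be computed from the longest common subsequence (LCS): every character outside the LCS has to be deleted from one string or inserted from the other, giving |x| + |y| − 2·LCS(x, y). The sketch below uses that identity; the function name is illustrative, and it does not count substitutions as a single operation (unlike Levenshtein distance).

```python
def edit_distance(x, y):
    """Insert/delete-only edit distance via the LCS length.

    prev[j] holds the LCS length of the processed prefix of x and y[:j].
    """
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)  # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a character
        prev = curr
    lcs_length = prev[-1]
    return len(x) + len(y) - 2 * lcs_length


# LCS of "abcde" and "acfdeg" is "acde" (length 4): 5 + 6 - 8 = 3.
print(edit_distance("abcde", "acfdeg"))
```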

Hamming Distance

Number of positions at which two equal-length binary vectors differ.
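
A minimal Python sketch of this definition follows, with one version for sequences of bits and one for integers used as bit vectors. Both function names are illustrative.

```python
def hamming_distance(u, v):
    """Count positions where two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def hamming_distance_int(a, b):
    """Same count for non-negative integers treated as bit vectors:
    XOR marks the differing bits, then the set bits are counted."""
    return bin(a ^ b).count("1")


print(hamming_distance([1, 0, 1, 0, 1], [1, 1, 1, 1, 0]))
print(hamming_distance_int(0b10101, 0b11110))
```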


Summary

  • LSH reduces comparisons drastically for large-scale similarity detection.
  • Choice of distance measure depends on data representation and domain.