Motivation
- Even after Min-Hash compression, comparing every pair of signatures is too expensive: n documents give n(n − 1)/2 pairs.
- LSH identifies candidate pairs that are likely to be similar without examining every pair.
Locality-Sensitive Hashing
- Goal: Find pairs with similarity ≥ s.
- Method:
  - Divide the signature matrix into b bands of r rows each.
  - For each band, hash every column's r values in that band into buckets.
  - Pairs that share a bucket in at least one band → candidate pairs.
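The banding steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `lsh_candidate_pairs` is hypothetical, signatures are plain Python lists, and each band's tuple of r values is used directly as the bucket key instead of being hashed into a fixed number of buckets.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, b, r):
    """Return the set of candidate pairs (i, j) with i < j.

    signatures: one Min-Hash signature per document, each a list of
    exactly b * r integers (hypothetical input format for this sketch).
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # separate bucket table for every band
        for doc_id, sig in enumerate(signatures):
            if len(sig) != b * r:
                raise ValueError("signature length must equal b * r")
            # Assumption: the band's r values serve as the bucket key;
            # a real system would hash this tuple into a bucket number.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        # Every pair of documents in the same bucket becomes a candidate.
        for doc_ids in buckets.values():
            candidates.update(combinations(doc_ids, 2))
    return candidates


sigs = [
    [1, 2, 3, 4, 5, 6],  # doc 0
    [1, 2, 9, 9, 5, 7],  # doc 1: agrees with doc 0 in band 0 only
    [8, 8, 8, 8, 8, 8],  # doc 2: agrees with no other document
]
print(lsh_candidate_pairs(sigs, b=3, r=2))  # {(0, 1)}
```

Candidate pairs still need to be verified against the full signatures (or the original sets), since agreeing in one band does not guarantee similarity ≥ s.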
Probability Formulation
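A sketch of the standard banding analysis, assuming each signature row of a pair with similarity $t$ agrees independently with probability $t$ (the symbol $t$ is introduced here to keep a pair's similarity distinct from the threshold $s$):

$$
\begin{aligned}
P(\text{all } r \text{ rows of one band agree}) &= t^{r} \\
P(\text{no band agrees in all rows}) &= (1 - t^{r})^{b} \\
P(\text{pair becomes a candidate}) &= 1 - (1 - t^{r})^{b}
\end{aligned}
$$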
Example
- Parameters: b = 20, r = 5, threshold s = 0.75.
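- 80% similar pairs → become candidates with probability 1 − (1 − 0.8^5)^20 ≈ 99.96% (about 0.04% false negatives).
- 30% similar pairs → become candidates with probability 1 − (1 − 0.3^5)^20 ≈ 4.7% (false positives).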
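The figures in this example can be checked with a few lines of Python. The helper name `candidate_probability` is hypothetical, and the calculation assumes the usual banding model in which a pair with similarity t becomes a candidate with probability 1 − (1 − t^r)^b.

```python
def candidate_probability(t, b, r):
    """Probability that a pair with similarity t agrees in all r rows
    of at least one of the b bands (standard banding model)."""
    return 1.0 - (1.0 - t ** r) ** b


for t in (0.8, 0.3):
    p = candidate_probability(t, b=20, r=5)
    print(f"similarity {t}: candidate probability {p:.4f}")
# similarity 0.8: candidate probability 0.9996
# similarity 0.3: candidate probability 0.0475
```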
Trade-off
| Parameter | Effect |
|---|---|
| Increase r (rows per band) | Fewer false positives, but more false negatives |
| Increase b (number of bands) | Fewer false negatives, but more false positives |
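One way to read the table: the candidate probability $1 - (1 - t^{r})^{b}$, viewed as a function of the pair similarity $t$, is an S-shaped curve. As a rough guide (an approximation, not an exact result), its steepest rise lies near

$$
t^{*} \approx \left(\frac{1}{b}\right)^{1/r}, \qquad b = 20,\ r = 5 \;\Rightarrow\; t^{*} \approx 0.55
$$

Increasing r moves $t^{*}$ toward 1, so fewer dissimilar pairs become candidates; increasing b moves it toward 0, so fewer similar pairs are missed.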
Distance Measures
Euclidean Distance (L₂ norm)
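The usual definition for points $x, y \in \mathbb{R}^{n}$, given here on the assumption that the heading refers to the standard L₂ norm:

$$
d_{2}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^{2}}
$$

For example, the points $(0, 0)$ and $(3, 4)$ are at distance $\sqrt{9 + 16} = 5$.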
Manhattan Distance (L₁ norm)
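The usual definition for points $x, y \in \mathbb{R}^{n}$, given here on the assumption that the heading refers to the standard L₁ norm:

$$
d_{1}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
$$

For example, the points $(0, 0)$ and $(3, 4)$ are at distance $3 + 4 = 7$.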
Jaccard Distance
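The usual definition for two sets $A$ and $B$ (not both empty) is one minus their Jaccard similarity:

$$
d_{J}(A, B) = 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}
$$

For example, $A = \{1, 2, 3\}$ and $B = \{2, 3, 4\}$ share 2 of 4 distinct elements, so $d_{J}(A, B) = 1 - 2/4 = 0.5$.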
Cosine Distance
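Conventions differ, and the one intended here is an assumption. A common choice in the LSH setting is the angle between two nonzero vectors $x$ and $y$:

$$
d_{\cos}(x, y) = \arccos\!\left(\frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}\right)
$$

Another widely used convention is $1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$, which is easier to compute but does not satisfy the triangle inequality.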
Edit Distance
Minimum number of single-character insertions and deletions needed to transform one string into another.
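For this insert/delete-only variant, the distance equals len(x) + len(y) − 2·LCS(x, y), where LCS is the length of a longest common subsequence. The sketch below uses that identity; the function name `edit_distance` is hypothetical. It is not the Levenshtein distance, which also allows substitutions.

```python
def edit_distance(x, y):
    """Insert/delete-only edit distance via the longest common subsequence."""
    # prev[j] = LCS length of the prefix of x processed so far and y[:j]
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs_length = prev[-1]
    return len(x) + len(y) - 2 * lcs_length


# Delete "b", insert "f", insert "g": three edits.
print(edit_distance("abcde", "acfdeg"))  # 3
```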
Hamming Distance
Number of differing bit positions between two binary vectors.
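A minimal sketch of this count, with hypothetical function names: one version for equal-length sequences of bits, and one for bit vectors packed into non-negative Python integers (XOR, then count the set bits).

```python
def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def hamming_distance_int(a, b):
    """Hamming distance of two bit vectors stored as non-negative ints."""
    return bin(a ^ b).count("1")  # XOR marks exactly the differing bits


print(hamming_distance("10101", "11110"))       # 3
print(hamming_distance_int(0b10101, 0b11110))   # 3
```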
Summary
- LSH avoids the all-pairs comparison: only candidate pairs are checked, which drastically reduces work in large-scale similarity detection.
- Choice of distance measure depends on data representation and domain.