Lecture 5 - Locality-Sensitive Hashing (LSH) and Distance Metrics

Motivation

Goal: Find pairs with similarity ≥ s.
Method:
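1. Divide the signature matrix into b bands of r rows each.
2. For each band, hash every column's r-row segment into buckets (a separate set of buckets per band).
3. Pairs of columns that share a bucket in at least one band → candidates.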
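The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the function name `lsh_candidates`, the dictionary layout of `signatures`, and the choice to use the band's tuple of values directly as the bucket key (so only identical bands collide) are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, b, r):
    """Return candidate pairs from {item_id: signature of length b * r}."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # fresh set of buckets for every band
        for item_id, sig in signatures.items():
            # Assumption: the r values of the band act as the bucket key,
            # so two items collide only if the whole band is identical.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(item_id)
        # Every pair that shares a bucket in this band becomes a candidate.
        for members in buckets.values():
            for pair in combinations(sorted(members), 2):
                candidates.add(pair)
    return candidates


# Toy signatures (made-up values): A and B agree on bands 0 and 2.
signatures = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [1, 2, 9, 9, 5, 6],
    "C": [7, 7, 7, 7, 7, 7],
}
print(lsh_candidates(signatures, b=3, r=2))  # {('A', 'B')}
```

Candidates still need to be verified against the full signatures or the original data, since sharing one band does not guarantee similarity ≥ s.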

$P(\text{candidate}) = 1 - (1 - s^{r})^{b}$
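A short worked check of the formula, where s is the similarity of the particular pair being tested. The values b = 20, r = 5, and s = 0.8 are illustrative choices, not taken from the lecture.

$s^{r} = 0.8^{5} = 0.32768$

$(1 - s^{r})^{b} = 0.67232^{20} \approx 0.00036$

$P(\text{candidate}) \approx 1 - 0.00036 = 0.99964$

Under these assumed parameters, a pair with similarity 0.8 is missed roughly 4 times in 10,000.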

Parameter	Effect
Increase r (b fixed)	Fewer false positives, more false negatives
Increase b (r fixed)	Fewer false negatives, more false positives
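The effect of each parameter can be checked numerically from the candidate-probability formula. The sketch below is only an illustration; the function name `candidate_probability` and the particular (r, b) settings are assumptions chosen for the example.

```python
def candidate_probability(s, r, b):
    """P(candidate) = 1 - (1 - s^r)^b for a pair with similarity s."""
    return 1.0 - (1.0 - s ** r) ** b


# Compare a baseline with a larger r and with a larger b.
for r, b in [(5, 20), (10, 20), (5, 40)]:
    row = ", ".join(
        f"s={s:.1f}: {candidate_probability(s, r, b):.4f}" for s in (0.2, 0.5, 0.8)
    )
    print(f"r={r:2d}, b={b:2d} -> {row}")
```

Raising r lowers the probability at every similarity level (dissimilar pairs are filtered out, but some similar pairs are lost), while raising b raises it at every level.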

Distance Metrics

Euclidean distance: $d(x, y) = \sqrt{\sum_{i} (x_{i} - y_{i})^{2}}$

Manhattan distance: $d(x, y) = \sum_{i} \lvert x_{i} - y_{i} \rvert$
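The two coordinate-wise distances above can be sketched as follows. The function names are hypothetical, and the Euclidean version assumes the usual L2 form, i.e. the square root of the summed squared differences.

```python
import math


def euclidean_distance(x, y):
    # Square root of the summed squared differences (L2 norm of x - y).
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan_distance(x, y):
    # Sum of absolute coordinate differences (L1 norm of x - y).
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


x, y = (1, 2, 3), (4, 6, 3)
print(euclidean_distance(x, y))  # 5.0
print(manhattan_distance(x, y))  # 7
```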

Jaccard distance: $d(A, B) = 1 - \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}$
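For sets, the Jaccard-style distance above translates almost directly into Python. This is a sketch with a hypothetical function name; treating two empty sets as distance 0 is an assumption, since the formula is undefined when the union is empty.

```python
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(union)


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```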

Cosine distance: $d(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$
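A minimal sketch of the cosine-based distance above, using plain Python sequences. The function name is hypothetical, and raising an error for zero vectors is a design assumption, since the formula divides by the norms.

```python
import math


def cosine_distance(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


print(cosine_distance((1, 0), (0, 1)))  # orthogonal vectors: 1.0
```

The value depends only on the angle between the vectors, so scaling either vector by a positive constant leaves it unchanged.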

Edit distance: minimum number of insertions and deletions needed to transform one string into another.
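When only insertions and deletions are allowed, the smallest such count can be obtained from the longest common subsequence (LCS): delete everything in x outside the LCS, then insert everything in y outside it, giving |x| + |y| - 2·LCS(x, y). The sketch below assumes that formulation; the function names are hypothetical.

```python
def lcs_length(x, y):
    # Classic dynamic program, keeping only the previous row.
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(x, y):
    # Insert/delete-only edit distance via the LCS identity.
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(edit_distance("abcde", "acfdeg"))  # 3
```

Here the LCS is "acde" (length 4), so the distance is 5 + 6 - 8 = 3: delete "b", insert "f" and "g".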

Hamming distance: number of positions in which two equal-length binary vectors differ.
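Counting differing positions is a one-liner in Python. This sketch uses a hypothetical function name and assumes both vectors have the same length, raising an error otherwise.

```python
def hamming_distance(x, y):
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # True counts as 1, so summing the comparisons counts the mismatches.
    return sum(xi != yi for xi, yi in zip(x, y))


print(hamming_distance([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # 2
```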