Goal
- Identify items or documents that are similar in high-dimensional space.
- Applications:
  - Duplicate detection
  - Plagiarism detection
  - Recommendation systems
Similarity Measures
Jaccard Similarity
- ( sim(A, B) = |A ∩ B| / |A ∪ B| )
- Distance = ( 1 - sim(A, B) )
- Used for sets and binary vectors.
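The definition above is a one-liner over Python sets (a minimal sketch; the function name `jaccard` is my own):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
```

The corresponding Jaccard distance is simply `1 - jaccard(a, b)`.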
Shingling
Step 1:
- Represent each document as a set of k-shingles (substrings of length k).
- Example: ( k=2 ), "abcab" → {ab, bc, ca} (the second occurrence of "ab" is a duplicate and appears once in the set).
- Hash shingles to integers for storage efficiency.
Representation
- Each document becomes a sparse vector in a high-dimensional space.
- Similarity: Jaccard similarity over shingles.
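Shingling plus hashing to integers can be sketched as follows (a minimal illustration; `crc32` is one arbitrary choice of hash for compactness, not prescribed by the notes):

```python
from zlib import crc32

def shingles(text: str, k: int = 2) -> set:
    """All k-character substrings (k-shingles) of text, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text: str, k: int = 2) -> set:
    """Hash each shingle to a 32-bit integer for storage efficiency."""
    return {crc32(s.encode()) for s in shingles(text, k)}

print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'}
```

Document similarity is then the Jaccard similarity of these (hashed) shingle sets.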
Min-Hashing
Step 2:
- Goal: Compress large vectors into short signatures that preserve similarity.
- Method:
- Randomly permute rows (or simulate with hash functions).
- For each column (document), record first row where value = 1.
- Probability that min-hash values match = Jaccard similarity.
Signature Matrix
- Each column stores multiple min-hash values using different hash functions.
- Signature similarity ≈ set similarity.
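Simulating the random permutations with universal hash functions ( h(x) = (a·x + b) mod p ) gives a short min-hash sketch. This is an illustrative implementation under that standard simulation; the helper names and the choice of prime are my own, and the input is assumed to be a non-empty set of integer shingle IDs:

```python
import random

PRIME = 2_147_483_647  # a Mersenne prime larger than any 31-bit shingle ID

def minhash_signature(shingle_ids: set, num_hashes: int = 200, seed: int = 0) -> list:
    """Signature: for each hash function, the minimum hashed shingle ID.

    Each (a, b) pair simulates one random row permutation of the
    characteristic matrix.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hashes)]
    return [min((a * x + b) % PRIME for x in shingle_ids) for a, b in params]

def estimate_sim(sig1: list, sig2: list) -> float:
    """Fraction of matching positions ≈ Jaccard similarity of the sets."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

For example, two sets with true Jaccard similarity 1/3 should yield an estimate near 0.33, with accuracy improving as `num_hashes` grows.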
Locality-Sensitive Hashing (LSH)
Step 3:
- Reduce pairwise comparisons from (O(N^2)) to (O(N)).
- Partition signature matrix into b bands of r rows.
- Within each band, hash signatures.
- Candidate pairs: documents that hash to the same bucket in at least one band.
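The banding step can be sketched directly: slice each signature into b bands of r rows, bucket documents by band contents, and collect any pair sharing a bucket (a minimal sketch; here the "hash" of a band is just the tuple of its r values, which is sufficient for illustration):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures: dict, b: int, r: int) -> set:
    """signatures: doc_id -> signature of length b * r.

    Returns unordered candidate pairs: documents whose signatures
    agree on all r rows of at least one band.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # band contents as bucket key
            buckets[key].append(doc)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates
```

Only candidate pairs are then compared exactly, which is what avoids the full O(N^2) scan.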
Probability
( P_{candidate} = 1 - (1 - s^r)^b )
- ( s ): Jaccard similarity of the pair of documents.
Trade-offs
- Increasing r reduces false positives, increases false negatives.
- Increasing b has opposite effect.
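Evaluating ( P_{candidate} = 1 - (1 - s^r)^b ) for a few values of s makes the trade-off concrete: the curve is an S-shape that rises steeply near ( s \approx (1/b)^{1/r} ), so b and r jointly set the effective similarity threshold. The parameter values below are illustrative, not from the notes:

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a candidate."""
    return 1 - (1 - s ** r) ** b

# With b=20 bands of r=5 rows, dissimilar pairs are almost never
# candidates while highly similar pairs almost always are.
for s in (0.2, 0.5, 0.8):
    print(f"s={s}: P={p_candidate(s, r=5, b=20):.3f}")
```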
Summary
- Pipeline: Document → Shingles → Min-Hash → LSH
- Efficient method to approximate document similarity at large scale.