Goal
- Identify items or documents that are similar in high-dimensional space.
- Applications:
  - Duplicate detection
  - Plagiarism detection
  - Recommendation systems
Similarity Measures
Jaccard Similarity
- ( sim(A, B) = |A ∩ B| / |A ∪ B| )
- Distance = ( 1 - sim(A, B) )
- Used for sets and binary vectors.
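The definition above is a one-liner over Python sets (a minimal sketch; the function name `jaccard` is my own):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
```

The corresponding Jaccard distance is simply `1 - jaccard(a, b)`.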
Shingling
Step 1:
- Represent each document as a set of k-shingles (substrings of length k).
- Example: ( k=2 ), "abcab" → {ab, bc, ca} (the second occurrence of "ab" is a duplicate and appears once in the set).
- Hash shingles to integers for storage efficiency.
Representation
- Each document becomes a sparse vector in a high-dimensional space.
- Similarity: Jaccard similarity over shingles.
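Shingling plus hashing to integers can be sketched as follows (a minimal illustration; `crc32` is one arbitrary choice of hash for compactness, not prescribed by the notes):

```python
from zlib import crc32

def shingles(text: str, k: int = 2) -> set:
    """All k-character substrings (k-shingles) of text, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text: str, k: int = 2) -> set:
    """Hash each shingle to a 32-bit integer for storage efficiency."""
    return {crc32(s.encode()) for s in shingles(text, k)}

print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'}
```

Document similarity is then the Jaccard similarity of these (hashed) shingle sets.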
Min-Hashing
Step 2:
- Goal: Compress large vectors into short signatures that preserve similarity.
- Method:
- Randomly permute rows (or simulate with hash functions).
- For each column (document), record first row where value = 1.
- Probability that min-hash values match = Jaccard similarity.
Signature Matrix
- Each column stores multiple min-hash values using different hash functions.
- Signature similarity ≈ set similarity.
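Simulating the random permutations with universal hash functions ( h(x) = (a·x + b) mod p ) gives a short min-hash sketch. This is an illustrative implementation under that standard simulation; the helper names and the choice of prime are my own, and the input is assumed to be a non-empty set of integer shingle IDs:

```python
import random

PRIME = 2_147_483_647  # a Mersenne prime larger than any 31-bit shingle ID

def minhash_signature(shingle_ids: set, num_hashes: int = 200, seed: int = 0) -> list:
    """Signature: for each hash function, the minimum hashed shingle ID.

    Each (a, b) pair simulates one random row permutation of the
    characteristic matrix.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(num_hashes)]
    return [min((a * x + b) % PRIME for x in shingle_ids) for a, b in params]

def estimate_sim(sig1: list, sig2: list) -> float:
    """Fraction of matching positions ≈ Jaccard similarity of the sets."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

For example, two sets with true Jaccard similarity 1/3 should yield an estimate near 0.33, with accuracy improving as `num_hashes` grows.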
Locality-Sensitive Hashing (LSH)
Step 3:
- Reduce pairwise comparisons from (O(N^2)) to (O(N)).
- Partition signature matrix into b bands of r rows.
- Within each band, hash signatures.
- Candidate pairs: documents that hash to the same bucket in at least one band.
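The banding step can be sketched directly: slice each signature into b bands of r rows, bucket documents by band contents, and collect any pair sharing a bucket (a minimal sketch; here the "hash" of a band is just the tuple of its r values, which is sufficient for illustration):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures: dict, b: int, r: int) -> set:
    """signatures: doc_id -> signature of length b * r.

    Returns unordered candidate pairs: documents whose signatures
    agree on all r rows of at least one band.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # band contents as bucket key
            buckets[key].append(doc)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates
```

Only candidate pairs are then compared exactly, which is what avoids the full O(N^2) scan.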
Probability
( P_{candidate} = 1 - (1 - s^r)^b )
- ( s ): Jaccard similarity of the pair of documents.
Trade-offs
- Increasing r reduces false positives, increases false negatives.
- Increasing b has opposite effect.
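Evaluating ( P_{candidate} = 1 - (1 - s^r)^b ) for a few values of s makes the trade-off concrete: the curve is an S-shape that rises steeply near ( s \approx (1/b)^{1/r} ), so b and r jointly set the effective similarity threshold. The parameter values below are illustrative, not from the notes:

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a candidate."""
    return 1 - (1 - s ** r) ** b

# With b=20 bands of r=5 rows, dissimilar pairs are almost never
# candidates while highly similar pairs almost always are.
for s in (0.2, 0.5, 0.8):
    print(f"s={s}: P={p_candidate(s, r=5, b=20):.3f}")
```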
Summary
- Pipeline: Document → Shingles → Min-Hash → LSH
- Efficient method to approximate document similarity at large scale.