Goal

  • Identify items or documents that are similar in high-dimensional space.
  • Applications:
    • Duplicate detection
    • Plagiarism detection
    • Recommendation systems

Similarity Measures

Jaccard Similarity

  • ( sim(A, B) = |A ∩ B| / |A ∪ B| )
  • Distance = ( 1 - sim(A, B) )
  • Used for sets and binary vectors.
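Jaccard similarity over sets is a one-liner in Python (a minimal sketch; returning 1.0 for two empty sets is one convention, not something the notes specify):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

sim = jaccard({1, 2, 3}, {2, 3, 4})  # 2 shared / 4 total = 0.5
dist = 1 - sim                       # Jaccard distance
```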

Shingling

Step 1:

  • Represent each document as a set of k-shingles (substrings of length k).
  • Example: ( k=2 ), “abcab” → {ab, bc, ca}
  • Hash shingles to integers for storage efficiency.
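The shingling step above can be sketched as follows (truncating MD5 to 32 bits is one illustrative choice of hash, not prescribed by the notes):

```python
import hashlib

def shingles(text: str, k: int = 2) -> set:
    """Set of k-shingles: all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text: str, k: int = 2) -> set:
    """Hash each shingle to a 32-bit integer for compact storage."""
    return {int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")
            for s in shingles(text, k)}

shingles("abcab", k=2)  # {'ab', 'bc', 'ca'} — the duplicate 'ab' collapses
```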

Representation

  • Each document becomes a sparse vector in a high-dimensional space.
  • Similarity: Jaccard similarity over shingles.

Min-Hashing

Step 2:

  • Goal: Compress large vectors into short signatures that preserve similarity.
  • Method:
    1. Randomly permute the rows (or simulate permutations with hash functions).
    2. For each column (document), record the index of the first row, in permuted order, with value = 1.
  • Probability that min-hash values match = Jaccard similarity.

Signature Matrix

  • Each column stores multiple min-hash values using different hash functions.
  • Signature similarity ≈ set similarity.
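The two steps above can be sketched as follows. Here the universal hashes h(x) = (a·x + b) mod p simulate random row permutations, and the code assumes shingles have already been hashed to integers; the function names are illustrative:

```python
import random

def minhash_signature(shingle_set, num_hashes=100, seed=0):
    """Simulate num_hashes row permutations with hashes h(x) = (a*x + b) mod p;
    for each, record the minimum hashed value over the set's elements."""
    rng = random.Random(seed)
    p = 2_147_483_647  # large prime (2^31 - 1)
    coeffs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(num_hashes)]
    return [min((a * x + b) % p for x in shingle_set) for a, b in coeffs]

def estimate_similarity(sig1, sig2):
    """Fraction of matching signature positions ≈ Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

sig = minhash_signature({3, 17, 42, 99})  # one column of the signature matrix
```

With enough hash functions the matching fraction concentrates around the true Jaccard similarity, since each position matches with exactly that probability.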

Locality-Sensitive Hashing (LSH)

Step 3:

  • Reduce pairwise comparisons from (O(N^2)) to roughly (O(N)).
  • Partition the signature matrix into b bands of r rows each.
  • Within each band, hash each document's r values into buckets.
  • Candidate pairs: documents that fall into the same bucket in at least one band.
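A minimal sketch of the banding step, assuming each signature has b·r values; using the band's contents directly as the bucket key stands in for a separate hash function per band:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """signatures: {doc_id: list of b*r min-hash values}.
    Bucket each band (r consecutive values) of each signature;
    documents sharing a bucket in any band become candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # band contents as bucket key
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates
```

Only candidate pairs are then verified with their full signatures (or exact sets), which is what avoids the all-pairs comparison.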

Probability

  • ( s ): Jaccard similarity of a pair of documents.
  • Probability the pair becomes a candidate in at least one band: ( 1 - (1 - s^r)^b ).
  • The curve is S-shaped, with a sharp threshold near ( (1/b)^{1/r} ).
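For a pair with Jaccard similarity s, the probability of becoming a candidate under b bands of r rows is 1 - (1 - s^r)^b; a quick sketch:

```python
def candidate_probability(s, b, r):
    """Probability that a pair with Jaccard similarity s shares a bucket
    in at least one of b bands of r rows: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

# With b=20 bands of r=5 rows, the threshold sits near (1/20)**(1/5) ≈ 0.55:
candidate_probability(0.3, 20, 5)  # low  — dissimilar pairs rarely collide
candidate_probability(0.8, 20, 5)  # high — similar pairs almost surely do
```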

Trade-offs

  • Increasing r reduces false positives but increases false negatives.
  • Increasing b has the opposite effect.

Summary

  • Pipeline: Document → Shingles → Min-Hash → LSH
  • Efficient method to approximate document similarity at large scale.