Motivation

  • A-Priori reduces the number of candidate pairs that must be counted, but each pass rereads the full dataset from disk, which is costly.
  • Improve efficiency by using main memory better, in particular through hashing.

Hashing

  • Hash function: maps each pair to a bucket instead of storing the pair explicitly.
  • Reduces memory by tracking one count per bucket rather than one count per pair.
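
A minimal sketch of bucket counting, assuming items are integer IDs. The hash function bucket_of and the sample baskets are illustrative choices, not part of the original notes.

```python
from itertools import combinations

def bucket_of(pair, num_buckets):
    """Hypothetical hash: map a pair (i, j) of integer item IDs to a bucket index."""
    i, j = pair
    return (i * 31 + j) % num_buckets

def count_buckets(baskets, num_buckets):
    """Add 1 to a bucket for every pair occurrence that hashes to it."""
    counts = [0] * num_buckets
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[bucket_of(pair, num_buckets)] += 1
    return counts

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
bucket_counts = count_buckets(baskets, num_buckets=5)
print(bucket_counts)  # [3, 1, 0, 3, 1]
```

Here bucket 0 holds the combined count of pairs (2, 3) and (1, 4): a bucket count is only an upper bound on the count of any single pair in it.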

PCY (Park-Chen-Yu) Algorithm

Idea

  • Exploit unused memory during Pass 1 of A-Priori.
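  • Each pair of items in a basket is hashed to a bucket, and that bucket's count is incremented.
  • A bucket is “frequent” only if its count reaches the support threshold; a pair that hashes to an infrequent bucket cannot itself be frequent.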

Process

  1. Pass 1:
    • Count individual items.
    • Hash all pairs into buckets (store counts).
  2. Between passes:
    • Convert bucket counts into a bit-vector (1 if the bucket is frequent, 0 otherwise).
  3. Pass 2:
    • Only count a pair if both conditions hold:
      • Both items are frequent.
      • The pair hashes to a frequent bucket.
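
The three steps above can be sketched as follows. This is an in-memory illustration, not a disk-based implementation: it assumes integer item IDs, and hash_pair and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def hash_pair(pair, num_buckets):
    """Hypothetical hash function for a pair of integer item IDs."""
    i, j = pair
    return (i * 31 + j) % num_buckets

def pcy(baskets, support, num_buckets):
    # Pass 1: count individual items and hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash_pair(pair, num_buckets)] += 1

    # Between passes: keep the frequent items and one bit per bucket.
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[hash_pair(pair, num_buckets)]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
frequent_pairs = pcy(baskets, support=2, num_buckets=5)
print(frequent_pairs)  # {(1, 2): 3, (2, 3): 2}
```

In this run the pair (1, 3) is never counted in Pass 2 because its bucket is infrequent, even though both of its items are frequent.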

Refinements

Multistage

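  • Adds an extra pass between the two PCY passes, for three in total: pairs that pass the PCY test are rehashed into a second table with an independent hash function.
  • The final pass counts only pairs whose buckets are frequent in both tables, which further eliminates false positives (infrequent pairs that landed in a frequent bucket).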
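
A sketch of the three-pass structure, under the same assumptions as a plain PCY illustration (integer item IDs, data held in memory). The hash functions h1 and h2 and the baskets are hypothetical, chosen so that the second table removes one false positive.

```python
from collections import Counter
from itertools import combinations

def h1(pair, n):
    i, j = pair
    return (i * 31 + j) % n

def h2(pair, n):
    i, j = pair
    return (i * 7 + j * 13) % n

def multistage(baskets, support, n):
    baskets = [sorted(set(b)) for b in baskets]

    # Pass 1: item counts plus the first hash table, as in PCY.
    item_counts = Counter()
    table1 = [0] * n
    for items in baskets:
        item_counts.update(items)
        for pair in combinations(items, 2):
            table1[h1(pair, n)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap1 = [c >= support for c in table1]

    def passes_stage1(pair):
        return (pair[0] in frequent_items and pair[1] in frequent_items
                and bitmap1[h1(pair, n)])

    # Pass 2: rehash only the pairs that survive the PCY test.
    table2 = [0] * n
    for items in baskets:
        for pair in combinations(items, 2):
            if passes_stage1(pair):
                table2[h2(pair, n)] += 1
    bitmap2 = [c >= support for c in table2]

    # Pass 3: count pairs that are frequent-bucket pairs in both tables.
    counted = Counter()
    for items in baskets:
        for pair in combinations(items, 2):
            if passes_stage1(pair) and bitmap2[h2(pair, n)]:
                counted[pair] += 1
    frequent = {p: c for p, c in counted.items() if c >= support}
    return counted, frequent

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4], [4]]
counted, frequent = multistage(baskets, support=2, n=5)
print(sorted(counted))  # [(1, 2), (2, 3)]
```

With these inputs, pair (1, 4) shares a frequent first-table bucket with (2, 3), so PCY alone would count it in its final pass; the second table filters it out.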

Multihash

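  • Uses multiple hash tables, each with its own hash function, simultaneously during the first pass; the tables share the memory a single PCY table would use.
  • Stays at two passes while filtering candidates much as Multistage does, provided the smaller tables still leave most buckets infrequent.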
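
A sketch of the two-pass variant with several tables filled during the first pass. The hash functions, table size, and baskets are illustrative assumptions, and the data is held in memory for simplicity.

```python
from collections import Counter
from itertools import combinations

def h1(pair, n):
    i, j = pair
    return (i * 31 + j) % n

def h2(pair, n):
    i, j = pair
    return (i * 7 + j * 13) % n

def multihash(baskets, support, hash_fns, n):
    # Pass 1: count items and fill one table per hash function.
    item_counts = Counter()
    tables = [[0] * n for _ in hash_fns]
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            for table, h in zip(tables, hash_fns):
                table[h(pair, n)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmaps = [[c >= support for c in table] for table in tables]

    # Pass 2: a pair is counted only if its bucket is frequent in every table.
    counted = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for pair in combinations(items, 2):
            if all(bm[h(pair, n)] for bm, h in zip(bitmaps, hash_fns)):
                counted[pair] += 1
    frequent = {p: c for p, c in counted.items() if c >= support}
    return counted, frequent

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4], [4]]
counted, frequent = multihash(baskets, support=2, hash_fns=[h1, h2], n=5)
print(frequent)  # {(1, 2): 3, (2, 3): 2}
```

The trade-off is that each table gets only a share of the memory, so bucket counts rise; the filter helps only while most buckets stay below the threshold.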

Advanced Methods

Random Sampling

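  • Run frequent itemset mining in main memory on a random sample of the baskets.
  • Scale the support threshold in proportion to the sample size: for a sample that is a fraction p of the data and a full-data threshold s, use p·s. The result is approximate and can contain both false positives and false negatives.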
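
A small sketch of sampling with a scaled threshold, restricted to pairs for brevity. The function names, the seed, and the optional slack factor (lowering the threshold slightly to miss fewer frequent itemsets) are assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def sample_baskets(baskets, fraction, seed=0):
    """Keep each basket independently with probability `fraction`."""
    rng = random.Random(seed)
    return [b for b in baskets if rng.random() < fraction]

def scaled_threshold(support, fraction, slack=1.0):
    """Scale the full-data threshold to the sample; slack < 1 lowers it further."""
    return support * fraction * slack

def frequent_pairs(baskets, threshold):
    """Brute-force pair counting, feasible because the sample fits in memory."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return {p: c for p, c in counts.items() if c >= threshold}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]] * 25   # 100 baskets
sample = sample_baskets(baskets, fraction=0.2, seed=42)
threshold = scaled_threshold(support=50, fraction=0.2)  # about 10
approx = frequent_pairs(sample, threshold)              # approximate answer
exact = frequent_pairs(baskets, 50)                     # full-data answer
```

Comparing approx with exact on a given run shows whether the sample introduced false positives or missed a frequent pair; the outcome depends on the random draw.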

SON Algorithm (Savasere, Omiecinski, Navathe)

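  • Divide the data into chunks.
  • Find the itemsets that are frequent within each chunk, using a support threshold scaled to the chunk size.
  • Take the union of these as candidate itemsets, then count the candidates over the full dataset in a second pass to keep only the truly frequent ones.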
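
A sketch of SON restricted to pairs. SON is usually described with a second pass that counts the combined candidates over the full data, and the sketch includes that pass. The chunking scheme and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return counts

def son_frequent_pairs(baskets, support, num_chunks):
    if not baskets:
        return {}
    chunk_size = -(-len(baskets) // num_chunks)  # ceiling division
    chunks = [baskets[i:i + chunk_size]
              for i in range(0, len(baskets), chunk_size)]

    # Pass 1: each chunk is mined independently (this part can run in
    # parallel) with the threshold scaled to the chunk's share of the data.
    candidates = set()
    for chunk in chunks:
        local_threshold = support * len(chunk) / len(baskets)
        local_counts = count_pairs(chunk)
        candidates |= {p for p, c in local_counts.items()
                       if c >= local_threshold}

    # Pass 2: count only the candidates over the whole dataset.
    totals = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            if pair in candidates:
                totals[pair] += 1
    return {p: c for p, c in totals.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
result = son_frequent_pairs(baskets, support=2, num_chunks=2)
print(result)  # {(1, 2): 3, (2, 3): 2}
```

Because a pair that is frequent overall must reach the scaled threshold in at least one chunk, the candidate set has no false negatives, and the second pass removes the false positives.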

Summary

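  • PCY improves A-Priori by using hashing to cut the number of candidate pairs counted in Pass 2.
  • Multistage and Multihash reduce false positives (infrequent pairs that survive the bucket filter).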
  • SON allows parallel and distributed mining.