Motivation
- A-Priori reduces the number of candidate pairs that must be counted, but every pass rereads the data from disk, which is costly.
- Improve efficiency using better memory utilization and hashing.
Hashing
- Example hash function: $h(i, j) = (i + j) \bmod N$, where N is the number of buckets.
- Maps pairs to buckets instead of storing each pair explicitly.
- Reduces memory use by keeping one count per bucket rather than one count per pair.
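The bucket counting described above can be sketched as follows. This is a minimal illustration, assuming items are integers and baskets fit in a Python list; the names `pair_bucket` and `count_buckets` are hypothetical.

```python
from itertools import combinations

def pair_bucket(i, j, num_buckets):
    """Hash an item pair to a bucket using h(i, j) = (i + j) mod N."""
    return (i + j) % num_buckets

def count_buckets(baskets, num_buckets):
    """Return one count per bucket, summed over every pair in every basket."""
    counts = [0] * num_buckets
    for basket in baskets:
        # sorted(set(...)) removes duplicate items and fixes the pair order
        for i, j in combinations(sorted(set(basket)), 2):
            counts[pair_bucket(i, j, num_buckets)] += 1
    return counts

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(count_buckets(baskets, 5))  # prints [3, 1, 0, 3, 1]
```

Memory use depends only on the number of buckets, not on the number of distinct pairs, which is the point of hashing here.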
PCY (Park-Chen-Yu) Algorithm
Idea
- Exploit unused memory during Pass 1 of A-Priori.
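- Each pair of items is hashed to a bucket, and that bucket's count is incremented.
- Only buckets whose count reaches the support threshold are “frequent”; a pair that hashes to an infrequent bucket cannot itself be frequent.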
Process
- Pass 1:
  - Count individual items.
  - Hash every pair into a bucket and increment that bucket's count.
- Between passes:
  - Convert the bucket counts into a bit-vector (1 if the bucket is frequent, 0 otherwise).
- Pass 2:
  - Count a pair only if:
    - Both items are frequent.
    - The pair hashes to a frequent bucket.
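The two passes can be sketched in Python as below. This is an in-memory illustration rather than a disk-based implementation; it assumes integer items, the example hash (i + j) mod N, and that "frequent" means a count of at least the support threshold. The function name `pcy_frequent_pairs` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def pcy_frequent_pairs(baskets, support, num_buckets):
    """Return {pair: count} for pairs whose count is at least `support`."""
    # Pass 1: count single items and hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[(i + j) % num_buckets] += 1

    # Between passes: keep the frequent items and a bit-vector of buckets.
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]

    # Pass 2: count only pairs of frequent items in frequent buckets.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if bitmap[(i + j) % num_buckets]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(pcy_frequent_pairs(baskets, support=2, num_buckets=5))
# prints {(1, 2): 3, (2, 3): 2}
```

In this example the pair (1, 3) consists of two frequent items but hashes to an infrequent bucket, so it is never counted in Pass 2.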
Refinements
Multistage
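- Adds an extra pass between PCY's two passes, making three in total: pairs that survive the Pass 1 filter are rehashed with a second, independent hash function.
- The final pass counts only pairs that hash to a frequent bucket in both tables, which further eliminates false positives.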
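A three-pass sketch of Multistage follows. It assumes integer items and an in-memory basket list; the second hash function (i * j) mod N is an arbitrary choice made only for illustration, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def multistage_frequent_pairs(baskets, support, n1, n2):
    """Three-pass Multistage sketch for frequent pairs of integer items."""
    def h1(i, j):
        return (i + j) % n1

    def h2(i, j):  # hypothetical second hash, chosen only for illustration
        return (i * j) % n2

    # Pass 1: count items and hash every pair with h1.
    item_counts = Counter()
    table1 = [0] * n1
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            table1[h1(i, j)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap1 = [c >= support for c in table1]

    # Pass 2: rehash only the pairs that survive the Pass 1 filter.
    table2 = [0] * n2
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if bitmap1[h1(i, j)]:
                table2[h2(i, j)] += 1
    bitmap2 = [c >= support for c in table2]

    # Pass 3: count pairs that pass both bucket filters.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if bitmap1[h1(i, j)] and bitmap2[h2(i, j)]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(multistage_frequent_pairs(baskets, support=2, n1=5, n2=5))
```

Because the second table sees only pairs that already passed the first filter, its buckets are less crowded, so fewer infrequent pairs reach the counting pass.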
Multihash
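- Uses multiple hash tables, each with its own hash function, simultaneously during the first pass.
- Needs only two passes, like PCY, and filters about as well as Multistage as long as the smaller tables still leave most buckets infrequent.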
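Multihash can be sketched as below, assuming integer items and two tables of equal size. Both hash functions are illustrative choices, not prescribed ones, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

def multihash_frequent_pairs(baskets, support, num_buckets):
    """Two-pass Multihash sketch using two hash tables in Pass 1."""
    # Illustrative hash functions; any independent pair would do.
    hashes = [
        lambda i, j: (i + j) % num_buckets,
        lambda i, j: (i * j) % num_buckets,
    ]

    # Pass 1: count items and hash every pair into every table.
    item_counts = Counter()
    tables = [[0] * num_buckets for _ in hashes]
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            for table, h in zip(tables, hashes):
                table[h(i, j)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmaps = [[c >= support for c in table] for table in tables]

    # Pass 2: a pair is counted only if its bucket is frequent in all tables.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if all(bm[h(i, j)] for bm, h in zip(bitmaps, hashes)):
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(multihash_frequent_pairs(baskets, support=2, num_buckets=3))
```

The tables share the memory that PCY would give to a single table, so each has fewer buckets; the approach pays off only while most buckets in each table stay below the threshold.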
Advanced Methods
Random Sampling
- Run frequent itemset mining on a random subset.
- Scale the support threshold by the sampling fraction: for a sample containing a fraction p of the baskets, use p * s instead of s.
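A small sketch of sampling with a scaled threshold is shown below. The helper `frequent_pairs` is a naive in-memory counter standing in for any frequent-itemset miner; both function names and the `seed` parameter are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, support):
    """Naive stand-in miner: count every pair and keep those >= support."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: c for pair, c in counts.items() if c >= support}

def sample_frequent_pairs(baskets, support, fraction, seed=0):
    """Mine a random sample, scaling the threshold by the sampling fraction."""
    rng = random.Random(seed)
    # Keep each basket independently with probability `fraction`.
    sample = [b for b in baskets if rng.random() < fraction]
    return frequent_pairs(sample, support * fraction)

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(sample_frequent_pairs(baskets, support=2, fraction=0.5))
```

The result is approximate: a pair can look frequent in the sample without being frequent overall, and a truly frequent pair can be missed.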
SON Algorithm (Savasere, Omiecinski, Navathe)
- Divide data into chunks.
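- Find frequent itemsets within each chunk, using a support threshold scaled down to the chunk's share of the data.
- Take the union of the per-chunk results as candidate itemsets, then count the candidates over the full data in a second pass to discard false positives.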
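The chunked approach can be sketched for pairs as follows. This is a single-process illustration that assumes integer items and an in-memory basket list; it scales the threshold by each chunk's share of the data and includes the verification pass over the full data that SON uses to discard candidates that are only locally frequent. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import ceil

def local_frequent_pairs(chunk, support):
    """Naive in-memory miner applied to one chunk."""
    counts = Counter()
    for basket in chunk:
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair for pair, c in counts.items() if c >= support}

def son_frequent_pairs(baskets, support, num_chunks):
    """SON sketch: per-chunk candidates, then one full verification pass."""
    chunk_size = max(1, ceil(len(baskets) / num_chunks))
    chunks = [baskets[k:k + chunk_size]
              for k in range(0, len(baskets), chunk_size)]

    # Pass 1: mine each chunk with a threshold scaled to the chunk's size.
    candidates = set()
    for chunk in chunks:
        local_support = support * len(chunk) / len(baskets)
        candidates |= local_frequent_pairs(chunk, local_support)

    # Pass 2: count only the candidates over the full data set.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            if pair in candidates:
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(son_frequent_pairs(baskets, support=2, num_chunks=2))
# prints {(1, 2): 3, (2, 3): 2}
```

Each chunk in Pass 1 is processed independently of the others, which is why the loop over chunks could be distributed across workers.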
Summary
- PCY improves A-Priori using hashing.
- Multistage & Multihash reduce the false positives (infrequent pairs) that reach the final counting pass.
- SON allows parallel and distributed mining.