Lecture Notes

Lecture 3 - Improving Frequent Itemset Mining

Apr 14, 2026 · 1 min read

Motivation

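A-Priori cuts down the number of candidate pairs that must be counted, but every pass is a full scan of the data on disk, which is costly.
The methods below improve efficiency by putting otherwise idle main memory to work and by using hashing to prune candidate pairs.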

Hashing

Hash function: $h(i, j) = (i + j) \bmod N$, where $N$ is the number of buckets.
Maps each pair of items to a bucket instead of storing the pair explicitly.
Reduces memory use: only one count per bucket is kept, not one count per pair.
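
The bucket counting described above can be sketched as follows. This is a minimal illustration, assuming items are integers and baskets are lists of items; the names `pair_bucket` and `count_buckets` are made up for this example.

```python
from itertools import combinations

def pair_bucket(i, j, num_buckets):
    """Bucket index for the pair (i, j) using h(i, j) = (i + j) mod N."""
    return (i + j) % num_buckets

def count_buckets(baskets, num_buckets):
    """Return one counter per bucket; the pairs themselves are never stored."""
    counts = [0] * num_buckets
    for basket in baskets:
        # Deduplicate and sort so each pair is generated once per basket.
        for i, j in combinations(sorted(set(basket)), 2):
            counts[pair_bucket(i, j, num_buckets)] += 1
    return counts

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 4]]
print(count_buckets(baskets, num_buckets=5))  # [3, 0, 0, 2, 1]
```

In this toy data the pairs (2, 3) and (1, 4) both land in bucket 0, so a bucket count is only an upper bound on the count of any single pair in that bucket.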

PCY (Park-Chen-Yu) Algorithm

Idea

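Exploit the main memory that is left unused during Pass 1 of A-Priori, which only counts individual items.
Each pair of items in a basket is hashed into a bucket, and that bucket's count is incremented.
A bucket is “frequent” only if its count reaches the support threshold; a pair that hashes to an infrequent bucket cannot itself be frequent.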

Process

Pass 1:
- Count individual items.
- Hash all pairs into buckets (store counts).

Between passes:
- Convert bucket counts into a bit-vector (1 if the bucket is frequent).

Pass 2:
- Only count a pair if:
  - Both items are frequent.
  - The pair hashes to a frequent bucket.
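
The two passes above might look like this in code. It is a sketch under stated assumptions, not a reference implementation: items are integers, the data fits in a Python list (a real run would stream baskets from disk), "frequent" means a count of at least `support`, and the function name `pcy_frequent_pairs` is invented for this example.

```python
from collections import Counter
from itertools import combinations

def pcy_frequent_pairs(baskets, support, num_buckets):
    """Return {pair: count} for pairs with count >= support (assumed threshold rule)."""
    # Pass 1: count individual items and hash every pair into a bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[(i + j) % num_buckets] += 1

    # Between passes: keep the frequent items and one bit per bucket.
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap = [c >= support for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for i, j in combinations(items, 2):
            if bitmap[(i + j) % num_buckets]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(pcy_frequent_pairs(baskets, support=2, num_buckets=5))
```

Here the pair (1, 3) is never counted in Pass 2 even though both of its items are frequent, because its bucket count stays below the threshold.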

Refinements

Multistage

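Adds an extra pass between the two PCY passes: pairs that survive the first filter (both items frequent, frequent bucket) are rehashed into a second table using a second, independent hash function.
The final (third) pass counts a pair only if it also hashes to a frequent bucket in the second table, which further eliminates false positives (infrequent pairs that happened to land in a frequent bucket).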
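
A three-pass sketch of the Multistage idea, under the same assumptions as before (integer items, in-memory baskets, "frequent" meaning a count of at least `support`). The second hash function `h2` is a hypothetical choice made only for illustration; the notes do not specify one.

```python
from collections import Counter
from itertools import combinations

def h1(i, j, n):
    return (i + j) % n

def h2(i, j, n):
    # Hypothetical second hash function, chosen only for this example.
    return (i * j) % n

def multistage_frequent_pairs(baskets, support, n1, n2):
    # Pass 1: count items and fill the first hash table.
    item_counts = Counter()
    table1 = [0] * n1
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            table1[h1(i, j, n1)] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmap1 = [c >= support for c in table1]

    def survivors(basket):
        """Pairs of frequent items that hash to a frequent bucket under h1."""
        items = sorted(i for i in set(basket) if i in frequent_items)
        return [(i, j) for i, j in combinations(items, 2)
                if bitmap1[h1(i, j, n1)]]

    # Pass 2: rehash only the surviving pairs with the second hash function.
    table2 = [0] * n2
    for basket in baskets:
        for i, j in survivors(basket):
            table2[h2(i, j, n2)] += 1
    bitmap2 = [c >= support for c in table2]

    # Pass 3: count pairs that pass both bucket filters.
    pair_counts = Counter()
    for basket in baskets:
        for i, j in survivors(basket):
            if bitmap2[h2(i, j, n2)]:
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}
```

The second table sees far fewer pairs than the first, so its buckets are less crowded and filter more sharply, at the price of one more scan of the data.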

Multihash

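Uses multiple hash tables, each with its own hash function, simultaneously during the first pass; the tables share the available memory.
Needs only two passes, like PCY, with a filtering effect similar to Multistage, provided each smaller table still keeps most bucket counts below the threshold.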
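
A two-pass sketch of Multihash with two tables, again assuming integer items, in-memory baskets, and a count of at least `support` as the frequency rule. Splitting `total_buckets` evenly and using `(i * j) % n` as the second hash are illustrative assumptions, not something the notes prescribe.

```python
from collections import Counter
from itertools import combinations

def multihash_frequent_pairs(baskets, support, total_buckets):
    # The two tables share the memory budget, so each gets half the buckets.
    n = total_buckets // 2
    hashes = [lambda i, j: (i + j) % n,
              lambda i, j: (i * j) % n]  # hypothetical second hash

    # Pass 1: count items and update every table for every pair.
    item_counts = Counter()
    tables = [[0] * n for _ in hashes]
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            for table, h in zip(tables, hashes):
                table[h(i, j)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= support}
    bitmaps = [[c >= support for c in table] for table in tables]

    # Pass 2: a pair must hash to a frequent bucket in every table.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in set(basket) if i in frequent_items)
        for i, j in combinations(items, 2):
            if all(bm[h(i, j)] for bm, h in zip(bitmaps, hashes)):
                pair_counts[(i, j)] += 1
    return {p: c for p, c in pair_counts.items() if c >= support}
```

Compared with the Multistage sketch, this saves a pass but halves the size of each table, so more pairs collide per bucket.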

Advanced Methods

Random Sampling

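Run frequent itemset mining on a random subset of the baskets that fits in main memory.
Adjust the support threshold proportionally: for a sample containing a fraction $p$ of the baskets and an overall threshold $s$, use $p \cdot s$.
The result is approximate, since a sample can produce both false positives and false negatives.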
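
As a small illustration of the sampling step and the scaled threshold, assuming each basket is kept independently with probability `fraction`. The helper names and the fixed seed are choices made for this example only.

```python
import random

def sample_baskets(baskets, fraction, seed=0):
    """Keep each basket independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [b for b in baskets if rng.random() < fraction]

def scaled_support(support, fraction):
    """Support threshold to use on a sample holding `fraction` of the baskets."""
    return support * fraction

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]] * 250  # 1000 toy baskets
sample = sample_baskets(baskets, fraction=0.1)
threshold = scaled_support(support=500, fraction=0.1)
print(len(sample), threshold)
```

Any in-memory miner can then be run on `sample` with `threshold` in place of the full data and the original threshold.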

SON Algorithm (Savasere, Omiecinski, Navathe)

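Divide the data into chunks.
Find the frequent itemsets within each chunk, using a support threshold scaled down to the chunk's share of the data.
Combine the results from all chunks into one set of candidate itemsets.
Count the candidates over the full dataset in a second pass and keep those that meet the original threshold.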
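
A sketch of SON restricted to pairs, including the second pass over the full data that verifies the combined candidates. It assumes in-memory baskets, a per-chunk threshold scaled by the chunk's share of the data, and a count of at least `support` as the frequency rule; `local_frequent_pairs` is a deliberately naive stand-in for whatever in-memory miner is run on each chunk.

```python
from collections import Counter
from itertools import combinations
from math import ceil

def local_frequent_pairs(chunk, threshold):
    """Naive in-memory miner used on one chunk (stand-in for A-Priori or PCY)."""
    counts = Counter()
    for basket in chunk:
        counts.update(combinations(sorted(set(basket)), 2))
    return {p for p, c in counts.items() if c >= threshold}

def son_frequent_pairs(baskets, support, num_chunks):
    if not baskets:
        return {}
    size = ceil(len(baskets) / num_chunks)
    chunks = [baskets[k:k + size] for k in range(0, len(baskets), size)]

    # Phase 1: mine each chunk with a proportionally scaled threshold
    # and take the union of the locally frequent pairs as candidates.
    candidates = set()
    for chunk in chunks:
        local_threshold = support * len(chunk) / len(baskets)
        candidates |= local_frequent_pairs(chunk, local_threshold)

    # Phase 2: count every candidate over the full data, keep the true ones.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for i, j in candidates:
            if i in items and j in items:
                counts[(i, j)] += 1
    return {p: c for p, c in counts.items() if c >= support}

baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
print(son_frequent_pairs(baskets, support=2, num_chunks=2))
```

Each chunk in Phase 1 is independent of the others, which is what makes the algorithm a natural fit for parallel or distributed execution. A pair that is frequent overall must reach the scaled threshold in at least one chunk, so Phase 1 misses nothing and Phase 2 removes the false positives.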

Summary

PCY improves A-Priori using hashing.
Multistage & Multihash reduce false positives.
SON allows parallel and distributed mining.
