Overview
- Group similar data points into clusters.
- Goal: maximize intra-cluster similarity, minimize inter-cluster similarity.
Types of Clustering
- Hierarchical (Agglomerative / Divisive)
  - Bottom-up (agglomerative): merge the two closest clusters iteratively.
  - Top-down (divisive): start with all data in one cluster → split recursively.
- Point Assignment (e.g., k-means)
  - Maintain a set of centroids.
  - Assign each point to its nearest centroid.
Distance Measures
- Euclidean
- Cosine
- Jaccard
- Edit, etc.
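The first three measures above can be sketched in a few lines of Python (function names are mine, for illustration):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def jaccard_distance(s, t):
    """1 minus |intersection| / |union| of two sets."""
    s, t = set(s), set(t)
    return 1 - len(s & t) / len(s | t)
```

Note that Jaccard operates on sets, not vectors, which is why it suits documents represented as shingle sets.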
Hierarchical Clustering
Key Questions
- How to represent a cluster?
  - Centroid: mean of the points (Euclidean spaces).
  - Clustroid: an actual data point closest to the others, e.g., minimizing the sum of distances (non-Euclidean spaces).
- How to measure nearness of two clusters?
  - Distance between their centroids (or clustroids).
  - Minimum or average distance between points of the two clusters.
- When to stop merging?
  - When a known number of clusters k is reached.
  - When the distance or diameter exceeds a threshold.
  - Never: merge until one cluster remains, yielding a dendrogram.
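A minimal bottom-up (agglomerative) sketch in Python, using centroid distance as the nearness measure and a known k as the stopping rule (this naive version rescans all pairs each merge; names are mine):

```python
import math

def centroid(cluster):
    """Mean of the points in a cluster (Euclidean representation)."""
    n, dim = len(cluster), len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / n for d in range(dim))

def agglomerative(points, k):
    """Merge the two closest clusters (by centroid distance) until k remain."""
    clusters = [[p] for p in points]          # start: every point is a cluster
    while len(clusters) > k:
        best = None                           # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Scanning all pairs at every merge is what makes the naive algorithm cubic; a priority queue of pair distances brings it down to O(n² log n).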
k-Means Algorithm
Steps
- Choose the number of clusters k.
- Initialize k centroids (e.g., k randomly chosen points).
- Assignment: assign each point to its nearest centroid.
- Update: recompute each centroid as the mean of its assigned points.
- Iterate until convergence (centroids stop moving).
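The steps above can be sketched as follows (a plain Lloyd-style iteration; function name and defaults are mine):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Alternate assignment and update steps until centroids stabilize."""
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = rng.sample(points, k)         # init: k random data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = [tuple(sum(q[d] for q in cl) / len(cl) for d in range(dim))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                  # convergence: nothing moved
            break
        centroids = new
    return centroids, clusters
```

Results depend on the initialization, so in practice k-means is run several times with different seeds and the lowest-cost result is kept.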
Choosing k
- Elbow method: plot the average distance to the centroid against k; the cost always decreases as k grows, so pick the k where it stops decreasing rapidly (the "elbow" of the curve).
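One simple way to automate the elbow choice, given the (k, average cost) pairs from repeated k-means runs, is to pick the k with the largest drop in marginal improvement (this second-difference heuristic is one common choice, not the only one):

```python
def pick_elbow(costs):
    """costs: list of (k, avg_cost) pairs sorted by k.
    Return the k where improvement slows the most
    (largest second difference of the cost curve)."""
    best_k, best_drop = None, float("-inf")
    for i in range(1, len(costs) - 1):
        prev_c, cur_c, next_c = costs[i - 1][1], costs[i][1], costs[i + 1][1]
        drop = (prev_c - cur_c) - (cur_c - next_c)
        if drop > best_drop:
            best_k, best_drop = costs[i][0], drop
    return best_k
```

For a curve like 100 → 40 → 35 → 33 the improvement collapses after k = 2, so that is the elbow.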
BFR Algorithm (Bradley-Fayyad-Reina)
- Extension of k-means for very large data.
- Assumes clusters are Gaussian (normal distribution).
- Uses three point sets:
  - DS (Discard set): points close enough to a centroid to be summarized and discarded.
  - CS (Compression set): groups of points close to one another but not near any centroid; kept only as summaries.
  - RS (Retained set): isolated points (outliers) held in memory until they can be clustered.
Process
- Read data in chunks from disk.
- Assign points to closest DS clusters.
- Cluster remaining points → new CS or RS.
- Update cluster summaries (N, SUM, SUMSQ).
- Merge CS and RS where appropriate.
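The (N, SUM, SUMSQ) summary is what makes BFR memory-efficient: the centroid and per-dimension variance can be recovered from it at any time, and two summaries merge by simple addition. A sketch (class name is mine):

```python
class ClusterSummary:
    """BFR summary of a DS/CS cluster: count, per-dimension sum,
    and per-dimension sum of squares."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        """Fold one point into the summary; the point can then be discarded."""
        self.n += 1
        for d, x in enumerate(point):
            self.sum[d] += x
            self.sumsq[d] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        """Per-dimension variance: E[x^2] - E[x]^2."""
        return [self.sumsq[d] / self.n - (self.sum[d] / self.n) ** 2
                for d in range(len(self.sum))]

    def merge(self, other):
        """Combine two summaries (used when merging CS clusters)."""
        self.n += other.n
        for d in range(len(self.sum)):
            self.sum[d] += other.sum[d]
            self.sumsq[d] += other.sumsq[d]
```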
Key Metric: Mahalanobis Distance
- Normalized distance: d(p, c) = sqrt(Σ_d ((p_d − c_d) / σ_d)²), where σ_d is the cluster's standard deviation in dimension d.
- Accounts for the variance in each dimension; a point joins a DS cluster when this distance is small (e.g., within a few standard deviations of the centroid).
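In code, the formula is a Euclidean distance where each dimension's deviation is first divided by that cluster's standard deviation in that dimension (BFR's Gaussian assumption makes the covariance diagonal, which is why a per-dimension σ suffices):

```python
import math

def mahalanobis(point, centroid, std):
    """Distance from point to centroid, with each dimension's
    deviation scaled by the cluster's std deviation in that dimension."""
    return math.sqrt(sum(((p - c) / s) ** 2
                         for p, c, s in zip(point, centroid, std)))
```

For example, a point two standard deviations out along one axis has Mahalanobis distance 2 regardless of that axis's scale.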
CURE Algorithm (Optional)
- Handles non-spherical clusters.
- Represents each cluster by multiple representative points.
- Can model arbitrary shapes.
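A distinctive step in CURE is shrinking each cluster's representative points a fraction α toward the centroid, which damps the influence of outliers on the cluster's shape. A minimal sketch of just that step (function name and default α are mine):

```python
def shrink_representatives(reps, centroid, alpha=0.2):
    """Move each representative point a fraction alpha of the way
    toward the cluster centroid (CURE's outlier-damping step)."""
    return [tuple(r + alpha * (c - r) for r, c in zip(rep, centroid))
            for rep in reps]
```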
Summary
- Hierarchical: builds a cluster tree; the naive agglomerative algorithm is O(n³).
- k-means: efficient for spherical clusters.
- BFR: scalable for massive datasets using summaries.
- CURE: flexible for arbitrary shapes.