Overview

  • Group similar data points into clusters.
  • Goal: maximize intra-cluster similarity, minimize inter-cluster similarity.

Types of Clustering

  1. Hierarchical (Agglomerative / Divisive)
    • Bottom-up: Merge closest clusters iteratively.
    • Top-down: Start with all data → split recursively.
  2. Point Assignment (e.g., k-means)
    • Maintain centroids.
    • Assign each point to nearest centroid.

Distance Measures

  • Euclidean
  • Cosine
  • Jaccard
  • Edit, etc.
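
As a quick reference, the first three measures can be sketched in plain Python (function names here are our own):

```python
import math

def euclidean(a, b):
    # Straight-line distance between two numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: small when the vectors point the same way,
    # regardless of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| for two sets.
    return 1 - len(a & b) / len(a | b)
```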

Hierarchical Clustering

Key Questions

  1. How to represent a cluster?
    • Centroid: Mean of the cluster's points (Euclidean spaces).
    • Clustroid: The actual data point closest to the others, e.g. minimizing the sum of distances to them (non-Euclidean spaces).
  2. How to measure nearness?
    • Distance between centroids (Euclidean spaces).
    • Minimum or average pairwise distance between points of the two clusters.
  3. When to stop?
    • Known number of clusters.
    • Distance/diameter threshold.
    • Until one cluster remains (dendrogram).
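
A minimal bottom-up sketch, merging the two clusters with the closest centroids until k remain (function names and the centroid-distance merge rule are our choices; real implementations use a priority queue rather than this quadratic scan):

```python
import math

def centroid(cluster):
    # Component-wise mean of the points in a cluster.
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def agglomerative(points, k):
    # Start with each point in its own cluster; repeatedly merge the
    # two clusters whose centroids are closest, until only k remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Stopping at k = 1 instead would record the full merge history, i.e. the dendrogram.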

k-Means Algorithm

Steps

  1. Choose the number of clusters, k.
  2. Initialize k centroids.
  3. Assignment: Assign each point to nearest centroid.
  4. Update: Recalculate each centroid.
  5. Iterate until convergence (stable centroids).
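
The five steps above can be sketched as follows (a toy implementation, not an optimized one; sampling k data points at random is one common choice for step 2):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    # Steps 1-2: pick k initial centroids at random from the data.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3 (assignment): each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Step 4 (update): recompute each centroid as its cluster's mean.
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        # Step 5: stop once the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters
```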

Choosing k

  • Use the elbow method: plot the average within-cluster distance against k and pick the k where the curve stops decreasing sharply (the "elbow").

BFR Algorithm (Bradley-Fayyad-Reina)

  • Extension of k-means for very large data.
  • Assumes clusters are normally distributed around a centroid along each axis (axis-aligned Gaussians).
  • Uses three point sets:
    • DS: Discard set (points close enough to a centroid to be summarized and discarded).
    • CS: Compression set (groups of points close to one another but not to any DS centroid, kept only as summaries).
    • RS: Retained set (isolated outliers kept individually until they can be clustered).

Process

  1. Read data in chunks from disk.
  2. Assign points to closest DS clusters.
  3. Cluster remaining points → new CS or RS.
  4. Update cluster summaries (N, SUM, SUMSQ).
  5. Merge CS and RS where appropriate.
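
The (N, SUM, SUMSQ) summaries from step 4 are enough to recover a cluster's centroid and per-dimension variance without keeping the points themselves; a minimal sketch (the class name is ours):

```python
class ClusterSummary:
    """BFR-style summary of a cluster: N, per-dimension SUM and SUMSQ."""

    def __init__(self, dim):
        self.n = 0                  # N: number of points
        self.sum = [0.0] * dim      # SUM: per-dimension sum of coordinates
        self.sumsq = [0.0] * dim    # SUMSQ: per-dimension sum of squares

    def add(self, point):
        # Fold a point into the summary; the point can then be discarded.
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # E[x^2] - E[x]^2, computed independently in each dimension.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]
```

Two summaries can also be merged by adding their fields component-wise, which is what makes step 5 cheap.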

Key Metric: Mahalanobis Distance

Accounts for the variance in each dimension: d(x, c) = sqrt( Σ ((x_i − c_i) / σ_i)² ). BFR assigns a point to a DS cluster when this distance is below a threshold, e.g. a few standard deviations.
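
A sketch of the simplified, diagonal-covariance form BFR uses (the function name is ours):

```python
import math

def mahalanobis(point, centroid, std):
    # Distance normalized by each dimension's standard deviation, so a
    # dimension with large spread contributes less than a tight one.
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, std)))
```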

CURE Algorithm (Optional)

  • Handles non-spherical clusters.
  • Represents each cluster by multiple representative points.
  • Can model arbitrary shapes.
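
A sketch of how CURE picks its representative points: a farthest-first pass selects scattered points on the cluster's edge, which are then shrunk a fraction alpha toward the centroid (function name and defaults here are illustrative):

```python
import math

def representatives(cluster, m=4, alpha=0.2):
    # Pick up to m scattered points by farthest-first traversal, then
    # shrink each one a fraction alpha toward the cluster centroid to
    # dampen the effect of outliers.
    cx = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    reps = [max(cluster, key=lambda p: math.dist(p, cx))]
    while len(reps) < min(m, len(cluster)):
        reps.append(max(cluster,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return [tuple(x + alpha * (c - x) for x, c in zip(p, cx)) for p in reps]
```

Assignment then uses the distance to the nearest representative point rather than to a single centroid, which is what lets CURE follow elongated or irregular cluster shapes.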

Summary

  • Hierarchical: builds a cluster tree; naive implementation is O(n³), or O(n² log n) with a priority queue.
  • k-means: efficient for spherical clusters.
  • BFR: scalable for massive datasets using summaries.
  • CURE: flexible for arbitrary shapes.