Overview

  • Group similar data points into clusters.
  • Goal: maximize intra-cluster similarity, minimize inter-cluster similarity.

Types of Clustering

  1. Hierarchical (Agglomerative / Divisive)
    • Bottom-up: Merge closest clusters iteratively.
    • Top-down: Start with all data → split recursively.
  2. Point Assignment (e.g., k-means)
    • Maintain centroids.
    • Assign each point to nearest centroid.

Distance Measures

  • Euclidean
  • Cosine
  • Jaccard
  • Edit, etc.
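
As a quick reference, the first three measures can be sketched in plain Python (function names here are our own):

```python
import math

def euclidean(a, b):
    # Straight-line distance between two numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: small when the vectors point the same way,
    # regardless of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B| for two sets.
    return 1 - len(a & b) / len(a | b)
```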

Hierarchical Clustering

Key Questions

  1. How to represent a cluster?
    • Centroid: Mean of the cluster's points (Euclidean spaces).
    • Clustroid: The actual data point closest to the others, e.g. minimizing the sum of distances to them (non-Euclidean spaces).
  2. How to measure nearness?
    • Distance between centroids (Euclidean spaces).
    • Minimum or average pairwise distance between points of the two clusters.
  3. When to stop?
    • Known number of clusters.
    • Distance/diameter threshold.
    • Until one cluster remains (dendrogram).
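
A minimal bottom-up sketch, merging the two clusters with the closest centroids until k remain (function names and the centroid-distance merge rule are our choices; real implementations use a priority queue rather than this quadratic scan):

```python
import math

def centroid(cluster):
    # Component-wise mean of the points in a cluster.
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def agglomerative(points, k):
    # Start with each point in its own cluster; repeatedly merge the
    # two clusters whose centroids are closest, until only k remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Stopping at k = 1 instead would record the full merge history, i.e. the dendrogram.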

k-Means Algorithm

Steps

  1. Choose the number of clusters, k.
  2. Initialize k centroids.
  3. Assignment: Assign each point to nearest centroid.
  4. Update: Recalculate each centroid.
  5. Iterate until convergence (stable centroids).
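
The five steps above can be sketched as follows (a toy implementation, not an optimized one; sampling k data points at random is one common choice for step 2):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    # Steps 1-2: pick k initial centroids at random from the data.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3 (assignment): each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # Step 4 (update): recompute each centroid as its cluster's mean.
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        # Step 5: stop once the centroids no longer move.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters
```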

Choosing k

  • Use the elbow method: plot the average within-cluster distance against k and pick the k where the curve stops decreasing sharply (the "elbow").

BFR Algorithm (Bradley-Fayyad-Reina)

  • Extension of k-means for very large data.
  • Assumes clusters are normally distributed around a centroid along each axis (axis-aligned Gaussians).
  • Uses three point sets:
    • DS: Discard set (points close enough to a centroid to be summarized and discarded).
    • CS: Compression set (groups of points close to one another but not to any DS centroid, kept only as summaries).
    • RS: Retained set (isolated outliers kept individually until they can be clustered).

Process

  1. Read data in chunks from disk.
  2. Assign points to closest DS clusters.
  3. Cluster remaining points → new CS or RS.
  4. Update cluster summaries (N, SUM, SUMSQ).
  5. Merge CS and RS where appropriate.
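
The (N, SUM, SUMSQ) summaries from step 4 are enough to recover a cluster's centroid and per-dimension variance without keeping the points themselves; a minimal sketch (the class name is ours):

```python
class ClusterSummary:
    """BFR-style summary of a cluster: N, per-dimension SUM and SUMSQ."""

    def __init__(self, dim):
        self.n = 0                  # N: number of points
        self.sum = [0.0] * dim      # SUM: per-dimension sum of coordinates
        self.sumsq = [0.0] * dim    # SUMSQ: per-dimension sum of squares

    def add(self, point):
        # Fold a point into the summary; the point can then be discarded.
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # E[x^2] - E[x]^2, computed independently in each dimension.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]
```

Two summaries can also be merged by adding their fields component-wise, which is what makes step 5 cheap.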

Key Metric: Mahalanobis Distance

Accounts for the variance in each dimension: d(x, c) = sqrt( Σ ((x_i − c_i) / σ_i)² ). BFR assigns a point to a DS cluster when this distance is below a threshold, e.g. a few standard deviations.
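
A sketch of the simplified, diagonal-covariance form BFR uses (the function name is ours):

```python
import math

def mahalanobis(point, centroid, std):
    # Distance normalized by each dimension's standard deviation, so a
    # dimension with large spread contributes less than a tight one.
    return math.sqrt(sum(((x - c) / s) ** 2
                         for x, c, s in zip(point, centroid, std)))
```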

CURE Algorithm (Optional)

  • Handles non-spherical clusters.
  • Represents each cluster by multiple representative points.
  • Can model arbitrary shapes.
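
A sketch of how CURE picks its representative points: a farthest-first pass selects scattered points on the cluster's edge, which are then shrunk a fraction alpha toward the centroid (function name and defaults here are illustrative):

```python
import math

def representatives(cluster, m=4, alpha=0.2):
    # Pick up to m scattered points by farthest-first traversal, then
    # shrink each one a fraction alpha toward the cluster centroid to
    # dampen the effect of outliers.
    cx = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    reps = [max(cluster, key=lambda p: math.dist(p, cx))]
    while len(reps) < min(m, len(cluster)):
        reps.append(max(cluster,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return [tuple(x + alpha * (c - x) for x, c in zip(p, cx)) for p in reps]
```

Assignment then uses the distance to the nearest representative point rather than to a single centroid, which is what lets CURE follow elongated or irregular cluster shapes.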

Summary

  • Hierarchical: builds a cluster tree; naive implementation is O(n³), or O(n² log n) with a priority queue.
  • k-means: efficient for spherical clusters.
  • BFR: scalable for massive datasets using summaries.
  • CURE: flexible for arbitrary shapes.