🎓 Intended Learning Outcomes

At the end of this lesson, you should be able to:

Why consider clustering?

In Lessons 5.1 and 5.2 we discussed different ways in which we can measure similarities between individual datapoints $x_i, x_j\in\mathcal{X}$ in terms of their metric distance $d(x_i, x_j)$. However, we want to go further than that and identify distinct groups of elements of $\mathcal{X}$ that naturally belong together in some sense, without being given explicit class labels.
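As a concrete illustration of the ingredients above, here is a minimal sketch (the toy dataset and the choice of Euclidean distance are my assumptions, not from the lesson) of computing the full pairwise distance matrix $D_{ij} = d(x_i, x_j)$ that a clustering algorithm would consume:

```python
import numpy as np

# Hypothetical toy dataset: each row is one datapoint x_i in R^2.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 5.0]])

# Pairwise Euclidean distances d(x_i, x_j) via broadcasting:
# diffs[i, j] = x_i - x_j, so D[i, j] = ||x_i - x_j||.
diffs = X[:, None, :] - X[None, :, :]
D = np.sqrt((diffs ** 2).sum(axis=-1))

print(D.round(2))
```

Any metric distance could be substituted here; the point is that many clustering algorithms only ever see this $n \times n$ matrix, not the raw datapoints.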

<aside> 💡 In clustering problems, we are given a dataset without any explicit labels about group membership and want to instead discover natural groups or “clusters” in the dataset based on a given distance measure. Clustering is one of the most fundamental examples of unsupervised machine learning.

</aside>

Sample Applications:

Remember Lesson 3.6? We in fact already began discussing clustering there, looking at an example of grouping different clothing items together with an algorithm called k-medoids:

Recall our discussion on clustering in Lesson 3.6 using k-medoids?
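To make the k-medoids idea concrete, here is a minimal sketch of the algorithm operating directly on a precomputed distance matrix. The function name, the simple alternating update scheme, and the toy one-dimensional dataset are my own illustrative choices, not the exact formulation from Lesson 3.6:

```python
import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    """Minimal k-medoids sketch: alternate between assigning each point
    to its nearest medoid and re-picking each cluster's medoid as the
    member with the smallest total distance to its fellow members.
    D is a precomputed (n, n) pairwise distance matrix."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: nearest medoid for every point.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: best medoid within each cluster.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    return medoids, labels

# Toy data: two well-separated groups on the real line.
X = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
D = np.abs(X[:, None] - X[None, :])
medoids, labels = k_medoids(D, k=2)
```

Unlike k-means, the cluster representatives here are always actual datapoints (medoids), which is why the algorithm only needs the distance matrix and works for any metric, not just Euclidean distance on vectors.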
A formal definition of what we mean by clustering