K-Means is one of the most widely used algorithms in unsupervised learning. It's fast, intuitive, and works well in many practical scenarios. But its simplicity can sometimes hide important nuances.

In this post, we'll explore:

  • What K-Means actually does
  • The mathematical foundations
  • How to build a solid mental model
  • Its strengths, weaknesses, and ideal use cases

🧠 The Mental Model

Think of K-Means as a game of gravity:

  • You drop \(k\) invisible magnets (centroids) into the data space
  • Every point is attracted to the closest magnet
  • The magnets move to the center of mass of their assigned points
  • Repeat until the system stabilizes (i.e., minimal movement)

K-Means is essentially clustering by proximity to centroids: it iteratively adjusts those centroids until the cluster assignments around them stop changing.

Core Parameters

1. k: Number of clusters

You must specify the number of clusters before the algorithm runs.

There's no “perfect” \(k\), so we typically use methods like:

  • Elbow Method
  • Silhouette Score
  • Gap Statistic
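
For instance, here's a minimal sketch with scikit-learn that compares candidate values of \(k\) by silhouette score (higher is better). The random `X` is just a placeholder for your own feature matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(300, 2)  # placeholder data; use your own features

for k in range(2, 7):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```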

2. Distance Metric

Most implementations use Euclidean distance, though others are possible.

$$
\text{dist}(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}
$$
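
In NumPy, that formula is a one-liner. A tiny worked example with two illustrative points:

```python
import numpy as np

x = np.array([1.0, 2.0])
c = np.array([4.0, 6.0])

# Square the coordinate differences, sum them, take the square root
dist = np.sqrt(np.sum((x - c) ** 2))  # equivalently: np.linalg.norm(x - c)
print(dist)  # 5.0
```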

The K-Means Algorithm (Step by Step)

  1. Initialize \(k\) centroids
    • Randomly select \(k\) points as starting centroids (or use K-Means++)
  2. Assign each point to the nearest centroid
    • Creates temporary clusters
  3. Update centroids
    • Move each centroid to the mean of all points assigned to it
  4. Repeat steps 2–3 until:
    • Centroids do not move significantly
    • Or a max number of iterations is reached
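
To make the loop concrete, here is a minimal NumPy sketch of those four steps. It uses plain random initialization rather than K-Means++, and assumes `X` is an `(n, d)` array of numeric features:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize by sampling k distinct points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points;
        # keep a centroid in place if it lost all of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```

In practice you'd usually reach for `sklearn.cluster.KMeans`, which layers K-Means++ initialization and multiple restarts on top of this same loop.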

Mathematically Speaking

K-Means attempts to minimize the within-cluster sum of squares (WCSS):

$$
\text{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2
$$

Where:

  • \( C_i \): set of points in cluster \(i\)
  • \( \mu_i \): centroid of cluster \(i\)

This is not guaranteed to reach the global optimum, because K-Means can get stuck in local minima. Hence, it's common to run it multiple times with different initializations.
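
In scikit-learn this restart strategy is built in: the `n_init` parameter reruns the algorithm from different initializations and keeps the run with the lowest WCSS (exposed as `inertia_`). A minimal sketch with placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(300, 2)  # placeholder data

# Ten independent runs; the fit with the lowest WCSS wins
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)
print(km.inertia_)  # WCSS of the best run
```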

Properties to Know

  • Type: Centroid-based
  • Input: Numeric, continuous features only
  • Scalability: Very high (efficient even for large datasets)
  • Output: Cluster assignments + centroid positions
  • Complexity: \( O(n \cdot k \cdot i \cdot d) \) for \(n\) points, \(k\) clusters, \(i\) iterations, and \(d\) dimensions
  • Deterministic? ❌ No (unless you fix the random seed)

Choosing \(k\): The Elbow Method

Plot WCSS as a function of \(k\):

  • As \(k\) increases, WCSS decreases
  • Look for the “elbow”: the point where the curve starts to flatten
  • That's often a good balance between under- and overfitting
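
A sketch of that plot with scikit-learn and matplotlib, again on placeholder data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(300, 2)  # placeholder data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()  # look for the bend where the curve starts to flatten
```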

Use Cases

  • Customer segmentation
  • Image compression (e.g., color quantization; see the sketch after this list)
  • Document clustering (on TF-IDF or embedding vectors)
  • Market basket analysis
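
To make one of these concrete: color quantization compresses an image by clustering its pixel colors and repainting every pixel with its cluster's centroid. A sketch assuming Pillow and scikit-learn, with `photo.jpg` as a hypothetical input file:

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = np.asarray(Image.open("photo.jpg").convert("RGB"))  # hypothetical input file
pixels = img.reshape(-1, 3).astype(float)                 # one row per pixel (R, G, B)

# Cluster the pixel colors, then repaint each pixel with its centroid color
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
Image.fromarray(quantized).save("photo_16_colors.jpg")  # image reduced to 16 colors
```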

Limitations

  • Requires \(k\) to be known in advance
  • Assumes clusters are spherical and equally sized
  • Not great with non-convex shapes or varying densities
  • Sensitive to outliers

Final Thoughts

K-Means is fast, elegant, and surprisingly powerful for a wide range of tasks.
But its simplicity means it assumes a lot, especially about the shape and size of clusters.

A good mental model for K-Means is this:

“It clusters by distance to a moving average, and it gets confused by anything nonlinear, non-uniform, or noisy.”