Concepts
Overview: K-means on Arbitrary Metric Spaces
The Quantization Problem
The key insight of kmeanssa-ng is that k-means can be generalized to
quantization of probability distributions on almost any metric space
\((\mathcal{M}, d)\), not just Euclidean space. Given a probability
measure \(P\) on \((\mathcal{M}, d)\), we seek \(k\) centers that minimize:
where \(d: \mathcal{M} \times \mathcal{M} \to \mathbb{R}_+\) is a distance function.
Key distinction: We are not clustering a fixed dataset, but rather quantizing a probability distribution. The algorithm is online: observations arrive sequentially according to a Poisson process, and each new observation \(Y_k \sim P\) causes a drift that pulls the nearest center toward it. Between observations, centers explore the space via Brownian motion. This homogenized version of Simulated Annealing processes data incrementally without storing all observations.
Connection to classical k-means: When \(P\) is a uniform distribution over a finite set of data points, this reduces to the standard k-means problem. But the framework is much more general:
- For graph clustering without weights: use \(P = \text{Uniform}(V)\) over vertices
- For weighted graphs: use \(P\) proportional to vertex weights
- For continuous spaces: use any continuous probability distribution on the metric space
This opens up quantization to:
- Metric graphs (quantum graphs): networks where points can exist anywhere on edges
- Riemannian manifolds: curved spaces with geodesic distances
- Custom spaces: any domain with a meaningful distance metric and probability measure
Why This Matters
Many real-world problems involve probability distributions on non-Euclidean spaces:
- Network data (social networks, transportation, molecules)
- Geographical data on curved surfaces
- Data constrained to manifolds (e.g., directional data on spheres)
Standard k-means fails in these settings because Euclidean distance is
inappropriate. kmeanssa-ng provides a principled framework for
quantizing distributions in these spaces, using stochastic observations
to drive the optimization.
Architecture: A Three-Layer Design
The power of kmeanssa-ng comes from its modular, extensible
architecture that cleanly separates concerns into three layers. This
design makes it possible to apply the same clustering algorithm to any
user-defined space.
Layer 1: The Metric Space (Space)
This layer defines the geometric context. It consists of: - A Space
object that defines the distance metric and knows how to create
points. - Point objects, which are the data points to be clustered. -
Center objects, which are the mobile cluster centers.
The Space is responsible for all geometric calculations: measuring
distances, moving centers, and sampling points.
Layer 2: The Algorithm (SimulatedAnnealing)
This layer contains the core optimization logic. The
SimulatedAnnealing class orchestrates the clustering process by
iteratively moving the Centers to minimize the k-means energy
function.
Crucially, the algorithm is completely independent of the space it
operates on. It interacts with the Space, Points, and Centers
only through the abstract interfaces they provide.
Layer 3: Concrete Implementations
This layer provides concrete implementations of the abstract Space,
Point, and Center classes for specific metric spaces. kmeanssa-ng
comes with a built-in implementation for Quantum
Graphs.
The key benefit of this architecture is its extensibility. Users can add
support for new metric spaces simply by providing their own concrete
implementation of the Space layer. See Custom
Spaces for a detailed guide.