- Published on
Optimal Clustering (ONC)
- Authors

- Name
- Tails Azimuth
Table of Contents
Optimal Clustering (ONC)
This chapter tackles the problem of optimal clustering, which is a form of unsupervised learning. The goal is to find both the ideal composition and number of clusters in a dataset, a common task in finance for risk management, relative value analysis, and understanding market regimes.
Proximity Matrix and Clustering Types
- Proximity Matrix: A clustering algorithm first needs an matrix that measures the "proximity" (similarity or dissimilarity) between all objects.
- This can be a correlation matrix, but it's better to use a true distance metric like (as derived in Section 3).
- Types of Clustering: The text identifies two main classes:
- Partitional: Creates a single level of un-nested groups (e.g., k-means).
- Hierarchical: Creates a nested tree of clusters, from singletons at the bottom to one all-inclusive cluster at the top.
- The Curse of Dimensionality: If the number of features () is much larger than the number of observations (), the data becomes too sparse to cluster. This can be solved by first reducing the features using a method like PCA and using a signal-processing technique (like the Marcenko-Pastur theorem from Section 2) to select the optimal number of components.
ONC: Optimal Number of Clusters
A key problem with partitional algorithms like k-means is that the user must guess the number of clusters (). The "elbow method" is a common but arbitrary way to solve this.
The chapter presents the Optimal Number of Clusters (ONC) algorithm, which is a modified version of k-means that automatically finds the best . It makes three key modifications:
1. An Objective Function (Silhouette Score)
Instead of just minimizing variance, ONC seeks to maximize the quality of the clusters. This is measured by the silhouette score, which balances intra-cluster cohesion and inter-cluster separation.
- Silhouette Coefficient (): For each point , this score compares its average distance to its own cluster () with its average distance to the nearest other cluster ().
- Clustering Quality (): The algorithm maximizes the t-statistic of the silhouette scores (mean / std. dev.), rewarding clusters that are both well-separated (high mean) and of consistent quality (low variance).
2. Solving the Initialization Problem
Standard k-means is sensitive to its random starting points. The ONC "Base Clustering" stage solves this by running k-means multiple times with different initializations (n_init) across a range of possible values (e.g., to ). It keeps only the (K, initialization) combination that produces the highest overall quality score .
3. Higher-Level Clustering (Recursive Refinement)
The "Base Clustering" might find some good clusters and some bad ones. The "Higher-Level" stage refines the result:
- It finds all clusters from the base step with below-average quality ().
- It groups all points from these "bad" clusters into a new, reduced observations matrix.
- It re-runs the full "Base Clustering" (Step 1 & 2) only on this "bad" subset.
- If the new, refined clustering has a better average quality, the algorithm accepts the refinement. This allows ONC to find distinct clusters while iteratively breaking down and re-evaluating the "messy" parts of the data.
Experimental Results
To test ONC, the authors:
- Created random, block-diagonal correlation matrices with a known number of clusters ().
- Shuffled these matrices to hide the structure.
- Ran the ONC algorithm to see if it could recover the original, correct .
The results show that ONC effectively recovers the true number of clusters, with the ratio of Estimated K / Actual K being very close to 1 across simulations.
API reference
RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):
| Python | Julia |
|---|---|
| |
| |
| |
| |
| |
| |