Table of Contents

Optimal Clustering (ONC)

This chapter tackles the problem of optimal clustering, which is a form of unsupervised learning. The goal is to find both the ideal composition and number of clusters in a dataset, a common task in finance for risk management, relative value analysis, and understanding market regimes.

Proximity Matrix and Clustering Types

Proximity Matrix: A clustering algorithm first needs an $N \times N$ matrix that measures the "proximity" (similarity or dissimilarity) between all $N$ objects.
- This can be a correlation matrix, but it's better to use a true distance metric like $d_{\rho}[X, Y]=\sqrt{\frac{1}{2}(1-\rho[X, Y])}$ (as derived in Section 3).
Types of Clustering: The text identifies two main classes:
1. Partitional: Creates a single level of un-nested groups (e.g., k-means).
2. Hierarchical: Creates a nested tree of clusters, from singletons at the bottom to one all-inclusive cluster at the top.
The Curse of Dimensionality: If the number of features ( $F$ ) is much larger than the number of observations ( $N$ ), the data becomes too sparse to cluster. This can be solved by first reducing the features using a method like PCA and using a signal-processing technique (like the Marcenko-Pastur theorem from Section 2) to select the optimal number of components.

ONC: Optimal Number of Clusters

A key problem with partitional algorithms like k-means is that the user must guess the number of clusters ( $K$ ). The "elbow method" is a common but arbitrary way to solve this.

The chapter presents the Optimal Number of Clusters (ONC) algorithm, which is a modified version of k-means that automatically finds the best $K$ . It makes three key modifications:

1. An Objective Function (Silhouette Score)

Instead of just minimizing variance, ONC seeks to maximize the quality of the clusters. This is measured by the silhouette score, which balances intra-cluster cohesion and inter-cluster separation.

Silhouette Coefficient ( $S_i$ ): For each point $i$ , this score compares its average distance to its own cluster ( $a_i$ ) with its average distance to the nearest other cluster ( $b_i$ ). $S_{i}=\frac{b_{i}-a_{i}}{\max \left\{a_{i}, b_{i}\right\}}$
Clustering Quality ( $q$ ): The algorithm maximizes the t-statistic of the silhouette scores (mean / std. dev.), rewarding clusters that are both well-separated (high mean) and of consistent quality (low variance). $q=\frac{\mathrm{E}\left[\left\{S_{i}\right\}\right]}{\sqrt{\mathrm{V}\left[\left\{S_{i}\right\}\right]}}$

2. Solving the Initialization Problem

Standard k-means is sensitive to its random starting points. The ONC "Base Clustering" stage solves this by running k-means multiple times with different initializations (n_init) across a range of possible $K$ values (e.g., $K=2$ to $N$ ). It keeps only the (K, initialization) combination that produces the highest overall quality score $q$ .

The "Base Clustering" might find some good clusters and some bad ones. The "Higher-Level" stage refines the result:

It finds all clusters from the base step with below-average quality ( $q_k < \bar{q}$ ).
It groups all points from these "bad" clusters into a new, reduced observations matrix.
It re-runs the full "Base Clustering" (Step 1 & 2) only on this "bad" subset.
If the new, refined clustering has a better average quality, the algorithm accepts the refinement. This allows ONC to find distinct clusters while iteratively breaking down and re-evaluating the "messy" parts of the data.

Experimental Results

To test ONC, the authors:

Created random, block-diagonal correlation matrices with a known number of clusters ( $K$ ).
Shuffled these matrices to hide the structure.
Ran the ONC algorithm to see if it could recover the original, correct $K$ .

The results show that ONC effectively recovers the true number of clusters, with the ratio of Estimated K / Actual K being very close to 1 across simulations.

API reference

RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):

Python	Julia
`def cluster_k_means_base( correlation: pd.DataFrame, max_clusters: int = 10, iterations: int = 10, random_state: Optional[int] = None, ) -> tuple[pd.DataFrame, dict[int, list[str]], pd.Series]:`	`function cluster_k_means_base( correlation::AbstractMatrix{<:Real}; max_clusters::Integer = 10, iterations::Integer = 10, random_state = nothing, )`
`def cluster_k_means_top( correlation: pd.DataFrame, max_clusters: Optional[int] = None, iterations: int = 10, random_state: Optional[int] = None, ) -> tuple[pd.DataFrame, dict[int, list[str]], pd.Series]:`	`function cluster_k_means_top( correlation::AbstractMatrix{<:Real}; max_clusters = nothing, iterations::Integer = 10, random_state = nothing, )`
`def make_new_outputs( correlation: pd.DataFrame, clusters_1: dict[int, list[str]], clusters_2: dict[int, list[str]], ) -> tuple[pd.DataFrame, dict[int, list[str]], pd.Series]:`	`function make_new_outputs( correlation::AbstractMatrix{<:Real}, clusters_1::AbstractDict, clusters_2::AbstractDict, )`
`def random_covariance_sub( n_observations: int, n_columns: int, sigma: float, random_state: Optional[int] = None, ) -> np.ndarray:`	`function random_covariance_sub( n_observations::Integer, n_columns::Integer, sigma::Real; random_state = nothing, )`
`def random_block_covariance( n_columns: int, n_blocks: int, block_size_min: int = 1, sigma: float = 1.0, random_state: Optional[int] = None, ) -> np.ndarray:`	`function random_block_covariance( n_columns::Integer, n_blocks::Integer; block_size_min::Integer = 1, sigma::Real = 1.0, random_state = nothing, )`
`def random_block_correlation( n_columns: int, n_blocks: int, random_state: Optional[int] = None, block_size_min: int = 1, ) -> pd.DataFrame:`	`function random_block_correlation( n_columns::Integer, n_blocks::Integer; random_state = nothing, block_size_min::Integer = 1, )`

Full source: Python · Julia