Published on

Optimal Clustering (ONC)

Authors
Table of Contents

Optimal Clustering (ONC)

This chapter tackles the problem of optimal clustering, which is a form of unsupervised learning. The goal is to find both the ideal composition and number of clusters in a dataset, a common task in finance for risk management, relative value analysis, and understanding market regimes.


Proximity Matrix and Clustering Types

  • Proximity Matrix: A clustering algorithm first needs an N×NN \times N matrix that measures the "proximity" (similarity or dissimilarity) between all NN objects.
    • This can be a correlation matrix, but it's better to use a true distance metric like dρ[X,Y]=12(1ρ[X,Y])d_{\rho}[X, Y]=\sqrt{\frac{1}{2}(1-\rho[X, Y])} (as derived in Section 3).
  • Types of Clustering: The text identifies two main classes:
    1. Partitional: Creates a single level of un-nested groups (e.g., k-means).
    2. Hierarchical: Creates a nested tree of clusters, from singletons at the bottom to one all-inclusive cluster at the top.
  • The Curse of Dimensionality: If the number of features (FF) is much larger than the number of observations (NN), the data becomes too sparse to cluster. This can be solved by first reducing the features using a method like PCA and using a signal-processing technique (like the Marcenko-Pastur theorem from Section 2) to select the optimal number of components.

ONC: Optimal Number of Clusters

A key problem with partitional algorithms like k-means is that the user must guess the number of clusters (KK). The "elbow method" is a common but arbitrary way to solve this.

The chapter presents the Optimal Number of Clusters (ONC) algorithm, which is a modified version of k-means that automatically finds the best KK. It makes three key modifications:

1. An Objective Function (Silhouette Score)

Instead of just minimizing variance, ONC seeks to maximize the quality of the clusters. This is measured by the silhouette score, which balances intra-cluster cohesion and inter-cluster separation.

  • Silhouette Coefficient (SiS_i): For each point ii, this score compares its average distance to its own cluster (aia_i) with its average distance to the nearest other cluster (bib_i).
    Si=biaimax{ai,bi}S_{i}=\frac{b_{i}-a_{i}}{\max \left\{a_{i}, b_{i}\right\}}
  • Clustering Quality (qq): The algorithm maximizes the t-statistic of the silhouette scores (mean / std. dev.), rewarding clusters that are both well-separated (high mean) and of consistent quality (low variance).
    q=E[{Si}]V[{Si}]q=\frac{\mathrm{E}\left[\left\{S_{i}\right\}\right]}{\sqrt{\mathrm{V}\left[\left\{S_{i}\right\}\right]}}

2. Solving the Initialization Problem

Standard k-means is sensitive to its random starting points. The ONC "Base Clustering" stage solves this by running k-means multiple times with different initializations (n_init) across a range of possible KK values (e.g., K=2K=2 to NN). It keeps only the (K, initialization) combination that produces the highest overall quality score qq.

3. Higher-Level Clustering (Recursive Refinement)

The "Base Clustering" might find some good clusters and some bad ones. The "Higher-Level" stage refines the result:

  1. It finds all clusters from the base step with below-average quality (qk<qˉq_k < \bar{q}).
  2. It groups all points from these "bad" clusters into a new, reduced observations matrix.
  3. It re-runs the full "Base Clustering" (Step 1 & 2) only on this "bad" subset.
  4. If the new, refined clustering has a better average quality, the algorithm accepts the refinement. This allows ONC to find distinct clusters while iteratively breaking down and re-evaluating the "messy" parts of the data.

Experimental Results

To test ONC, the authors:

  1. Created random, block-diagonal correlation matrices with a known number of clusters (KK).
  2. Shuffled these matrices to hide the structure.
  3. Ran the ONC algorithm to see if it could recover the original, correct KK.

The results show that ONC effectively recovers the true number of clusters, with the ratio of Estimated K / Actual K being very close to 1 across simulations.

API reference

RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):

PythonJulia
def cluster_k_means_base(
    correlation: pd.DataFrame,
    max_clusters: int = 10,
    iterations: int = 10,
    random_state: Optional[int] = None,
) -> tuple[pd.DataFrame, dict[int, list[str]], pd.Series]:
function cluster_k_means_base(
    correlation::AbstractMatrix{<:Real};
    max_clusters::Integer = 10,
    iterations::Integer = 10,
    random_state = nothing,
)
def cluster_k_means_top(
    correlation: pd.DataFrame,
    max_clusters: Optional[int] = None,
    iterations: int = 10,
    random_state: Optional[int] = None,
) -> tuple[pd.DataFrame, dict[int, list[str]], pd.Series]:
function cluster_k_means_top(
    correlation::AbstractMatrix{<:Real};
    max_clusters = nothing,
    iterations::Integer = 10,
    random_state = nothing,
)
def make_new_outputs(
    correlation: pd.DataFrame,
    clusters_1: dict[int, list[str]],
    clusters_2: dict[int, list[str]],
) -> tuple[pd.DataFrame, dict[int, list[str]], pd.Series]:
function make_new_outputs(
    correlation::AbstractMatrix{<:Real},
    clusters_1::AbstractDict,
    clusters_2::AbstractDict,
)
def random_covariance_sub(
    n_observations: int,
    n_columns: int,
    sigma: float,
    random_state: Optional[int] = None,
) -> np.ndarray:
function random_covariance_sub(
    n_observations::Integer,
    n_columns::Integer,
    sigma::Real;
    random_state = nothing,
)
def random_block_covariance(
    n_columns: int,
    n_blocks: int,
    block_size_min: int = 1,
    sigma: float = 1.0,
    random_state: Optional[int] = None,
) -> np.ndarray:
function random_block_covariance(
    n_columns::Integer,
    n_blocks::Integer;
    block_size_min::Integer = 1,
    sigma::Real = 1.0,
    random_state = nothing,
)
def random_block_correlation(
    n_columns: int,
    n_blocks: int,
    random_state: Optional[int] = None,
    block_size_min: int = 1,
) -> pd.DataFrame:
function random_block_correlation(
    n_columns::Integer,
    n_blocks::Integer;
    random_state = nothing,
    block_size_min::Integer = 1,
)

Full source: Python · Julia