Distance Metrics in Machine Learning

Correlation measures only linear codependence, which can be misleading. It also fails to satisfy the properties of a metric, such as nonnegativity and the triangle inequality. A metric $d_{\rho}[X, Y]=\sqrt{\frac{1}{2}(1-\rho[X, Y])}$ can be formed from correlation. It inherits its properties from the Euclidean distance between the z-standardized variables, which makes it a true metric.

Another normalized correlation-based distance, $d_{|\rho|}[X, Y]=\sqrt{\frac{1}{2}(1-|\rho[X, Y]|)}$, can also be defined. This metric is useful in applications where strongly negative correlations should be treated as similar.

$$d_{\rho}[X, Y] = \sqrt{\frac{1}{2}(1-\rho[X, Y])}$$

$$d_{|\rho|}[X, Y] = \sqrt{\frac{1}{2}(1-|\rho[X, Y]|)}$$
distance_metric.py
import numpy as np

def calculate_distance(
    dependence: np.ndarray,
    metric: str = "angular"
) -> np.ndarray:
    # Convert a dependence (e.g. correlation) matrix into a distance matrix.
    if metric == "angular":
        # d = sqrt((1 - rho) / 2): negative correlation maps to a large distance.
        distance = ((1 - dependence).round(5) / 2.) ** 0.5
    elif metric == "absolute_angular":
        # d = sqrt((1 - |rho|) / 2): negative correlation counts as similarity.
        distance = ((1 - abs(dependence)).round(5) / 2.) ** 0.5
    else:
        raise ValueError(f"unknown metric: {metric}")
    return distance

View More: Julia | Python
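
As a quick usage sketch (not part of the library; the synthetic series and the seed below are illustrative assumptions), the correlation matrix of a few series can be turned into angular distance matrices with the function above:

import numpy as np

# Three illustrative series: one base series, one positively and one negatively correlated with it.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
returns = np.column_stack([
    base,
    0.8 * base + rng.normal(size=1000),
    -base + 0.1 * rng.normal(size=1000),
])

correlation = np.corrcoef(returns, rowvar=False)             # 3 x 3 correlation matrix
angular = calculate_distance(correlation, "angular")         # sqrt((1 - rho) / 2)
absolute = calculate_distance(correlation, "absolute_angular")

print(angular.round(3))    # the negatively correlated pair is far apart
print(absolute.round(3))   # the same pair is now close, since |rho| is used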

Information Theory: Marginal and Joint Entropy

Correlation has limitations: it neglects nonlinear relationships, is sensitive to outliers, and is mostly meaningful for normally distributed variables. To address this, we can use Shannon's entropy, defined for a discrete random variable $X$:

$$H[X] = -\sum_{x \in S_{X}} p[x] \log [p[x]]$$

The entropy $H[X]$ measures the amount of uncertainty, or 'surprise', associated with $X$.

For two discrete random variables $X$ and $Y$, the joint entropy is:

$$H[X, Y] = -\sum_{(x, y) \in S_{X} \times S_{Y}} p[x, y] \log [p[x, y]]$$

Conditional Entropy

Conditional entropy $H[X \mid Y]$ measures the uncertainty that remains in $X$ once $Y$ is known:

$$H[X \mid Y] = H[X, Y] - H[Y]$$
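
To make these definitions concrete, here is a small numerical sketch (the joint probability table is a made-up example) that computes the marginal, joint, and conditional entropies directly:

import numpy as np

# Joint probability table p[x, y] for two binary variables (rows: X, columns: Y).
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

p_x = joint.sum(axis=1)                        # marginal p[x]
p_y = joint.sum(axis=0)                        # marginal p[y]

entropy_x = -np.sum(p_x * np.log(p_x))         # H[X]
entropy_y = -np.sum(p_y * np.log(p_y))         # H[Y]
entropy_xy = -np.sum(joint * np.log(joint))    # H[X, Y]

conditional_x_given_y = entropy_xy - entropy_y  # H[X | Y] = H[X, Y] - H[Y]
print(entropy_x, entropy_xy, conditional_x_given_y)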

Divergence Measures: Kullback-Leibler and Cross-Entropy

Kullback-Leibler (KL) divergence quantifies how much one probability distribution $p$ diverges from another distribution $q$:

$$D_{KL}[p \| q] = \sum_{x \in S_{X}} p[x] \log \left[\frac{p[x]}{q[x]}\right]$$

Cross-entropy $H_C[p \| q]$ measures the information content when the wrong distribution $q$ is used in place of the true distribution $p$:

$$H_{C}[p \| q] = -\sum_{x \in S_{X}} p[x] \log [q[x]] = H[X] + D_{KL}[p \| q]$$
distance_metric.py
import numpy as np

def calculate_kullback_leibler_divergence(
    p: np.ndarray,
    q: np.ndarray
) -> float:
    # D_KL[p || q] = sum_x p[x] * log(p[x] / q[x]); p and q must be probability vectors.
    divergence = -(p * np.log(q / p)).sum()
    return divergence


def calculate_cross_entropy(
    p: np.ndarray,
    q: np.ndarray
) -> float:
    # H_C[p || q] = -sum_x p[x] * log(q[x]) = H[p] + D_KL[p || q].
    entropy = -(p * np.log(q)).sum()
    return entropy

View More: Julia | Python
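
A brief usage sketch (the two probability vectors are made up for illustration) confirms the decomposition $H_C[p \| q] = H[X] + D_{KL}[p \| q]$ numerically:

import numpy as np

p = np.array([0.25, 0.25, 0.50])   # "true" distribution
q = np.array([0.40, 0.40, 0.20])   # "wrong" distribution

kl_divergence = calculate_kullback_leibler_divergence(p, q)
cross_entropy = calculate_cross_entropy(p, q)
entropy_p = -(p * np.log(p)).sum()

# Cross-entropy decomposes into the entropy of p plus the KL divergence.
print(np.isclose(cross_entropy, entropy_p + kl_divergence))   # True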

Mutual Information

The mutual information $I[X; Y]$ measures the amount of information $X$ and $Y$ share:

$$I[X; Y] = H[X] + H[Y] - H[X, Y]$$

Mutual information can also be normalized so that it lies between 0 and 1 and is comparable across variable pairs; a common choice is the normalized mutual information $NMI$:

$$NMI[X, Y] = \frac{I[X; Y]}{\sqrt{H[X] H[Y]}}$$

Conclusion

Both correlation-based and entropy-based measures have their place in modern applications. Correlation-based measures are computationally cheaper and have a long history in statistics, whereas entropy-based measures also capture nonlinear relationships between variables. Implementing these concepts can enhance your analytics and decision-making processes.

distance_metric.py
import numpy as np
from sklearn.metrics import mutual_info_score
import scipy.stats as ss

def calculate_number_of_bins(
    num_observations: int,
    correlation: float = None
) -> int:
    # Optimal number of bins (Hacine-Gharbi et al.) for marginal entropy,
    # or for joint entropy when a correlation estimate is supplied (|correlation| < 1).
    if correlation is None:
        z = (8 + 324 * num_observations + 12 * (36 * num_observations + 729 * num_observations ** 2) ** .5) ** (1 / 3.)
        bins = round(z / 6. + 2. / (3 * z) + 1. / 3)
    else:
        bins = round(2 ** -.5 * (1 + (1 + 24 * num_observations / (1. - correlation ** 2)) ** .5) ** .5)
    return int(bins)

def calculate_mutual_information(
    x: np.ndarray,
    y: np.ndarray,
    norm: bool = False
) -> float:
    # Estimate I[X; Y] from a 2D histogram built with the optimal (joint) number of bins.
    num_bins = calculate_number_of_bins(x.shape[0], correlation=np.corrcoef(x, y)[0, 1])
    histogram_xy = np.histogram2d(x, y, num_bins)[0]
    mutual_information = mutual_info_score(None, None, contingency=histogram_xy)

    if norm:
        # Normalize by the geometric mean of the marginal entropies, matching the NMI definition above.
        marginal_x = ss.entropy(np.histogram(x, num_bins)[0])
        marginal_y = ss.entropy(np.histogram(y, num_bins)[0])
        mutual_information /= np.sqrt(marginal_x * marginal_y)

    return mutual_information

View More: Julia | Python
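
As a usage sketch (the synthetic data-generating process and the seed are illustrative assumptions), the normalized mutual information of two linearly related series can be estimated with the function above:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)
y = x + 0.2 * rng.normal(size=10_000)      # strong linear relationship with a little noise

print(round(calculate_mutual_information(x, y, norm=False), 3))   # raw mutual information in nats
print(round(calculate_mutual_information(x, y, norm=True), 3))    # normalized value between 0 and 1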

Variation of Information: A Simplified Guide

Understanding Variation of Information

Variation of Information (VI) measures the information that is not shared between two variables, that is, the uncertainty that remains about one variable once we observe the other. It has two terms:

  • Uncertainty in $X$ given $Y$: $H[X \mid Y]$
  • Uncertainty in $Y$ given $X$: $H[Y \mid X]$

So, the formula becomes:

$$VI[X, Y] = H[X \mid Y] + H[Y \mid X]$$

We can also express it using mutual information $I$ and joint entropy $H[X, Y]$:

$$VI[X, Y] = 2 H[X, Y] - H[X] - H[Y]$$

or

$$VI[X, Y] = H[X, Y] - I[X; Y]$$

Normalized VI

To compare VI across varying population sizes, we can normalize it:

$$\widetilde{VI}[X, Y] = \frac{VI[X, Y]}{H[X, Y]}$$

An alternative normalized metric is:

$$\tilde{\tilde{VI}}[X, Y] = 1-\frac{I[X; Y]}{\max\{H[X], H[Y]\}}$$

Calculating Basic Measures: Entropy and Mutual Information

These functionalities are available in both Python and Julia in the RiskLabAI library.

Continuous Variables and Discretization

For continuous variables, entropy is defined through an integral (differential entropy). In practice, we discretize the continuous data into bins to approximate it. For a Gaussian random variable $X$ with variance $\sigma^2$:

$$H[X] = \frac{1}{2} \log \left[2 \pi e \sigma^{2}\right]$$

The entropy can be estimated as:

$$\hat{H}[X]=-\sum_{i=1}^{B_{X}} \frac{N_{i}}{N} \log \left[\frac{N_{i}}{N}\right]+\log \left[\Delta_{X}\right]$$

where $N_i$ is the number of observations in bin $i$, $N$ the total number of observations, $B_X$ the number of bins, and $\Delta_X$ the bin width.
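
The following sketch (illustrative; the sample size, the number of bins, and $\sigma$ are arbitrary choices) applies this binned estimator to a simulated Gaussian sample and compares it with the closed-form value above:

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
sample = rng.normal(scale=sigma, size=100_000)

counts, edges = np.histogram(sample, bins=100)   # an arbitrary binning; optimal binning is discussed next
frequencies = counts / counts.sum()              # N_i / N
bin_width = edges[1] - edges[0]                  # Delta_X

nonzero = frequencies[frequencies > 0]
estimated_entropy = -np.sum(nonzero * np.log(nonzero)) + np.log(bin_width)
closed_form_entropy = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print(round(estimated_entropy, 3), round(closed_form_entropy, 3))  # the two values should be close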

For optimal binning, we can use formulas derived by Hacine-Gharbi and others. These vary depending on whether you are looking at the marginal entropy or the joint entropy.

Calculating Variation of Information with Optimal Binning

distance_metric.py
import numpy as np
import scipy.stats as ss
from sklearn.metrics import mutual_info_score

def calculate_number_of_bins(
    num_observations: int,
    correlation: float = None
) -> int:
    # Optimal number of bins (Hacine-Gharbi et al.), as defined earlier.
    if correlation is None:
        z = (8 + 324 * num_observations + 12 * (36 * num_observations + 729 * num_observations ** 2) ** .5) ** (1 / 3.)
        bins = round(z / 6. + 2. / (3 * z) + 1. / 3)
    else:
        bins = round(2 ** -.5 * (1 + (1 + 24 * num_observations / (1. - correlation ** 2)) ** .5) ** .5)
    return int(bins)

def calculate_variation_of_information_extended(
    x: np.ndarray,
    y: np.ndarray,
    norm: bool = False
) -> float:
    # VI[X, Y] = H[X] + H[Y] - 2 I[X; Y], estimated from histograms with optimal binning.
    num_bins = calculate_number_of_bins(x.shape[0], correlation=np.corrcoef(x, y)[0, 1])
    histogram_xy = np.histogram2d(x, y, num_bins)[0]
    mutual_information = mutual_info_score(None, None, contingency=histogram_xy)
    marginal_x = ss.entropy(np.histogram(x, num_bins)[0])
    marginal_y = ss.entropy(np.histogram(y, num_bins)[0])
    variation_xy = marginal_x + marginal_y - 2 * mutual_information

    if norm:
        # Normalize by the joint entropy H[X, Y] = H[X] + H[Y] - I[X; Y].
        joint_xy = marginal_x + marginal_y - mutual_information
        variation_xy /= joint_xy

    return variation_xy

View More: Julia | Python

The code above shows how to calculate the Variation of Information with optimal binning in Python; a Julia implementation is available through the link above. If you set norm=True, you'll get the normalized value.
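
A short usage sketch (the nonlinear data-generating process below is an illustrative assumption):

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=10_000)
y = np.sin(x) + 0.1 * rng.normal(size=10_000)    # nonlinear but strongly dependent

print(round(calculate_variation_of_information_extended(x, y, norm=False), 3))  # VI in nats
print(round(calculate_variation_of_information_extended(x, y, norm=True), 3))   # normalized VI in [0, 1]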

Understanding Partitions in Data Sets

A partition, denoted $P$, is a way to divide a dataset $D$ into non-overlapping subsets. Mathematically, these subsets $D_k$ satisfy three properties:

  • Every subset contains at least one element, i.e., $\|D_k\| > 0$.
  • Subsets do not overlap, i.e., $D_k \cap D_l = \varnothing$ for $k \neq l$.
  • Together, the subsets cover the entire dataset, i.e., $\bigcup_{k=1}^{K} D_k = D$.

Metrics for Comparing Partitions

We define the uncertainty, or randomness, associated with a partition $P$ in terms of its entropy:

$$H[P] = -\sum_{k=1}^{K} p[k] \log [p[k]]$$

Here, $p[k]$ is the probability that a randomly chosen element of $D$ belongs to the subset $D_k$.

If we have another partition $P'$, we can define joint entropy, conditional entropy, mutual information, and variation of information to compare the two partitions. These metrics measure how similar or dissimilar two divisions of the same dataset are.
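
As a small sketch (the cluster labels below are made up), the entropy of a partition can be computed from subset frequencies, and two partitions of the same dataset can be compared through their mutual information:

import numpy as np
from sklearn.metrics import mutual_info_score

# Two hypothetical partitions (cluster label vectors) of the same ten elements.
partition_p = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
partition_p_prime = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

def calculate_partition_entropy(labels: np.ndarray) -> float:
    # p[k] is the share of elements that fall into subset D_k.
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log(probabilities)))

print(calculate_partition_entropy(partition_p))               # H[P]
print(mutual_info_score(partition_p, partition_p_prime))      # I[P; P'] between the two partitions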

Applications in Machine Learning

Variation of information is particularly useful in unsupervised learning to compare the output from clustering algorithms. It offers a normalized way to compare partitioning methods across various datasets.

Experimental Results Summarized

  • No Relationship: When two variables are unrelated, both the correlation and the normalized mutual information are close to zero.
  • Linear Relationship: For a strong linear relationship, both measures are high, although the normalized mutual information stays slightly below 1 because the noise leaves some residual uncertainty.
  • Nonlinear Relationship: Here the correlation fails to capture the relationship, while the normalized mutual information still reveals a substantial amount of shared information between the variables (the sketch below reproduces these three cases).
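
The three cases can be reproduced with a short sketch (assuming the calculate_mutual_information function defined earlier; the specific data-generating processes are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
noise = rng.normal(size=10_000)

cases = {
    "no relationship": rng.normal(size=10_000),
    "linear": x + 0.5 * noise,
    "nonlinear": abs(x) + 0.1 * noise,
}

for name, y in cases.items():
    correlation = np.corrcoef(x, y)[0, 1]
    nmi = calculate_mutual_information(x, y, norm=True)
    print(f"{name:16s} correlation={correlation:+.2f}  normalized MI={nmi:.2f}")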


References

  1. López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
  2. López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press.