Testing Set Overfitting

This chapter explains why most backtested strategies in finance are false discoveries. The core problem is Selection Bias under Multiple Testing (SBuMT), also known as "backtest overfitting."

A backtest is a historical simulation, not a controlled experiment. Researchers (and firms) run thousands of trials and only present the best-performing ones. This process of "cherry-picking" inflates performance and leads to strategies that fail in live trading. The chapter provides a framework for quantifying this inflation.

The Problem in Terms of Precision and Recall

The reliability of a "significant" backtest depends on the "prior" probability that a strategy is truly profitable.

Let $\theta = s_T / s_F$ be the odds ratio of true strategies ( $s_T$ ) to false strategies ( $s_F$ ). In finance, $\theta$ is very low.
Let $\alpha$ be the false positive rate (Type I error) and $\beta$ be the false negative rate (Type II error).

The precision (the probability that a strategy proven significant is actually true) is:

$$\text{precision}=\frac{(1-\beta)\,\theta}{(1-\beta)\,\theta+\alpha}$$

Key Insight: Even with a low $\alpha$ (e.g., a 5% significance level), if the odds $\theta$ are tiny (e.g., 1/99), the precision will be very low. The text calculates that testing at the 5% level could imply an 86% false discovery rate, a figure that follows from the precision formula with $\beta = 0.2$.
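The arithmetic behind that figure can be checked with a few lines of Python. This is a minimal sketch, not part of any library: the function name `precision` is illustrative, and the value $\beta = 0.2$ (80% power) is an assumption that happens to reproduce the 86% false discovery rate quoted above.

```python
def precision(theta: float, alpha: float, beta: float) -> float:
    """Probability that a strategy flagged as significant is truly profitable."""
    true_positive_mass = (1.0 - beta) * theta  # recall times the odds of a true strategy
    false_positive_mass = alpha                # false positive rate per unit of false strategies
    return true_positive_mass / (true_positive_mass + false_positive_mass)


# Assumed inputs: 1 true strategy per 99 false ones, 5% significance, 80% power.
p = precision(theta=1 / 99, alpha=0.05, beta=0.20)
print(f"precision = {p:.4f}, false discovery rate = {1 - p:.4f}")
# precision is about 0.139, so the false discovery rate is about 0.861
```

Lowering $\alpha$ helps only slowly here: because $\theta$ multiplies the numerator, the odds of a true strategy dominate the result.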

Multiple Testing and Error Rates

When $K$ independent trials are run, the error rates are compounded:

Familywise Error Rate (FWER): The probability of getting at least one false positive. $\alpha_{K}=1-(1-\alpha)^{K}$
Familywise Miss Rate: The probability of missing all true positives. $\beta_{K}=\beta^{K}$

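Substituting $\alpha_K$ and $\beta_K$ for $\alpha$ and $\beta$ makes the adjusted precision even worse: when $\theta$ is small, the false positive rate grows toward 1 much faster than the gain in recall can offset.

$$\text{precision}=\frac{\left(1-\beta^{K}\right)\theta}{\left(1-\beta^{K}\right)\theta+1-(1-\alpha)^{K}}$$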
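To see how quickly precision degrades with the number of trials, the familywise formulas can be evaluated directly. The sketch below uses illustrative function names and assumed inputs ($\theta = 1/99$, $\alpha = 0.05$, $\beta = 0.2$); it also assumes the $K$ trials are independent, as the formulas do.

```python
def familywise_rates(alpha: float, beta: float, k: int) -> tuple[float, float]:
    """Return (alpha_K, beta_K) for k independent trials."""
    alpha_k = 1.0 - (1.0 - alpha) ** k  # at least one false positive
    beta_k = beta ** k                  # every trial misses a true strategy
    return alpha_k, beta_k


def precision_after_k_trials(theta: float, alpha: float, beta: float, k: int) -> float:
    """Precision once the best of k independent trials is reported."""
    alpha_k, beta_k = familywise_rates(alpha, beta, k)
    recall_k = 1.0 - beta_k
    return recall_k * theta / (recall_k * theta + alpha_k)


# Assumed inputs: theta = 1/99, alpha = 0.05, beta = 0.2.
for k in (1, 10, 100):
    print(k, round(precision_after_k_trials(1 / 99, 0.05, 0.20, k), 4))
```

With these inputs, precision falls from roughly 14% for a single trial to under 3% after ten trials.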

The Sharpe Ratio (SR) and SBuMT

The chapter provides a framework for correcting the Sharpe Ratio (SR) for selection bias.

1. The Distribution of the Sharpe Ratio

The estimated SR ( $\widehat{\text{SR}}$ ) is asymptotically Normal, even if returns are non-Normal. However, its variance depends on the returns' skewness ( $\gamma_3$ ) and kurtosis ( $\gamma_4$ ).

Mertens' Asymptotic Distribution: $(\widehat{\mathrm{SR}}-\mathrm{SR}) \stackrel{a}{\rightarrow} \mathcal{N}\left[0, \frac{1+\frac{1}{2} \mathrm{SR}^{2}-\gamma_{3} \mathrm{SR}+\frac{\gamma_{4}-3}{4} \mathrm{SR}^{2}}{T}\right]$
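The variance in Mertens' result translates directly into a standard error for the estimated SR. The following is a minimal sketch with an illustrative function name; it assumes the SR is expressed at the same frequency as the $T$ observations (not annualized) and that kurtosis is the raw, non-excess value, so Normal returns have `kurt=3`.

```python
import math


def sharpe_ratio_std(sr: float, t: int, skew: float = 0.0, kurt: float = 3.0) -> float:
    """Asymptotic standard deviation of the estimated Sharpe ratio (Mertens)."""
    variance = (1.0 + 0.5 * sr**2 - skew * sr + (kurt - 3.0) / 4.0 * sr**2) / t
    return math.sqrt(variance)


# Hypothetical example: per-period SR of 0.1 estimated from 250 observations.
normal_case = sharpe_ratio_std(0.1, 250)                       # Normal returns
fat_tailed_case = sharpe_ratio_std(0.1, 250, skew=-1.0, kurt=6.0)
print(normal_case, fat_tailed_case)
```

For a positive SR, negative skewness and fat tails both widen the standard error, which is why ignoring them overstates significance.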

2. The "False Strategy" Theorem

This theorem estimates the Expected Maximum Sharpe Ratio $\mathrm{E}[\max_{k}\{\widehat{\mathrm{SR}}_{k}\}]$ that a researcher would get by chance after running $K$ trials of a false strategy (where the true SR is 0).

Theorem Equation:
$\mathrm{E}\left[\max _{k}\left\{\widehat{\mathrm{SR}}_{k}\right\}\right] \approx \sqrt{\mathrm{V}\left[\left\{\widehat{\mathrm{SR}}_{k}\right\}\right]} \left((1-\gamma) Z^{-1}\left[1-\frac{1}{K}\right]+\gamma Z^{-1}\left[1-\frac{1}{K e}\right]\right)$
(where $\gamma$ is the Euler-Mascheroni constant and $Z^{-1}$ is the inverse Gaussian CDF).
Implication: With unit variance across trials ( $\mathrm{V}[\{\widehat{\mathrm{SR}}_{k}\}]=1$ ), a researcher running 1,000 independent trials of a random strategy (true $\text{SR}=0$ ) should expect to find a "winner" with $\text{SR} \approx 3.26$ .
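The theorem is simple to evaluate numerically. The sketch below uses only the Python standard library; the function name is illustrative (it is not the RiskLabAI function listed in the API reference), and the unit standard deviation across trials is an assumption chosen to reproduce the 1,000-trial figure.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sr_under_null(n_trials: int, std_sr: float) -> float:
    """Expected maximum estimated SR across n_trials when every true SR is zero."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2 for the quantiles to be defined")
    z = NormalDist()  # standard Normal; inv_cdf plays the role of Z^{-1}
    quantile_term = (
        (1.0 - EULER_MASCHERONI) * z.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_MASCHERONI * z.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    )
    return std_sr * quantile_term


# Assumed unit standard deviation of SR estimates across trials.
for k in (10, 100, 1000):
    print(k, round(expected_max_sr_under_null(k, std_sr=1.0), 2))
```

The expected maximum grows only slowly with the number of trials, but it is already far from zero after a handful of attempts.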

Solutions for Quantifying Overfitting

The chapter provides two methods to test if a strategy's $\widehat{\text{SR}}$ is truly significant or just the result of multiple testing.

Step 1: Find the Effective Number of Trials ( $K$ )

A researcher may run 1,000 backtests, but many are correlated (not independent). We must find the effective number of independent trials, $\mathrm{E}[K]$ .

Solution: Use an ML clustering algorithm (like ONC) on the return series of all 1,000 backtests. The resulting number of clusters is the effective number of trials, $\mathrm{E}[K]$ .
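The idea of collapsing correlated backtests into a smaller number of effective trials can be illustrated with a much cruder method than ONC. The sketch below is not ONC: it is a greedy correlation-threshold grouping, swapped in purely for illustration. The function names, the 0.5 threshold, and the toy return series are all hypothetical.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def effective_number_of_trials(return_series: list[list[float]], min_corr: float = 0.5) -> int:
    """Count groups of backtests, where a series joins a group if it is
    sufficiently correlated with that group's first member (its leader)."""
    leaders: list[list[float]] = []
    for series in return_series:
        if not any(pearson(series, leader) >= min_corr for leader in leaders):
            leaders.append(series)  # no similar group yet, so start a new one
    return len(leaders)


# Toy data: four backtests, but only two distinct return patterns.
a = [0.01, -0.02, 0.03, -0.01, 0.02, -0.03]
b = [0.01, 0.02, -0.03, -0.01, 0.02, -0.01]
backtests = [a, [2 * r for r in a], b, [0.5 * r for r in b]]
print(effective_number_of_trials(backtests))
```

A proper clustering algorithm chooses the number of clusters from the data rather than from a fixed threshold, which is the reason the chapter relies on ONC.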

Step 2 (Option A): The Deflated Sharpe Ratio (DSR)

The DSR recalculates the p-value of the $\widehat{\text{SR}}$ by testing it against the correct null hypothesis (the expected maximum from the False Strategy Theorem) and adjusting for non-Normal returns.

DSR Equation: $\widehat{\mathrm{DSR}}=Z\left[\frac{\left(\widehat{\mathrm{SR}}-\mathrm{E}\left[\max _{k}\left\{\widehat{\mathrm{SR}}_{k}\right\}\right]\right) \sqrt{T-1}}{\sqrt{1-\hat{\gamma}_{3} \widehat{\mathrm{SR}}+\frac{\hat{\gamma}_{4}-1}{4} \widehat{\mathrm{SR}}^{2}}}\right]$
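The DSR equation maps onto a short function. This is a sketch with illustrative names rather than the RiskLabAI implementation; `sr_benchmark` stands for the expected maximum SR under the null, and the numbers in the example are hypothetical. The SR values are assumed to be non-annualized, at the frequency of the $T$ observations.

```python
import math
from statistics import NormalDist


def deflated_sharpe_ratio(
    sr_hat: float,
    sr_benchmark: float,
    t: int,
    skew: float = 0.0,
    kurt: float = 3.0,
) -> float:
    """Evaluate Z[.] at the deflated test statistic from the DSR equation."""
    numerator = (sr_hat - sr_benchmark) * math.sqrt(t - 1)
    denominator = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat**2)
    return NormalDist().cdf(numerator / denominator)


# Hypothetical inputs: daily SR of 0.10 over 1,250 days, against a benchmark
# (expected maximum under the null) of 0.05, with mildly non-Normal returns.
dsr = deflated_sharpe_ratio(0.10, 0.05, 1250, skew=-0.5, kurt=5.0)
print(round(dsr, 4))
```

Values close to 1 indicate that the estimate stands well above what selection among false strategies alone would produce; a value of 0.5 means the estimate merely matches the expected maximum.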

Step 2 (Option B): The Familywise Error Rate (FWER) for SR

This approach calculates the familywise-adjusted p-value ( $\alpha_K$ ) of the observed $\widehat{\text{SR}}$ , given the effective number of trials $\mathrm{E}[K]$ .

First, calculate the z-statistic for the observed $\widehat{\text{SR}}$ assuming the true $\text{SR}=0$ (using the Mertens distribution): $\hat{z}[0]$ .
Find the single-test p-value: $\alpha=1-Z[\hat{z}[0]]$ .
Apply the FWER formula using the estimated $\mathrm{E}[K]$ from clustering: $\alpha_{K}=1-(1-\alpha)^{\mathrm{E}[K]} = 1-Z[\hat{z}[0]]^{\mathrm{E}[K]}$

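This $\alpha_K$ is the probability of observing at least one $\widehat{\text{SR}}$ this extreme among $\mathrm{E}[K]$ independent trials when every true SR is zero, i.e., the familywise probability that the "discovered" strategy is a false positive.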
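The three steps can be chained as follows. This is a minimal sketch with illustrative function names (not the RiskLabAI signatures); the z-statistic is assumed to share the denominator of the DSR equation, and the example inputs, including the effective number of trials, are hypothetical.

```python
import math
from statistics import NormalDist


def sr_z_statistic(
    sr_hat: float, t: int, sr_null: float = 0.0, skew: float = 0.0, kurt: float = 3.0
) -> float:
    """z-statistic of the estimated SR against a hypothesized true SR."""
    numerator = (sr_hat - sr_null) * math.sqrt(t - 1)
    denominator = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat**2)
    return numerator / denominator


def familywise_type1_error(z: float, k: int = 1) -> float:
    """alpha_K = 1 - (1 - alpha)^k, with alpha the single-test p-value of z."""
    alpha = 1.0 - NormalDist().cdf(z)
    return 1.0 - (1.0 - alpha) ** k


# Hypothetical inputs: daily SR of 0.10 over 1,250 days, 10 effective trials.
z0 = sr_z_statistic(0.10, 1250, skew=-0.5, kurt=5.0)
print(familywise_type1_error(z0, k=1), familywise_type1_error(z0, k=10))
```

Comparing the two printed values shows how much of the apparent significance is consumed by the number of effective trials alone.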

API reference

RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):

Python	Julia
`def probabilistic_sharpe_ratio( observed_sharpe_ratio: float, benchmark_sharpe_ratio: float, number_of_returns: int, skewness_of_returns: float = 0.0, kurtosis_of_returns: float = 3.0, return_test_statistic: bool = False, ) -> float:`	`function probabilistic_sharpe_ratio( observed_sharpe_ratio::Real, benchmark_sharpe_ratio::Real, number_of_returns::Integer; skewness_of_returns::Real = 0.0, kurtosis_of_returns::Real = 3.0, return_test_statistic::Bool = false, )`
`def benchmark_sharpe_ratio(sharpe_ratio_estimates: list[float]) -> float:`	`function benchmark_sharpe_ratio(sharpe_ratio_estimates::AbstractVector{<:Real})`
`def expected_max_sharpe_ratio( n_trials: int, mean_sharpe_ratio: float, std_sharpe_ratio: float ) -> float:`	`function expected_max_sharpe_ratio( n_trials::Integer, mean_sharpe_ratio::Real, std_sharpe_ratio::Real, )`
`def estimated_sharpe_ratio_z_statistics( sharpe_ratio: float, t: int, true_sharpe_ratio: float = 0.0, skew: float = 0.0, kurt: int = 3, ) -> float:`	`function estimated_sharpe_ratio_z_statistics( sharpe_ratio::Real, t::Integer; true_sharpe_ratio::Real = 0.0, skew::Real = 0.0, kurt::Real = 3, )`
`def strategy_type1_error_probability(z: float, k: int = 1) -> float:`	`function strategy_type1_error_probability(z::Real; k::Integer = 1)`
`def theta_for_type2_error( sharpe_ratio: float, t: int, true_sharpe_ratio: float, skew: float = 0.0, kurt: int = 3, ) -> float:`	`function theta_for_type2_error( sharpe_ratio::Real, t::Integer, true_sharpe_ratio::Real; skew::Real = 0.0, kurt::Real = 3, )`
`def strategy_type2_error_probability(alpha_k: float, k: int, theta: float) -> float:`	`function strategy_type2_error_probability(alpha_k::Real, k::Integer, theta::Real)`
`def generate_max_sharpe_ratios( n_sims: int, n_trials_list: list[int], std_sharpe_ratio: float, mean_sharpe_ratio: float, ) -> pd.DataFrame:`	`function generate_max_sharpe_ratios( n_sims::Integer, n_trials_list::AbstractVector{<:Integer}, std_sharpe_ratio::Real, mean_sharpe_ratio::Real; rng::AbstractRNG = Random.default_rng(), )`

Full source: Python · Julia