Financial Labels (Trend Scanning)

This chapter explains why labeling is a critical step in supervised machine learning for finance. The way labels (the $y$ variable) are defined determines the exact task the algorithm will learn. Poor labeling can lead an ML model to fail, even if the features ( $X$ ) are predictive. The chapter discusses four common labeling strategies.

Fixed-Horizon Method

This is the most common method in the academic literature, but it has serious flaws. It labels each observation by comparing the price return ( $r$ ) over a fixed time horizon ( $h$ ) against a fixed threshold ( $\tau$ ).

Labeling Equation: $y_{i}=\begin{cases} -1 & \text{if } r_{t_{i,0}, t_{i,1}} < -\tau, \\ 0 & \text{if } |r_{t_{i,0}, t_{i,1}}| \leq \tau, \\ 1 & \text{if } r_{t_{i,0}, t_{i,1}} > \tau \end{cases}$
Key Flaws:
1. Heteroscedasticity: When used with time bars, the fixed threshold $\tau$ ignores the fact that volatility varies over time (e.g., it is higher at the open). The same $\tau$ yields mostly 0 labels in quiet periods and mostly $\pm 1$ labels in volatile ones, so the label distribution is non-stationary.
2. No Path-Dependency: It ignores how the price got to the end point. A real strategy might have been stopped out long before the horizon $h$ was reached.
3. Impractical: Investors rarely care about a price at a specific future time, but rather if a move will happen within a certain window.
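
The labeling equation above can be sketched in a few lines of plain Python. This is only an illustration: the function name `fixed_horizon_labels`, its parameters, and the sample prices are hypothetical and are not part of RiskLabAI. It assumes simple (not log) returns and bar-indexed prices.

```python
def fixed_horizon_labels(prices, horizon, tau):
    """Label each bar by its `horizon`-bar simple return against threshold `tau`.

    Hypothetical helper for illustration; bars without a full horizon
    ahead of them are left unlabeled.
    """
    labels = []
    for i in range(len(prices) - horizon):
        r = prices[i + horizon] / prices[i] - 1.0  # return over the fixed horizon
        if r > tau:
            labels.append(1)
        elif r < -tau:
            labels.append(-1)
        else:
            labels.append(0)  # |r| <= tau
    return labels


# Made-up prices, purely to exercise the function.
prices = [100.0, 101.0, 99.0, 103.0, 102.0, 98.0]
labels = fixed_horizon_labels(prices, horizon=2, tau=0.015)
print(labels)
```

Note that only the prices at the two endpoints enter the computation, which is exactly the lack of path-dependency described in flaw 2.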

Triple-Barrier Method

This is a more realistic method that simulates an actual trading strategy. A label is assigned based on the first of three barriers to be touched:

Upper Barrier: A profit-taking target (Label 1).
Lower Barrier: A stop-loss limit (Label -1).
Vertical Barrier: A maximum holding period (e.g., $h$ bars). If this is hit first, the label is either 0 or the sign of the return at that time.

Advantages:
- It is path-dependent, accurately reflecting that a position can be stopped out.
- It directly models the P&L of a trade, making the labels relevant to a real-world investment.
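
A minimal sketch of the first-touch logic for a single long position follows. The function `triple_barrier_label`, its arguments, and the example prices are assumptions made for illustration; they do not reproduce the RiskLabAI `triple_barrier` function, and the barriers here are expressed as fixed return fractions rather than volatility-scaled targets.

```python
def triple_barrier_label(prices, start, upper, lower, max_holding,
                         zero_on_timeout=True):
    """Return (label, exit_index) for a long position opened at bar `start`.

    `upper` and `lower` are positive return fractions for the profit-taking
    and stop-loss barriers; `max_holding` is the vertical barrier in bars.
    Hypothetical helper for illustration.
    """
    entry = prices[start]
    end = min(start + max_holding, len(prices) - 1)
    for j in range(start + 1, end + 1):
        r = prices[j] / entry - 1.0
        if r >= upper:       # profit-taking barrier touched first
            return 1, j
        if r <= -lower:      # stop-loss barrier touched first
            return -1, j
    # Vertical barrier reached without touching either horizontal barrier.
    if zero_on_timeout:
        return 0, end
    r = prices[end] / entry - 1.0
    return (r > 0) - (r < 0), end


# Made-up path: dips 3% before finishing 5% higher.
prices = [100.0, 101.0, 97.0, 104.0, 105.0]
stopped = triple_barrier_label(prices, 0, upper=0.03, lower=0.02, max_holding=4)
timed_out = triple_barrier_label(prices, 0, upper=0.10, lower=0.10, max_holding=3)
print(stopped, timed_out)
```

In the first call the position is stopped out at bar 2 even though the price ends the window higher, a case a fixed-horizon label over the same four bars would mark as a gain.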

Trend-Scanning Method

This is a novel method that avoids setting arbitrary barriers. Instead, it labels every point based on the most statistically significant trend that starts after it.

For an observation $x_t$ , it fits a separate linear regression over each candidate look-forward period $L$ (e.g., $L=10$ bars, $L=11$ bars, ...):
$x_{t+l}=\beta_{0}+\beta_{1} l+\varepsilon_{t+l}, \quad l=0, \ldots, L-1$
It calculates the t-value for the trend coefficient ( $\hat{t}_{\hat{\beta}_{1}}$ ) for each look-forward period $L$ .
The strongest trend (the one with the maximum absolute t-value, max(|tVal|) ) is selected.
The observation is labeled with the sign of that trend (e.g., bin = sgn(tVal)).

Advantages:
- It lets the data define the trend's duration, rather than a fixed $h$ .
- It produces both a binary label (for classification) and a t-value (for regression or weighting), which captures the strength of the trend.
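
The procedure can be sketched with ordinary least squares written out by hand. The names `slope_t_value` and `trend_scan_label`, the candidate horizons, and the price series are hypothetical; this is a plain-Python illustration under those assumptions, not the RiskLabAI implementation of `find_trend_using_trend_scanning`.

```python
import math


def slope_t_value(y):
    """t-value of the slope in an OLS fit of y[l] on l = 0, ..., len(y) - 1."""
    n = len(y)
    if n < 3:
        raise ValueError("need at least 3 points to estimate a slope t-value")
    x_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((l - x_mean) ** 2 for l in range(n))
    sxy = sum((l - x_mean) * (v - y_mean) for l, v in enumerate(y))
    beta1 = sxy / sxx
    beta0 = y_mean - beta1 * x_mean
    ssr = sum((v - beta0 - beta1 * l) ** 2 for l, v in enumerate(y))
    se = math.sqrt(ssr / (n - 2) / sxx)  # standard error of the slope
    if se == 0.0:
        # Perfect fit: the t-value is unbounded (or 0 for a flat series).
        return math.copysign(math.inf, beta1) if beta1 != 0 else 0.0
    return beta1 / se


def trend_scan_label(prices, t, horizons):
    """Return (label, t_value, L) for the horizon with the largest |t-value|."""
    best_t, best_L = 0.0, None
    for L in horizons:
        if t + L > len(prices):
            continue  # not enough future bars for this horizon
        t_val = slope_t_value(prices[t:t + L])
        if best_L is None or abs(t_val) > abs(best_t):
            best_t, best_L = t_val, L
    if best_L is None:
        return None
    return (best_t > 0) - (best_t < 0), best_t, best_L


# Made-up, noisy upward series.
prices = [100.0, 101.0, 103.0, 102.0, 105.0, 107.0,
          106.0, 109.0, 111.0, 110.0, 113.0, 115.0]
up = trend_scan_label(prices, t=0, horizons=range(4, 9))
down = trend_scan_label(prices[::-1], t=0, horizons=range(4, 9))
print(up, down)
```

The returned t-value can be kept alongside the label, for example as a sample weight, in line with the second advantage listed above.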

Meta-Labeling

This technique is used to determine bet size and reduce false positives, not to determine the side of the trade.

Process:
1. A Primary Model (which can be any model, e.g., trend-scanning) determines the side (buy/sell).
2. A Secondary Model (the "meta-model") is trained only on the events for which the primary model produced a signal. Its goal is to predict the probability that each signal turns out to be profitable.
  - True Positives (wins) are labeled 1.
  - False Positives (losses) are labeled 0.
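
As a rough illustration of how the meta-model's targets could be built, the sketch below assumes the primary model's sides and the realized returns of each event are already available as two aligned lists. The function name `meta_labels` and the data are hypothetical, not the RiskLabAI `meta_labeling` function.

```python
def meta_labels(sides, returns):
    """Build (event_index, label) pairs for the secondary model.

    `sides` holds the primary model's calls (+1 buy, -1 sell, 0 no signal);
    `returns` holds the realized return of the underlying over each event.
    Events without a signal are skipped. Hypothetical helper for illustration.
    """
    labeled = []
    for i, (side, ret) in enumerate(zip(sides, returns)):
        if side == 0:
            continue  # the primary model did not trade, nothing to judge
        pnl = side * ret  # return of the trade in the predicted direction
        labeled.append((i, 1 if pnl > 0 else 0))
    return labeled


# Made-up signals and outcomes.
sides = [1, -1, 0, 1, -1]
returns = [0.02, 0.01, -0.03, -0.01, -0.04]
targets = meta_labels(sides, returns)
print(targets)
```

The secondary model would then be fit on features for events 0, 1, 3 and 4 only, with these 0/1 values as its targets.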

The output of the meta-model is a probability $p$ that the primary model's signal is correct. This $p$ can then be used to size the bet.

Bet Sizing from Probability: The probability $p$ is converted into a z-score (or t-statistic for an ensemble) to measure its confidence relative to a random guess ( $p=0.5$ ). This z-score is then mapped to a bet size $m$ .
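- Sharpe Ratio of a Bet: For a bet with symmetric payoff $\pm 1$ that wins with probability $p$ , the expected payoff is $\mu=2p-1$ and the standard deviation is $\sigma=2\sqrt{p(1-p)}$ , so $z=\frac{\mu}{\sigma}=\frac{p-0.5}{\sqrt{p(1-p)}}$
- Bet Size: $m = 2Z[z] - 1$ , where $Z$ is the standard Gaussian CDF, so that $m \in [-1, 1]$ and $m=0$ at $p=0.5$ .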
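
The two formulas translate directly into code. The sketch below uses only the Python standard library; the function name `bet_size` is an assumption for illustration, and it covers the single-classifier z-score case only, not the ensemble t-statistic variant.

```python
import math
from statistics import NormalDist


def bet_size(p):
    """Map a predicted success probability p in (0, 1) to a bet size in (-1, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = (p - 0.5) / math.sqrt(p * (1.0 - p))  # Sharpe ratio of the bet
    return 2.0 * NormalDist().cdf(z) - 1.0    # m = 2 Z[z] - 1


for p in (0.5, 0.6, 0.9):
    print(p, round(bet_size(p), 4))
```

A probability of 0.5 maps to a zero bet, and the size grows monotonically toward 1 as $p$ approaches 1. With meta-labeling, the side comes from the primary model and this value sets only the magnitude.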

API reference

RiskLabAI implements these in Python and Julia (signatures auto-generated from the package source):

Python	Julia
`def symmetric_cusum_filter(prices: pd.Series, threshold: float) -> pd.DatetimeIndex:`	`function symmetric_cusum_filter( close_index::AbstractVector, close::AbstractVector{<:Real}, threshold::Real, )`
`def daily_volatility_with_log_returns(close: pd.Series, span: int = 100) -> pd.Series:`	`function daily_volatility_with_log_returns( close_index::AbstractVector, close::AbstractVector{<:Real}; span::Integer = 100, )`
`def vertical_barrier( close: pd.Series, time_events: pd.DatetimeIndex, number_days: int ) -> pd.Series:`	`function vertical_barrier( close_index::AbstractVector, time_events::AbstractVector, number_days::Integer, )`
`def triple_barrier( close: pd.Series, events: pd.DataFrame, ptsl: list[float], molecule: list[pd.Timestamp], ) -> pd.DataFrame:`	`function triple_barrier( close_index::AbstractVector, close::AbstractVector{<:Real}, events::DataFrame, ptsl, )`
`def meta_events( close: pd.Series, time_events: pd.DatetimeIndex, ptsl: list[float], target: pd.Series, return_min: float, num_threads: int, vertical_barrier_times: Optional[pd.Series] = None, side: Optional[pd.Series] = None, ) -> pd.DataFrame:`	`function meta_events( close_index::AbstractVector, close::AbstractVector{<:Real}, time_events::AbstractVector, ptsl, target::AbstractDict, return_min::Real; vertical_barriers::Union{Nothing,AbstractDict} = nothing, side::Union{Nothing,AbstractDict} = nothing, )`
`def meta_labeling(events: pd.DataFrame, close: pd.Series) -> pd.DataFrame:`	`function meta_labeling( events::DataFrame, close_index::AbstractVector, close::AbstractVector{<:Real}, )`
`def calculate_t_value_linear_regression(prices: pd.Series) -> float:`	`function calculate_t_value_linear_regression(prices::AbstractVector{<:Real})`
`def find_trend_using_trend_scanning( molecule: pd.Index, close: pd.Series, span: tuple[int, int] ) -> pd.DataFrame:`	`function find_trend_using_trend_scanning( molecule::AbstractVector, close_index::AbstractVector, close::AbstractVector{<:Real}, span::Tuple{Integer,Integer}, )`

Full source: Python · Julia