
Hyper-Parameter Tuning with Cross-Validation


Importance of Hyper-Parameter Tuning

Hyperparameter tuning is crucial for optimizing machine learning (ML) algorithms, and effective tuning translates into better real-world performance. Cross-validation (CV) plays a vital role in this process, especially in finance, where conventional CV approaches often fall short because information leaks between overlapping training and testing observations. This post focuses on using the purged k-fold CV method for hyperparameter optimization.

Purged K-Fold Integration into MLJBase

For hyperparameter tuning, a grid search is often an initial step toward understanding the data's underlying structure. In scikit-learn, GridSearchCV performs this search with a CV generator; to avoid overfitting caused by leaked information, our PurgedKFold class can be passed in as that generator. The Julia implementation integrates the same purged CV into MLJBase.

hyper_parameter_tuning.py
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# MyPipeline and CrossValidatorController are assumed to come from the
# RiskLabAI library alongside this function.

def clf_hyper_fit(
    feature_data: pd.DataFrame,
    label: pd.Series,
    times: pd.Series,
    pipe_clf: Pipeline,
    param_grid: dict,
    validator_type: str = 'purgedkfold',
    validator_params: dict = None,
    bagging: list = [0, -1, 1.],
    rnd_search_iter: int = 0,
    n_jobs: int = -1,
    **fit_params
) -> MyPipeline:
    # Binary meta-labels call for the F1-score; otherwise use negative
    # log loss, which is symmetric towards all classes.
    if set(label.values) == {0, 1}:
        scoring = 'f1'  # F1-score for meta-labeling
    else:
        scoring = 'neg_log_loss'  # Symmetric towards all cases

    if validator_params is None:
        validator_params = {
            'times': times,
            'n_splits': 5,
            'embargo': 0.01,
        }

    # Build the purged (and embargoed) cross-validation generator
    inner_cv = CrossValidatorController(
        validator_type,
        **validator_params
    ).cross_validator

    # Hyperparameter search on train data
    if rnd_search_iter == 0:
        gs = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, scoring=scoring,
                          cv=inner_cv, n_jobs=n_jobs)
    else:
        gs = RandomizedSearchCV(estimator=pipe_clf, param_distributions=param_grid, scoring=scoring,
                                cv=inner_cv, n_jobs=n_jobs, n_iter=rnd_search_iter)
    gs = gs.fit(feature_data, label, **fit_params).best_estimator_  # Pipeline

    # Optionally bag the validated pipeline and refit it on the entirety of the data
    if bagging[1] > 0:
        gs = BaggingClassifier(base_estimator=MyPipeline(gs.steps), n_estimators=int(bagging[0]),
                               max_samples=float(bagging[1]), max_features=float(bagging[2]), n_jobs=n_jobs)
        gs = gs.fit(feature_data, label,
                    sample_weight=fit_params[gs.base_estimator.steps[-1][0] + '__sample_weight'])
        gs = Pipeline([('bag', gs)])

    return gs

View More: Julia | Python
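
As a rough usage sketch with synthetic data and a simple SVC pipeline (the data, pipeline layout, and parameter grid below are illustrative assumptions, not part of the library), the function can be called as follows:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic, datetime-indexed data: the purged CV needs event end-times
# to know which observations overlap.
index = pd.date_range('2020-01-01', periods=100, freq='D')
features = pd.DataFrame(np.random.randn(100, 4), index=index)
labels = pd.Series(np.random.randint(0, 2, 100), index=index)
event_times = pd.Series(index + pd.Timedelta(days=2), index=index)

pipe_clf = Pipeline([('svc', SVC(probability=True))])
param_grid = {'svc__C': [0.01, 0.1, 1, 10, 100],
              'svc__gamma': [0.01, 0.1, 1, 10, 100]}

# Grid search with purged 5-fold CV and a 1% embargo (the defaults above)
best_model = clf_hyper_fit(features, labels, event_times, pipe_clf, param_grid)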

Non-Negative Parameters

Some ML algorithms accept only non-negative hyperparameters, for example the C parameter of the SVC classifier and the gamma parameter of the RBF kernel. For such parameters, which typically matter on the scale of orders of magnitude, sampling from a log-uniform distribution is often more effective than sampling from a uniform one.

For a variable x to follow a log-uniform distribution between a > 0 and b > a, its CDF and PDF can be defined as:

F[x]=\left\{\begin{array}{cl} \frac{\log[x]-\log[a]}{\log[b]-\log[a]} & \text{for } a \leq x \leq b \\ 0 & \text{for } x<a \\ 1 & \text{for } x>b \end{array}\right.

f[x]=\left\{\begin{array}{cl} \frac{1}{x \log[b/a]} & \text{for } a \leq x \leq b \\ 0 & \text{for } x<a \\ 0 & \text{for } x>b \end{array}\right.
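
As a sketch of how such a prior can be used in practice, scipy ships a log-uniform distribution that plugs directly into a randomized search; the pipeline step name 'svc' and the bounds below are illustrative assumptions.

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe_clf = Pipeline([('svc', SVC(probability=True))])

# Log-uniform priors: every order of magnitude between the bounds is sampled
# with equal probability, whereas a uniform prior would place nearly all
# draws close to the upper bound.
param_distributions = {
    'svc__C': loguniform(1e-2, 1e2),
    'svc__gamma': loguniform(1e-2, 1e2),
}

# These distributions can be passed to RandomizedSearchCV directly, or as
# param_grid to clf_hyper_fit with rnd_search_iter > 0.
rs = RandomizedSearchCV(pipe_clf, param_distributions, n_iter=25,
                        scoring='neg_log_loss', cv=5)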

Limitations of Accuracy as a Measure

Accuracy alone does not provide a meaningful evaluation in finance-related ML, particularly for investment strategies, because it ignores the probabilities attached to each prediction: a confidently wrong prediction counts the same as a marginal miss. Cross-entropy loss, or log loss, is a better performance metric because it incorporates the predicted probabilities.

The formula for log loss is:

L[Y,P]=log[Prob[YP]]=N1n=0N1k=0K1yn,klog[pn,k]L[Y, P]=-\log[\text{Prob}[Y \mid P]]=-N^{-1} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} y_{n, k} \log[p_{n, k}]

Accuracy therefore does not suffice for hyperparameter tuning in financial applications; it should be supplemented or replaced with metrics, such as negative log loss, that better capture the complexities of financial decision-making.
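
As a small, self-contained illustration (the probabilities below are made up), two classifiers with identical accuracy can have very different log losses once the confidence of their mistakes is taken into account:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = [1, 0, 1, 1]

# Two classifiers that miss the same observation, but one misses it with
# near-total confidence.
prob_cautious = np.array([[0.4, 0.6], [0.6, 0.4], [0.6, 0.4], [0.4, 0.6]])
prob_reckless = np.array([[0.4, 0.6], [0.6, 0.4], [0.99, 0.01], [0.4, 0.6]])

print(accuracy_score(y_true, prob_cautious.argmax(axis=1)),
      accuracy_score(y_true, prob_reckless.argmax(axis=1)))  # 0.75 and 0.75
print(log_loss(y_true, prob_cautious),
      log_loss(y_true, prob_reckless))  # about 0.61 vs 1.53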

Note: All these functionalities are available in both Python and Julia in the RiskLabAI library. You can view more here for Python and here for Julia.

References

  1. De Prado, M. L. (2018). Advances in financial machine learning. John Wiley & Sons.
  2. De Prado, M. L. (2020). Machine learning for asset managers. Cambridge University Press.