Hyper-Parameter Tuning with Cross-Validation
Importance of Hyper-Parameter Tuning
Hyper-parameter tuning is crucial for optimizing machine learning (ML) algorithms, because a well-tuned model generalizes better to data it has not seen. Cross-validation (CV) plays a vital role in this, especially in finance, where conventional CV schemes often fall short: labels overlap in time, so information leaks between training and test sets. This post focuses on using the purged k-fold CV method for hyper-parameter optimization.
Purged-Kfold Integration into MLJBase
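Grid search, which exhaustively evaluates every combination in a parameter grid, is a reasonable first step when little is known about the data's underlying structure. In the Python code below, scikit-learn's GridSearchCV accepts a CV generator through its cv argument. Passing our PurgedKFold class there keeps the search from overfitting to information leaked between overlapping training and test labels.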
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.pipeline import Pipeline

    # MyPipeline and CrossValidatorController come from the RiskLabAI library
    # (their imports are omitted here).


    def clf_hyper_fit(
        feature_data: pd.DataFrame,
        label: pd.Series,
        times: pd.Series,
        pipe_clf: Pipeline,
        param_grid: dict,
        validator_type: str = 'purgedkfold',
        validator_params: dict = None,
        bagging: tuple = (0, -1, 1.),  # (n_estimators, max_samples, max_features)
        rnd_search_iter: int = 0,
        n_jobs: int = -1,
        **fit_params
    ) -> Pipeline:
        # np.unique flattens the labels, so this check works for a Series or an array
        if set(np.unique(label)) == {0, 1}:
            scoring = 'f1'  # F1-score for meta-labeling
        else:
            scoring = 'neg_log_loss'  # Symmetric towards all cases

        if validator_params is None:
            validator_params = {
                'times': times,
                'n_splits': 5,
                'embargo': 0.01,
            }

        # Hyperparameter search on train data
        inner_cv = CrossValidatorController(
            validator_type,
            **validator_params
        ).cross_validator

        if rnd_search_iter == 0:
            gs = GridSearchCV(estimator=pipe_clf, param_grid=param_grid, scoring=scoring,
                              cv=inner_cv, n_jobs=n_jobs)
        else:
            gs = RandomizedSearchCV(estimator=pipe_clf, param_distributions=param_grid, scoring=scoring,
                                    cv=inner_cv, n_jobs=n_jobs, n_iter=rnd_search_iter)
        gs = gs.fit(feature_data, label, **fit_params).best_estimator_  # Pipeline

        # Fit validated model on the entirety of the data
        if bagging[1] > 0:
            base = MyPipeline(gs.steps)
            # Key under which the caller passed sample weights for the final step
            weight_key = base.steps[-1][0] + '__sample_weight'
            # scikit-learn >= 1.2 names this argument `estimator`;
            # older releases call it `base_estimator`.
            gs = BaggingClassifier(estimator=base, n_estimators=int(bagging[0]),
                                   max_samples=float(bagging[1]), max_features=float(bagging[2]),
                                   n_jobs=n_jobs)
            gs = gs.fit(feature_data, label, sample_weight=fit_params[weight_key])
            gs = Pipeline([('bag', gs)])
        return gs
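The CV generator is what keeps leaked information out of the search. As a rough illustration of the idea, and not the RiskLabAI PurgedKFold implementation, the sketch below builds purged k-fold splits with an embargo using NumPy only. The function name `purged_kfold_splits` and the numeric `start`/`end` label times are assumptions made for this example. It also assumes observations are sorted by `start` and that every label ends at or after it starts.

```python
import numpy as np


def purged_kfold_splits(start, end, n_splits=5, embargo=0.01):
    """Yield (train_idx, test_idx) pairs with purging and an embargo.

    start, end: label start/end times per observation (hypothetical inputs),
    sorted by start. embargo is a fraction of the sample size.
    """
    start = np.asarray(start)
    end = np.asarray(end)
    n = len(start)
    indices = np.arange(n)
    embargo_size = int(n * embargo)

    for test_idx in np.array_split(indices, n_splits):
        test_start = start[test_idx[0]]
        test_end = end[test_idx].max()

        # Purge: keep only earlier observations whose labels end before the test block starts
        before = indices[end < test_start]

        # Keep later observations that start after every test label has ended,
        # then skip a further `embargo_size` observations
        first_after = np.searchsorted(start, test_end, side="right")
        after = indices[first_after + embargo_size:]

        yield np.concatenate([before, after]), test_idx


# Toy data: 100 observations, each label spans 3 time units
start = np.arange(100.0)
end = start + 3.0
splits = list(purged_kfold_splits(start, end, n_splits=5, embargo=0.02))
```

Each yielded pair holds training and test indices, and scikit-learn's `cv` argument accepts an iterable of such pairs, so a list like `splits` could be passed to GridSearchCV directly.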
Non-Negative Parameters
Some ML algorithms take only non-negative hyper-parameters, such as C in the SVC classifier and gamma in the RBF kernel. The effect of such a parameter typically varies over orders of magnitude, so sampling it from a log-uniform distribution is often more effective than sampling from a uniform one, which would place most draws at the largest scale.
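A random variable $x$ follows a log-uniform distribution between $a > 0$ and $b > a$ if and only if $\log x \sim U[\log a, \log b]$. Its CDF and PDF are:

$$F(x) = \begin{cases} 0 & x < a \\ \dfrac{\log(x/a)}{\log(b/a)} & a \le x \le b \\ 1 & x > b \end{cases}$$

$$f(x) = \begin{cases} \dfrac{1}{x \log(b/a)} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$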
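As a small sketch of log-uniform sampling, the snippet below draws values by exponentiating uniform draws on the log scale and checks them against the CDF. The helper names `sample_log_uniform` and `log_uniform_cdf` and the bounds 0.01 and 100 are chosen only for illustration.

```python
import numpy as np


def sample_log_uniform(a, b, size, seed=0):
    """Draw `size` samples x with log(x) uniform on [log(a), log(b)]."""
    if not 0 < a < b:
        raise ValueError("require 0 < a < b")
    rng = np.random.default_rng(seed)
    return np.exp(rng.uniform(np.log(a), np.log(b), size=size))


def log_uniform_cdf(x, a, b):
    """CDF of the log-uniform distribution on [a, b]."""
    clipped = np.clip(np.asarray(x, dtype=float), a, b)
    return np.log(clipped / a) / np.log(b / a)


a, b = 1e-2, 1e2
samples = sample_log_uniform(a, b, size=10_000)

# Half of the probability mass lies below the geometric mean sqrt(a * b) = 1
share_below_one = float(np.mean(samples < 1.0))
cdf_at_one = float(log_uniform_cdf(1.0, a, b))
```

For a randomized search, `scipy.stats.loguniform(a, b)` provides the same distribution as a ready-made object that can be placed in the `param_distributions` dictionary of RandomizedSearchCV.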
Limitations of Accuracy as a Measure
Accuracy alone doesn't provide a meaningful evaluation in finance-related ML, particularly in investment strategies. It ignores the probabilities associated with predictions, so a wrong prediction made with high confidence counts the same as one made with low confidence. Cross-entropy loss, or log loss, is a better performance metric as it incorporates prediction probabilities.
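The formula for log loss is:

$$L[Y, P] = -\frac{1}{N} \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} y_{n,k} \log p_{n,k}$$

where $p_{n,k}$ is the predicted probability that observation $n$ has label $k$, and $y_{n,k}$ equals 1 if observation $n$ was assigned label $k$ and 0 otherwise.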
Accuracy doesn't suffice for hyperparameter tuning in financial applications. It should ideally be supplemented or replaced with metrics that better capture the complexities of financial decision-making.
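To illustrate the gap between the two metrics, the sketch below scores two hypothetical classifiers on the same five binary labels. The labels and probabilities are made up for this example. Both classifiers make the same single mistake, so their accuracy is identical, but one makes it with far higher confidence and is penalized by log loss.

```python
import numpy as np


def accuracy(y_true, p_pos):
    """Share of correct predictions when thresholding probabilities at 0.5."""
    return float(np.mean((p_pos >= 0.5).astype(int) == y_true))


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Average negative log-likelihood for binary labels."""
    p = np.clip(p_pos, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


y_true = np.array([1, 0, 1, 1, 0])

# Both classifiers miss the fourth observation
p_cautious = np.array([0.60, 0.40, 0.60, 0.45, 0.40])       # wrong with low confidence
p_overconfident = np.array([0.95, 0.05, 0.95, 0.01, 0.05])  # wrong with high confidence

acc_cautious = accuracy(y_true, p_cautious)
acc_overconfident = accuracy(y_true, p_overconfident)
loss_cautious = binary_log_loss(y_true, p_cautious)
loss_overconfident = binary_log_loss(y_true, p_overconfident)

print(f"accuracy: {acc_cautious:.2f} vs {acc_overconfident:.2f}")
print(f"log loss: {loss_cautious:.3f} vs {loss_overconfident:.3f}")
```

In a strategy that sizes positions by predicted probability, the confident mistake is the costly one, which is why the search above scores with `neg_log_loss` outside the meta-labeling case.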
Note: All these functionalities are available in both Python and Julia in the RiskLabAI library.
References
- López de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons.
- López de Prado, M. (2020). Machine Learning for Asset Managers. Cambridge University Press.