_images/lrboost.png

Welcome to lrboost’s documentation!

lrboost is a scikit-learn compatible package that implements a simple stacking protocol for prediction. Additional machine learning utilities are also included!

Getting Started

LRBoostRegressor works in three steps.

  • Fit a linear model to a target y
    • This is the primary model, accessible via lrboost.primary_model

  • Fit a tree-based model to the residual (y_pred - y) of the linear model
    • This is the secondary model, accessible via lrboost.secondary_model

  • Combine the two predictions into a final prediction on the scale of the original target

LRBoostRegressor defaults to sklearn.linear_model.RidgeCV() and sklearn.ensemble.HistGradientBoostingRegressor() as the linear (primary) and non-linear (secondary) models, respectively.

>>> from sklearn.datasets import load_diabetes
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> lrb = LRBoostRegressor().fit(X, y)
>>> predictions = lrb.predict(X)
>>> detailed_predictions = lrb.predict(X, detail=True)
>>> print(lrb.primary_model.score(X, y))  # R2
0.512
>>> print(lrb.score(X, y))  # R2
0.933
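
Conceptually, the fit and predict calls above amount to the manual two-stage procedure sketched below. This is illustrative only (it follows the residual convention described in the steps above); lrboost carries these steps out internally in fit() and predict().

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> primary = RidgeCV().fit(X, y)  # step 1: linear model fit to the target y
>>> primary_pred = primary.predict(X)
>>> secondary = HistGradientBoostingRegressor().fit(X, primary_pred - y)  # step 2: tree model fit to the residual
>>> final_pred = primary_pred - secondary.predict(X)  # step 3: combine on the scale of the original target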

Both the linear and non-linear models are fit in the fit() method and then used to predict on new data. Because lrboost is a very lightly modified scikit-learn class, you can tune the hyperparameters of the tree model as you normally would.

  • predict(X) returns an array-like of final predictions; predict(X, detail=True) instead returns a dictionary containing the primary and final predictions

  • predict_dist(X) provides probabilistic predictions when NGBoost or XGBoost-Distribution is used as the non-linear (secondary) estimator; see the sketch below
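
A rough sketch of probabilistic prediction, assuming the ngboost package is installed and its NGBRegressor is used as the secondary model (the exact object returned by predict_dist depends on the secondary estimator):

>>> from ngboost import NGBRegressor
>>> from sklearn.datasets import load_diabetes
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> lrb = LRBoostRegressor(secondary_model=NGBRegressor()).fit(X, y)
>>> point_predictions = lrb.predict(X)       # point predictions, as before
>>> dist_predictions = lrb.predict_dist(X)   # probabilistic predictions from the NGBoost stage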

Any sklearn compatible estimator can be used with lrboost, and you can unpack kwargs as needed.

>>> import numpy as np
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import RidgeCV
>>> from sklearn.ensemble import RandomForestRegressor
>>> from lrboost import LRBoostRegressor
>>> X, y = load_iris(return_X_y=True)
>>> ridge_args = {"alphas": np.logspace(-4, 3, 10, endpoint=True),
                 "cv": 5}
>>> rf_args = {"n_estimators": 50,
              "n_jobs": -1}
>>> lrb = LRBoostRegressor(primary_model=RidgeCV(**ridge_args), secondary_model=RandomForestRegressor(**rf_args))
>>> lrb = lrb.fit(X, y)
>>> predictions = lrb.predict(X)

  • lrboost will not magically reduce prediction error in every circumstance.

  • It may be particularly useful in situations that require extrapolation outside of the training data.

Hyperparameter Tuning
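
Because any scikit-learn compatible estimator can be passed as the secondary model, one simple tuning approach is to wrap the tree model in a search object before handing it to LRBoostRegressor; the inner search is then fit on the residuals. A minimal sketch follows (the grid values are placeholders, not recommended settings):

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import HistGradientBoostingRegressor
>>> from sklearn.model_selection import GridSearchCV
>>> from lrboost import LRBoostRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> param_grid = {"max_iter": [100, 300, 500],
                  "max_depth": [None, 4, 8]}
>>> search = GridSearchCV(HistGradientBoostingRegressor(), param_grid, cv=5)
>>> lrb = LRBoostRegressor(secondary_model=search).fit(X, y)  # the search runs on the residuals
>>> predictions = lrb.predict(X)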

Model Comparison - Example 1

  • This is a (simplified) example of predicting clutch minutes from non-clutch minutes in NBA basketball.

  • The relationship combines known linear and non-linear components, and extrapolation can be difficult.

_images/clutch.png
  • Using the train/test split shown above (training on players with between 750 and 4,000 non-clutch minutes), we can evaluate an extrapolation task on the tails of the distribution.

    >>> import pandas as pd
    >>> import numpy as np
    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.ensemble import HistGradientBoostingRegressor
    >>> from lrboost import LRBoostRegressor
    >>> clutch = pd.read_csv('../examples/clutch.csv')
    >>> train_mask = (clutch['nonclutch_min'] <= 4000) & (clutch['nonclutch_min'] >= 750)
    >>> train = clutch[train_mask]
    >>> test = clutch[~train_mask]
    >>> X_train = train[['nonclutch_min']]
    >>> y_train = train['clutch_min']
    >>> X_test = test[['nonclutch_min']]
    >>> y_test = test['clutch_min']
    >>> gbm = HistGradientBoostingRegressor(max_iter=500, random_state=42).fit(X_train, y_train)
    >>> lrb = LRBoostRegressor(secondary_model=HistGradientBoostingRegressor(max_iter=500, random_state=42)).fit(X_train, y_train)
    >>> print(f"Ridge RMSE: {round(mean_squared_error(lrb.primary_model.predict(X_test), y_test), 2)}")
    >>> print(f"HistGradientBoostingRegressor RMSE: {round(mean_squared_error(gbm.predict(X_test), y_test), 2)}")
    >>> print(f"lrboost RMSE: {round(mean_squared_error(lrb.predict(X_test), y_test), 2)}")
    >>> Ridge RMSE: 1385.81
    >>> HistGradientBoostingRegressor RMSE: 3145.87
    >>> lrboost RMSE: 1080.42
    

    If we also use a conventional random train/test split, lrboost still improves on the standalone GBDT, although the linear model alone performs best here.
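
    One plausible way to construct such a split is sketched below; the 75/25 fraction and random seed are illustrative assumptions and not necessarily the settings behind the reported errors that follow.

    >>> from sklearn.model_selection import train_test_split
    >>> # Reuses the clutch DataFrame and estimators imported in the block above.
    >>> X_train, X_test, y_train, y_test = train_test_split(clutch[['nonclutch_min']],
                                                            clutch['clutch_min'],
                                                            train_size=0.75, random_state=42)
    >>> gbm = HistGradientBoostingRegressor(max_iter=500, random_state=42).fit(X_train, y_train)
    >>> lrb = LRBoostRegressor(secondary_model=HistGradientBoostingRegressor(max_iter=500, random_state=42)).fit(X_train, y_train)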

    Ridge MSE: 570.01
    HistGradientBoostingRegressor MSE: 743.66
    lrboost MSE: 733.4
    

Model Comparison - Example 2

  • The following are some simple examples taken from Zhang et al. (2019)

    >>> import pandas as pd
    >>> import numpy as np
    >>> from sklearn.metrics import mean_squared_error
    >>> from sklearn.ensemble import HistGradientBoostingRegressor
    >>> from sklearn.linear_model import RidgeCV
    >>> from sklearn.model_selection import train_test_split
    >>> from lrboost import LRBoostRegressor
    >>> concrete = pd.read_csv("../examples/concrete_data.csv")
    >>> features = ['cement', 'slag', 'fly_ash', 'water', 'superplastic', 'coarse_agg', 'fine_agg', 'age', 'cw_ratio']
    >>> target = 'ccs'
    >>> def evaluate_models(X_train, X_test, y_train, y_test):
    ...     lrb = LRBoostRegressor(primary_model=RidgeCV(alphas=np.logspace(-4, 3, 10, endpoint=True)))
    ...     lrb.fit(X_train, y_train.ravel())
    ...     detailed_predictions = lrb.predict(X_test, detail=True)
    ...     primary_predictions = detailed_predictions['primary_prediction']
    ...     lrb_predictions = detailed_predictions['final_prediction']
    ...     hgb = HistGradientBoostingRegressor()
    ...     hgb.fit(X_train, y_train.ravel())
    ...     hgb_predictions = hgb.predict(X_test)
    ...     print(f"Ridge MSE: {round(mean_squared_error(y_test.ravel(), primary_predictions), 2)}")
    ...     print(f"HistGradientBoostingRegressor MSE: {round(mean_squared_error(y_test.ravel(), hgb_predictions), 2)}")
    ...     print(f"lrboost MSE: {round(mean_squared_error(y_test.ravel(), lrb_predictions), 2)}")
    
    >>> # Scenario 1: 75/25 train/test (Interpolation)
    >>> X_train, X_test, y_train, y_test = train_test_split(concrete[features],
                                                            concrete[target],
                                                            train_size=0.75, random_state=100)
    >>> evaluate_models(X_train, X_test, y_train, y_test)
    >>> # Ridge MSE: 112.4
    >>> # HistGradientBoostingRegressor MSE: 26.33
    >>> # lrboost MSE: 25.06
    
    >>> # Scenario 2: 50/50 train/test (Interpolation)
    >>> X_train, X_test, y_train, y_test = train_test_split(concrete[features],
                                                            concrete[target],
                                                            train_size=0.50, random_state=100)
    >>> evaluate_models(X_train, X_test, y_train, y_test)
    >>> # Ridge MSE: 107.6
    >>> # HistGradientBoostingRegressor MSE: 26.6
    >>> # lrboost MSE: 23.55
    
    >>> # Scenario 3: Training: CCS > 25, Testing: CCS <= 25 (Extrapolation)
    >>> train = concrete.loc[concrete['ccs'] > 25]
    >>> test = concrete.loc[concrete['ccs'] <= 25]
    >>> X_train = train[features]
    >>> y_train = train[target]
    >>> X_test = test[features]
    >>> y_test = test[target]
    >>> evaluate_models(X_train, X_test, y_train, y_test)
    >>> # Ridge MSE: 89.26
    >>> # HistGradientBoostingRegressor MSE: 4.21
    >>> # lrboost MSE: 3.7
    
  • With zero tuning of either the internal GBDT that lrboost fits to the residual or the “standard” GBDT, lrboost performs well.