Stacking¶
This notebook shows how to stack models using aikit.
Regression with OutSamplerTransformer¶
Let’s start by creating a simple regression dataset:
[1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=1000, n_features=30, n_informative=10, n_targets=1, random_state=123)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=123)
[2]:
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from aikit.pipeline import GraphPipeline
from aikit.models import OutSamplerTransformer
cv = 10
stacking_model = GraphPipeline(models={
    "rf": OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10), cv=cv),
    "lgbm": OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10), cv=cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123), cv=cv),
    "blender": Ridge()
}, edges=[("rf", "blender"), ("lgbm", "blender"), ("ridge", "blender")])
stacking_model.graphviz
[2]:
This model behaves like a regular sklearn regressor.
It can be fitted:
[3]:
stacking_model.fit(Xtrain, ytrain)
[3]:
GraphPipeline(edges=[('rf', 'blender'), ('lgbm', 'blender'),
                     ('ridge', 'blender')],
              models={'blender': Ridge(alpha=1.0, copy_X=True,
                                       fit_intercept=True, max_iter=None,
                                       normalize=False, random_state=None,
                                       solver='auto', tol=0.001),
                      'lgbm': OutSamplerTransformer(columns_prefix=None, cv=10,
                                                    desired_output_type=None,
                                                    model=LGBMRegressor(boosting_type='gbdt',
                                                                        class_weight=...
                                                                        n_jobs=None,
                                                                        oob_score=False,
                                                                        random_state=123,
                                                                        verbose=0,
                                                                        warm_start=False),
                                                    random_state=123),
                      'ridge': OutSamplerTransformer(columns_prefix=None, cv=10,
                                                     desired_output_type=None,
                                                     model=Ridge(alpha=1.0,
                                                                 copy_X=True,
                                                                 fit_intercept=True,
                                                                 max_iter=None,
                                                                 normalize=False,
                                                                 random_state=123,
                                                                 solver='auto',
                                                                 tol=0.001),
                                                     random_state=123)},
              no_concat_nodes=None, verbose=False)
You can predict:
[4]:
yhat_test = stacking_model.predict(Xtest)
[5]:
from sklearn.metrics import mean_squared_error
10_000 * mean_squared_error(ytest, yhat_test)
[5]:
9.978737586369565
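For reference, you can compare this against one of the base models fitted alone (a quick sketch, not part of the original notebook; it reuses the same 10_000 scaling for comparability):

# Hypothetical baseline: a single random forest with the same settings as the "rf" node
rf_alone = RandomForestRegressor(random_state=123, n_estimators=10)
rf_alone.fit(Xtrain, ytrain)
10_000 * mean_squared_error(ytest, rf_alone.predict(Xtest))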
Let’s describe what goes on during the fit:
- cross_val_predict is called on each model => this creates out-of-sample predictions for every observation
- each model is then re-fitted on the full data => so it is ready to score new observations
- the blender is fitted on the out-of-sample predictions of the 3 models
The ‘OutSamplerTransformer’ object implements the logic that creates the out-of-sample predictions, whereas GraphPipeline passes the transformations from one node to the next(s); the sketch below spells out the same logic.
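To make that concrete, here is a minimal sketch of the equivalent logic written directly with sklearn (an illustration of the idea only, not aikit’s actual internals):

import numpy as np
from sklearn.model_selection import cross_val_predict

# Out-of-sample predictions: each observation is predicted by a model that
# never saw it during training (this is what each OutSamplerTransformer outputs)
oos_rf = cross_val_predict(RandomForestRegressor(random_state=123, n_estimators=10), Xtrain, ytrain, cv=cv)
oos_lgbm = cross_val_predict(LGBMRegressor(random_state=123, n_estimators=10), Xtrain, ytrain, cv=cv)
oos_ridge = cross_val_predict(Ridge(random_state=123), Xtrain, ytrain, cv=cv)

# The blender is fitted on those out-of-sample predictions...
blender = Ridge().fit(np.column_stack([oos_rf, oos_lgbm, oos_ridge]), ytrain)

# ...and each base model is re-fitted on the full data so that, at prediction
# time, its output on new rows can be fed to the blender.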
With that logic you can do more complex things¶
Let’s say you have missing values to fill before feeding the data to the models (remark: this is not the case here).
You can just add a node at the top of the pipeline:
[6]:
from aikit.transformers import NumImputer
stacking_model = GraphPipeline(models={
    "imp": NumImputer(),
    "rf": OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10), cv=cv),
    "lgbm": OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10), cv=cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123), cv=cv),
    "blender": Ridge()
}, edges=[("imp", "rf", "blender"), ("imp", "lgbm", "blender"), ("imp", "ridge", "blender")])
stacking_model.graphviz
[6]:
Let’s say you want to pass to the blender the predictions of the models along with the original features.
You can just add another edge:
[7]:
from aikit.transformers import NumImputer
stacking_model = GraphPipeline(models={
    "imp": NumImputer(),
    "rf": OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10), cv=cv),
    "lgbm": OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10), cv=cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123), cv=cv),
    "blender": Ridge()
}, edges=[("imp", "rf", "blender"), ("imp", "lgbm", "blender"), ("imp", "ridge", "blender"), ("imp", "blender")])
stacking_model.graphviz
[7]:
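With this extra edge the blender is no longer fitted on the 3 out-of-sample predictions alone: it also receives the imputed original features (here 3 + 30 = 33 input columns). A quick sketch of fitting this variant (not run in the original notebook):

# The blender now sees the 3 prediction columns plus the 30 imputed features
stacking_model.fit(Xtrain, ytrain)
yhat_test = stacking_model.predict(Xtest)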
Example on a Classification task¶
[8]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _, _, _ = load_dataset(DatasetEnum.titanic)
Xtrain.head(10)
[8]:
|   | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA |
| 1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB |
| 2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN |
| 3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN |
| 4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN |
| 5 | 3 | Waelens, Mr. Achille | male | 22.0 | 0 | 0 | 345767 | 9.0000 | NaN | S | NaN | NaN | Antwerp, Belgium / Stanton, OH |
| 6 | 3 | Reed, Mr. James George | male | NaN | 0 | 0 | 362316 | 7.2500 | NaN | S | NaN | NaN | NaN |
| 7 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S | 8 | NaN | Brooklyn, NY |
| 8 | 1 | Smith, Mrs. Lucien Philip (Mary Eloise Hughes) | female | 18.0 | 1 | 0 | 13695 | 60.0000 | C31 | S | 6 | NaN | Huntington, WV |
| 9 | 1 | Rowe, Mr. Alfred G | male | 33.0 | 0 | 0 | 113790 | 26.5500 | NaN | S | NaN | 109.0 | London |
You can also stack models that work on different parts of the data, for example:¶
- a CountVectorizer + Logit model that works on the “text-like” columns
- along with a NumericalEncoder + RandomForestClassifier for the other columns
[9]:
from aikit.transformers import CountVectorizerWrapper, NumericalEncoder, NumImputer, ColumnsSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from aikit.models import OutSamplerTransformer
from aikit.pipeline import GraphPipeline
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(10, random_state=123, shuffle=True)
text_cols = ["name","ticket"]
non_text_cols = [c for c in Xtrain.columns if c not in text_cols]
gpipeline = GraphPipeline(models={
    "sel": ColumnsSelector(columns_to_use=non_text_cols),
    "enc": NumericalEncoder(columns_to_use="object"),
    "imp": NumImputer(),
    "vect": CountVectorizerWrapper(analyzer="word", columns_to_use=text_cols),
    "rf": OutSamplerTransformer(RandomForestClassifier(n_estimators=100, random_state=123), cv=cv),
    "logit": OutSamplerTransformer(LogisticRegression(random_state=123), cv=cv),
    "blender": LogisticRegression(random_state=123)
}, edges=[("sel", "enc", "imp", "rf", "blender"), ("vect", "logit", "blender")])
gpipeline.fit(Xtrain,y_train)
gpipeline.graphviz
[9]:
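Since the whole pipeline behaves like a sklearn estimator, you can also score it with an outer cross-validation (a hedged sketch, not in the original notebook; each outer fold re-runs the inner 10-fold CV of the OutSamplerTransformer nodes, so this is slow):

from sklearn.model_selection import cross_val_score

# Hypothetical evaluation: nested-CV ROC AUC of the full stacking classifier
scores = cross_val_score(gpipeline, Xtrain, y_train, cv=3, scoring="roc_auc")
print(scores.mean(), scores.std())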
Probabilities calibration (Platt’s scaling)¶
Using that object you can also re-calibrate your probabilities, with a method called ‘Platt’s scaling’ (https://en.wikipedia.org/wiki/Platt_scaling):
the probabilities of one model are fed to a Logistic Regression, which re-scales them.
[10]:
rf_rescaled = GraphPipeline(models={
    "sel": ColumnsSelector(columns_to_use=non_text_cols),
    "enc": NumericalEncoder(),
    "imp": NumImputer(),
    # class_weight="balanced" replaces the old "auto" option, which sklearn removed
    "rf": OutSamplerTransformer(RandomForestClassifier(class_weight="balanced"), cv=cv),
    "scaling": LogisticRegression()
}, edges=[("sel", "enc", "imp", "rf", "scaling")])
rf_rescaled.graphviz
[10]:
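To actually use this calibration pipeline (a hypothetical follow-up, not run in the original notebook), fit it and read the re-scaled probabilities from predict_proba:

# The LogisticRegression "scaling" node learns to map the random forest's
# out-of-sample probabilities to calibrated ones
rf_rescaled.fit(Xtrain, y_train)

# Calibrated probability of the positive class (in-sample here, for illustration only)
calibrated_proba = rf_rescaled.predict_proba(Xtrain)[:, 1]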