Stacking

This notebook shows you how to stack models using aikit.

Regression with OutSamplerTransformer

Let’s start by creating a simple Regression dataset

[1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=30, n_informative=10, n_targets=1, random_state=123)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.1, shuffle=True, random_state=123)
[2]:
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge

from aikit.pipeline import GraphPipeline
from aikit.models import OutSamplerTransformer

cv = 10

stacking_model = GraphPipeline(models = {
    "rf"   : OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10) , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10)         ,  cv = cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123)     , cv = cv),
    "blender":Ridge()
    }, edges = [("rf","blender"),("lgbm","blender"),("ridge","blender")])


stacking_model.graphviz
[2]:
../_images/notebooks_Stacking_3_2.svg

This model behaves like a regular sklearn regressor.

It can be fitted:

[3]:
stacking_model.fit(Xtrain, ytrain)
[3]:
GraphPipeline(edges=[('rf', 'blender'), ('lgbm', 'blender'),
                     ('ridge', 'blender')],
              models={'blender': Ridge(alpha=1.0, copy_X=True,
                                       fit_intercept=True, max_iter=None,
                                       normalize=False, random_state=None,
                                       solver='auto', tol=0.001),
                      'lgbm': OutSamplerTransformer(columns_prefix=None, cv=10,
                                                    desired_output_type=None,
                                                    model=LGBMRegressor(boosting_type='gbdt',
                                                                        class_weight=...
                                                                              n_jobs=None,
                                                                              oob_score=False,
                                                                              random_state=123,
                                                                              verbose=0,
                                                                              warm_start=False),
                                                  random_state=123),
                      'ridge': OutSamplerTransformer(columns_prefix=None, cv=10,
                                                     desired_output_type=None,
                                                     model=Ridge(alpha=1.0,
                                                                 copy_X=True,
                                                                 fit_intercept=True,
                                                                 max_iter=None,
                                                                 normalize=False,
                                                                 random_state=123,
                                                                 solver='auto',
                                                                 tol=0.001),
                                                     random_state=123)},
              no_concat_nodes=None, verbose=False)

You can predict:

[4]:
yhat_test = stacking_model.predict(Xtest)
[5]:
from sklearn.metrics import mean_squared_error
10_000 * mean_squared_error(ytest, yhat_test)
[5]:
9.978737586369565

Let’s describe what goes on during the fit:

  1. cross_val_predict is called on each model => this creates out-of-sample predictions for each observation
  2. each model is then re-fitted on the full data => to be ready when called for a new prediction
  3. the blender is fitted on the out-of-sample predictions of the 3 models to produce the final model

The ‘OutSamplerTransformer’ object implements the logic to create out-of-sample predictions, while GraphPipeline passes the transformations from one node to the next.
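For intuition, here is a minimal by-hand sketch of roughly the same logic using plain scikit-learn. It is only an illustration under the assumptions above, not aikit’s actual implementation; it reuses the estimators, data and cv defined earlier, and the names (base_models, oos, blender) are purely illustrative:

import numpy as np
from sklearn.model_selection import cross_val_predict

base_models = {
    "rf":    RandomForestRegressor(random_state=123, n_estimators=10),
    "lgbm":  LGBMRegressor(random_state=123, n_estimators=10),
    "ridge": Ridge(random_state=123),
}

# 1. out-of-sample predictions for every observation
oos = np.column_stack([
    cross_val_predict(model, Xtrain, ytrain, cv=cv) for model in base_models.values()
])

# 2. re-fit each base model on the full training data
for model in base_models.values():
    model.fit(Xtrain, ytrain)

# 3. fit the blender on the out-of-sample predictions
blender = Ridge().fit(oos, ytrain)

# at prediction time, each base model predicts and the blender combines
yhat = blender.predict(np.column_stack([m.predict(Xtest) for m in base_models.values()]))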

With that logic you can do more complex things.

Let’s say you have missing values to fill before feeding the data to the models (remark: this is not the case here).

You can just add a node at the top of the pipeline:

[6]:
from aikit.transformers import NumImputer

stacking_model = GraphPipeline(models = {
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10) , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10)         ,  cv = cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123)     , cv = cv),
    "blender":Ridge()
    }, edges = [("imp", "rf","blender"),("imp", "lgbm","blender"),("imp", "ridge","blender")])

stacking_model.graphviz

[6]:
../_images/notebooks_Stacking_10_0.svg

Let’s say you want to pass to the blender the predictions of the models along with the original features.

You can just add another edge:

[7]:
from aikit.transformers import NumImputer

stacking_model = GraphPipeline(models = {
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer(RandomForestRegressor(random_state=123, n_estimators=10) , cv = cv),
    "lgbm" : OutSamplerTransformer(LGBMRegressor(random_state=123, n_estimators=10)         ,  cv = cv),
    "ridge": OutSamplerTransformer(Ridge(random_state=123)     , cv = cv),
    "blender":Ridge()
    }, edges = [("imp", "rf","blender"),("imp", "lgbm","blender"),("imp", "ridge","blender"), ("imp", "blender")])

stacking_model.graphviz

[7]:
../_images/notebooks_Stacking_12_0.svg
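Continuing the by-hand sketch above (imputation is a no-op here since this dataset has no missing values), the extra ("imp", "blender") edge simply means the blender receives the out-of-sample predictions concatenated with the features. Again only an illustrative sketch, reusing the oos array defined earlier:

# blender input = out-of-sample predictions + (imputed) original features
blender_input = np.column_stack([oos, Xtrain])   # shape: (n_samples, 3 + 30)
blender = Ridge().fit(blender_input, ytrain)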

Example on a Classification task

[8]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _, _, _ = load_dataset(DatasetEnum.titanic)
Xtrain.head(10)
[8]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
5 3 Waelens, Mr. Achille male 22.0 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
6 3 Reed, Mr. James George male NaN 0 0 362316 7.2500 NaN S NaN NaN NaN
7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 8 NaN Brooklyn, NY
8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 6 NaN Huntington, WV
9 1 Rowe, Mr. Alfred G male 33.0 0 0 113790 26.5500 NaN S NaN 109.0 London

You can also stack models that work on different parts of the data, for example:

  • a CountVectorizer + Logit model that works on “text-like” columns
  • along with a NumericalEncoder + RandomForestClassifier for the other columns
[9]:
from aikit.transformers import CountVectorizerWrapper, NumericalEncoder, NumImputer, ColumnsSelector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from aikit.models import OutSamplerTransformer
from aikit.pipeline import GraphPipeline

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(10, random_state=123, shuffle=True)


text_cols     = ["name","ticket"]
non_text_cols = [c for c in Xtrain.columns if c not in text_cols]

gpipeline = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":OutSamplerTransformer(RandomForestClassifier(n_estimators=100, random_state=123),cv=cv),
    "logit":OutSamplerTransformer(LogisticRegression(random_state=123),cv=cv),
    "blender":LogisticRegression(random_state=123)
},
              edges = [("sel","enc","imp","rf", "blender"),("vect","logit","blender")])

gpipeline.fit(Xtrain,y_train)
gpipeline.graphviz

[9]:
../_images/notebooks_Stacking_16_1.svg
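As before, the fitted pipeline behaves like a regular sklearn classifier. A hypothetical usage sketch, assuming (as with a regular sklearn pipeline) that GraphPipeline exposes the blender’s predict_proba:

# stacked probability of the positive class for the training data
proba_train = gpipeline.predict_proba(Xtrain)[:, 1]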

Probability calibration (Platt’s scaling)

Using that object you can also re-calibrate your probabilities. This can be done using a method called ‘Platt’s scaling’ (https://en.wikipedia.org/wiki/Platt_scaling).

It consists of feeding the probabilities of one model to a Logistic Regression, which re-scales them.

[10]:
rf_rescaled = GraphPipeline(models = {
    "sel"  : ColumnsSelector(columns_to_use=non_text_cols),
    "enc"  : NumericalEncoder(),
    "imp"  : NumImputer(),
    "rf"   : OutSamplerTransformer( RandomForestClassifier(class_weight = "auto"), cv = cv),
    "scaling":LogisticRegression()
    }, edges = [('sel','enc','imp','rf','scaling')]
)
rf_rescaled.graphviz
[10]:
../_images/notebooks_Stacking_18_0.svg
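The calibrated probabilities are then simply the output of the ‘scaling’ node. A hypothetical usage sketch, reusing the titanic data loaded above (same assumption on predict_proba as before):

# fit the calibration pipeline, then read the re-scaled probabilities
rf_rescaled.fit(Xtrain, y_train)
proba_calibrated = rf_rescaled.predict_proba(Xtrain)[:, 1]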