GraphPipeline getting started

This notebook is here to show a few of the things that can be done with the package.

It doesn't mean that these are the things you should do on this particular dataset.

Let's load the Titanic dataset to try a few things

[1]:
import warnings
warnings.filterwarnings('ignore') # to remove gensim warning
[2]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
[3]:
Xtrain.head(20)
[3]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
5 3 Waelens, Mr. Achille male 22.0 0 0 345767 9.0000 NaN S NaN NaN Antwerp, Belgium / Stanton, OH
6 3 Reed, Mr. James George male NaN 0 0 362316 7.2500 NaN S NaN NaN NaN
7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S 8 NaN Brooklyn, NY
8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 1 0 13695 60.0000 C31 S 6 NaN Huntington, WV
9 1 Rowe, Mr. Alfred G male 33.0 0 0 113790 26.5500 NaN S NaN 109.0 London
10 3 Meo, Mr. Alfonzo male 55.5 0 0 A.5. 11206 8.0500 NaN S NaN 201.0 NaN
11 3 Abbott, Mr. Rossmore Edward male 16.0 1 1 C.A. 2673 20.2500 NaN S NaN 190.0 East Providence, RI
12 3 Elias, Mr. Dibo male NaN 0 0 2674 7.2250 NaN C NaN NaN NaN
13 2 Reynaldo, Ms. Encarnacion female 28.0 0 0 230434 13.0000 NaN S 9 NaN Spain
14 3 Khalil, Mr. Betros male NaN 1 0 2660 14.4542 NaN C NaN NaN NaN
15 1 Daniels, Miss. Sarah female 33.0 0 0 113781 151.5500 NaN S 8 NaN NaN
16 3 Ford, Miss. Robina Maggie 'Ruby' female 9.0 2 2 W./C. 6608 34.3750 NaN S NaN NaN Rotherfield, Sussex, England Essex Co, MA
17 3 Thorneycroft, Mrs. Percival (Florence Kate White) female NaN 1 0 376564 16.1000 NaN S 10 NaN NaN
18 3 Lennon, Mr. Denis male NaN 1 0 370371 15.5000 NaN Q NaN NaN NaN
19 3 de Pelsmaeker, Mr. Alfons male 16.0 0 0 345778 9.5000 NaN S NaN NaN NaN
[4]:
y_train[0:20]
[4]:
array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
      dtype=int64)

For now, let's ignore the name and ticket columns, which should probably be handled as text

[5]:
import pandas as pd
from aikit.transformers import TruncatedSVDWrapper, NumImputer, CountVectorizerWrapper, NumericalEncoder
from aikit.pipeline import GraphPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
Matplotlib won't work
[6]:
non_text_cols = [c for c in Xtrain.columns if c not in ("ticket","name")] # everything that is not text
non_text_cols
[6]:
['pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'cabin',
 'embarked',
 'boat',
 'body',
 'home_dest']
[7]:
gpipeline = GraphPipeline(models = { "enc":NumericalEncoder(),
                                     "imp":NumImputer(),
                                     "forest":RandomForestClassifier(n_estimators=100)
                                   },
                          edges = [("enc","imp","forest")])

gpipeline.fit(Xtrain.loc[:,non_text_cols],y_train)
gpipeline.graphviz
[7]:
../_images/notebooks_GraphPipeline_9_0.svg

Let’s do a cross-validation

[8]:
from aikit.cross_validation import cross_validation
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(10, random_state=123, shuffle=True)

cv_result = cross_validation(gpipeline, Xtrain.loc[:,non_text_cols], y_train,cv = cv,
                             scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    3.8s finished
[8]:
test_roc_auc test_accuracy test_neg_log_loss train_roc_auc train_accuracy train_neg_log_loss fit_time score_time n_test_samples fold_nb
0 0.997332 0.990476 -0.050391 0.999830 0.995758 -0.029559 0.200567 0.076803 105 0
1 0.968369 0.961905 -0.723250 0.999986 0.997879 -0.022651 0.192597 0.066856 105 1
2 0.983232 0.942857 -0.154483 0.999816 0.995758 -0.026256 0.200526 0.069852 105 2
3 1.000000 1.000000 -0.035742 0.999707 0.995758 -0.030825 0.210022 0.070714 105 3
4 0.996380 0.961905 -0.088300 0.999802 0.995758 -0.028642 0.199074 0.064868 105 4
5 0.991806 0.952381 -0.125793 0.999797 0.997879 -0.025816 0.193644 0.070361 105 5
6 1.000000 1.000000 -0.040940 0.999703 0.995758 -0.029609 0.215009 0.065831 105 6
7 0.996380 0.980952 -0.088508 0.999842 0.996819 -0.026614 0.184709 0.077791 105 7
8 0.992838 0.971154 -0.107017 0.999793 0.995763 -0.027610 0.187533 0.063796 104 8
9 0.999613 0.980769 -0.072026 0.999764 0.995763 -0.028887 0.199505 0.062847 104 9

This cross-validates the complete pipeline. The differences with the scikit-learn function are that:

* you can score more than one metric at a time
* you retrieve both train and test scores

[9]:
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[9]:
test_roc_auc         0.992595
test_accuracy        0.974240
test_neg_log_loss   -0.148645
dtype: float64

We can do the same thing, but select the columns directly within the pipeline:

[10]:
from aikit.transformers import ColumnsSelector
gpipeline2 = GraphPipeline(models = { "sel":ColumnsSelector(columns_to_use=non_text_cols),
                                      "enc":NumericalEncoder(columns_to_use="object"),
                                      "imp":NumImputer(),
                                      "forest":RandomForestClassifier(n_estimators=100, random_state=123)
                                    },
                         edges = [("sel","enc","imp","forest")])

gpipeline2.fit(Xtrain,y_train)
gpipeline2.graphviz
[10]:
../_images/notebooks_GraphPipeline_15_0.svg

Remark: 'columns_to_use="object"' tells aikit to encode the columns of dtype object; the rest are kept untouched
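To see which columns that rule will pick up, here is a quick check with plain pandas (for illustration only):

# Columns of dtype 'object' among the non-text columns: these are the
# ones NumericalEncoder will encode; the numerical ones pass through.
Xtrain.loc[:, non_text_cols].select_dtypes("object").columns.tolist()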

[11]:
cv_result = cross_validation(gpipeline2,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    4.0s finished
[11]:
test_roc_auc         0.991698
test_accuracy        0.972335
test_neg_log_loss   -0.178280
dtype: float64

Now let’s see what we can do with the columns we excluded. We could craft features from them, but let’s try to use them as text directly.

[12]:
text_cols = ["ticket","name"]
vect = CountVectorizerWrapper(analyzer="word", columns_to_use=text_cols)
vect.fit(Xtrain,y_train)
[12]:
CountVectorizerWrapper(analyzer='word', column_prefix='BAG',
                       columns_to_use=['ticket', 'name'],
                       desired_output_type='SparseArray',
                       drop_unused_columns=True, drop_used_columns=True,
                       max_df=1.0, max_features=None, min_df=1, ngram_range=1,
                       regex_match=False, tfidf=False, vocabulary=None)

Remark: aikit's CountVectorizerWrapper can directly work on 2 (or more) columns; there is no need for a FeatureUnion or anything of the sort
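For comparison, here is roughly what the same thing would look like in plain scikit-learn (a sketch: one CountVectorizer per column, glued together with a ColumnTransformer):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# One vectorizer per text column, outputs concatenated side by side
sk_vect = ColumnTransformer([
    ("ticket", CountVectorizer(analyzer="word"), "ticket"),
    ("name", CountVectorizer(analyzer="word"), "name"),
])
# sk_vect.fit_transform(Xtrain) yields a similar sparse matrix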

[13]:
features = vect.get_feature_names()
features[0:20] + ["..."] + features[-20:]
[13]:
['ticket__BAG__10482',
 'ticket__BAG__110152',
 'ticket__BAG__110413',
 'ticket__BAG__110465',
 'ticket__BAG__110469',
 'ticket__BAG__110489',
 'ticket__BAG__110564',
 'ticket__BAG__110813',
 'ticket__BAG__111163',
 'ticket__BAG__111240',
 'ticket__BAG__111320',
 'ticket__BAG__111361',
 'ticket__BAG__111369',
 'ticket__BAG__111426',
 'ticket__BAG__111427',
 'ticket__BAG__112050',
 'ticket__BAG__112052',
 'ticket__BAG__112053',
 'ticket__BAG__112058',
 'ticket__BAG__11206',
 '...',
 'name__BAG__woolf',
 'name__BAG__woolner',
 'name__BAG__worth',
 'name__BAG__wright',
 'name__BAG__wyckoff',
 'name__BAG__yarred',
 'name__BAG__yasbeck',
 'name__BAG__ylio',
 'name__BAG__yoto',
 'name__BAG__young',
 'name__BAG__youseff',
 'name__BAG__yousif',
 'name__BAG__youssef',
 'name__BAG__yousseff',
 'name__BAG__yrois',
 'name__BAG__zabour',
 'name__BAG__zakarian',
 'name__BAG__zebley',
 'name__BAG__zenni',
 'name__BAG__zillah']

The vectorizer directly encodes the two columns into a single sparse matrix

[14]:
xx_res = vect.transform(Xtrain)
xx_res
[14]:
<1048x2440 sparse matrix of type '<class 'numpy.int32'>'
        with 5414 stored elements in COOrdinate format>
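To eyeball the result, a small slice can be densified (plain scipy/pandas, for illustration only):

# COO matrices do not support slicing, so convert to CSR first,
# then densify only the first few rows.
sub = pd.DataFrame(xx_res.tocsr()[:5].toarray(), columns=vect.get_feature_names())
sub.loc[:, (sub != 0).any(axis=0)]  # keep only the columns with a non-zero count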

Again, let's create a GraphPipeline and cross-validate it

[15]:
gpipeline3 = GraphPipeline(models = {"vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
                                     "logit":LogisticRegression(solver="liblinear", random_state=123)},
                           edges=[("vect","logit")])
gpipeline3.fit(Xtrain,y_train)
gpipeline3.graphviz
[15]:
../_images/notebooks_GraphPipeline_25_0.svg
[16]:
cv_result = cross_validation(gpipeline3, Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.9s finished
[16]:
test_roc_auc         0.850918
test_accuracy        0.819679
test_neg_log_loss   -0.451681
dtype: float64

We can also try a "bag of characters" approach

[17]:
gpipeline4 = GraphPipeline(models = {
        "vect": CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
        "logit": LogisticRegression(solver="liblinear", random_state=123) }, edges=[("vect","logit")])
gpipeline4.fit(Xtrain,y_train)
gpipeline4.graphviz
[17]:
../_images/notebooks_GraphPipeline_28_0.svg
[18]:
cv_result = cross_validation(gpipeline4,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    5.9s finished
[18]:
test_roc_auc         0.849773
test_accuracy        0.813956
test_neg_log_loss   -0.559254
dtype: float64

Now let’s use all the columns

[19]:
gpipeline5 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect","rf")])
gpipeline5.fit(Xtrain,y_train)
gpipeline5.graphviz
[19]:
../_images/notebooks_GraphPipeline_31_0.svg

This model uses both sets of columns:

* bag-of-words features
* categorical/numerical features
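Internally the 'rf' node receives the horizontal concatenation of the two branches; this can be checked with get_input_features_at_node (the same method used further down for the feature importances):

# Names of the features entering the RandomForest node,
# coming from both the encoded/imputed and the bag-of-words branches
feats = gpipeline5.get_input_features_at_node("rf")
len(feats), feats[:3], feats[-3:]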

[20]:
cv_result = cross_validation(gpipeline5,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   11.0s finished
[20]:
test_roc_auc         0.992779
test_accuracy        0.968507
test_neg_log_loss   -0.173236
dtype: float64

We can also use both bag-of-characters and bag-of-words features

[21]:
gpipeline6 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_char":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_word":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect_char","rf"),("vect_word","rf")])
gpipeline6.fit(Xtrain,y_train)
gpipeline6.graphviz
[21]:
../_images/notebooks_GraphPipeline_35_0.svg
[22]:
cv_result = cross_validation(gpipeline6,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.9s finished
[22]:
test_roc_auc         0.947360
test_accuracy        0.843516
test_neg_log_loss   -0.325666
dtype: float64

Maybe we can try an SVD to limit the dimensionality of the bag-of-characters/words features

[23]:
gpipeline7 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=100, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel", "enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf")])
gpipeline7.fit(Xtrain,y_train)
gpipeline7.graphviz
[23]:
../_images/notebooks_GraphPipeline_38_0.svg
[24]:
cv_result = cross_validation(gpipeline7,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   23.4s finished
[24]:
test_roc_auc         0.992953
test_accuracy        0.972326
test_neg_log_loss   -0.167037
dtype: float64

We can even give the model both the 'SVD' columns AND the raw bag-of-words/characters columns

[25]:
gpipeline8 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=100, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
            edges = [("sel","enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf"),("vect_word","rf"),("vect_char","rf")])

gpipeline8.graphviz
[25]:
../_images/notebooks_GraphPipeline_41_0.svg
[26]:
cv_result = cross_validation(gpipeline8,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   22.1s finished
[26]:
test_roc_auc         0.941329
test_accuracy        0.834011
test_neg_log_loss   -0.334545
dtype: float64

Instead of the 'SVD' we can add a layer that filters columns…

[27]:
from aikit.transformers import FeaturesSelectorClassifier
[28]:
gpipeline9 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "selector":FeaturesSelectorClassifier(n_components=20),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect_word","selector","rf"),("vect_char","selector","rf")])

gpipeline9.graphviz
[28]:
../_images/notebooks_GraphPipeline_45_0.svg

Retrieve feature importance

Let's use this more complicated example to show how to retrieve the feature importances

[29]:
gpipeline9.fit(Xtrain, y_train)

df_imp = pd.Series(gpipeline9.models["rf"].feature_importances_,
                  index = gpipeline9.get_input_features_at_node("rf"))
df_imp.sort_values(ascending=False,inplace=True)
df_imp
[29]:
boat____null__             3.839758e-01
sex__female                3.816301e-02
name__BAG__mr              3.715979e-02
name__BAG__mr.             3.636483e-02
fare                       3.419880e-02
name__BAG__mr.             3.133609e-02
sex__male                  2.962421e-02
name__BAG__r.              2.910019e-02
name__BAG__s.              2.776609e-02
boat__15                   2.672268e-02
age                        2.643157e-02
name__BAG__s.              2.500470e-02
name__BAG__ mr.            2.249752e-02
boat__13                   1.863079e-02
boat____default__          1.711391e-02
pclass                     1.665125e-02
name__BAG__                1.597853e-02
sibsp                      1.524516e-02
home_dest____null__        1.015056e-02
boat__7                    9.817018e-03
home_dest____default__     9.534058e-03
boat__C                    9.453317e-03
cabin____null__            8.265959e-03
cabin____default__         7.290940e-03
parch                      7.138940e-03
embarked__S                6.643220e-03
boat__5                    6.206360e-03
name__BAG__iss.            6.139824e-03
embarked__C                6.040638e-03
boat__3                    5.547742e-03
name__BAG__(               5.352397e-03
name__BAG__mr              5.260205e-03
body_isnull                4.829877e-03
name__BAG__ (              4.360392e-03
boat__16                   4.245866e-03
boat__9                    4.224166e-03
boat__D                    4.194419e-03
name__BAG__ss              4.076246e-03
embarked__Q                4.047912e-03
name__BAG__mrs             3.602001e-03
body                       2.955222e-03
name__BAG__rs              2.899086e-03
name__BAG__rs.             2.869114e-03
age_isnull                 2.859144e-03
boat__14                   2.809765e-03
boat__10                   2.695927e-03
name__BAG__rs.             2.165917e-03
boat__12                   2.103210e-03
name__BAG__mrs.            2.064884e-03
home_dest__New York, NY    1.799501e-03
boat__11                   1.495248e-03
name__BAG__miss            1.054318e-03
name__BAG__mrs             9.616334e-04
boat__4                    9.420111e-04
boat__6                    8.184419e-04
home_dest__London          7.515602e-04
boat__8                    3.679950e-04
fare_isnull                2.438310e-09
dtype: float64
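Since the generated names follow a 'column__modality' pattern (visible above), the importances can be summed back per original column, for instance:

# Aggregate the importances per original column, relying on the
# 'column__...' naming convention of the generated features.
original_col = [name.split("__")[0] for name in df_imp.index]
df_imp.groupby(original_col).sum().sort_values(ascending=False)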
[30]:
cv_result = cross_validation(gpipeline9,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   15.0s finished
[30]:
test_roc_auc         0.994108
test_accuracy        0.973288
test_neg_log_loss   -0.153255
dtype: float64
[31]:
gpipeline10 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=10),
    "selector":FeaturesSelectorClassifier(n_components=10, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),
                       ("vect_word","selector","rf"),
                       ("vect_char","selector","rf"),
                       ("vect_word","svd","rf"),
                       ("vect_char","svd","rf")])

gpipeline10.fit(Xtrain,y_train)
gpipeline10.graphviz
[31]:
../_images/notebooks_GraphPipeline_49_0.svg

Here is what this model does:

* categorical columns are encoded ('enc')
* missing values are filled ('imp')
* bag-of-words and bag-of-characters features are created from the two text columns
* an SVD is applied to those
* a selector keeps the most important bag-of-words/characters features
* everything is fed to a RandomForest
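Once fitted, the whole graph behaves like a regular scikit-learn estimator, so predictions can be obtained end-to-end on the raw DataFrame (a sketch, assuming as usual that the final node's prediction methods are exposed):

# The pipeline was fitted above; the final node is a classifier,
# so the standard prediction methods apply to the raw data.
proba = gpipeline10.predict_proba(Xtrain)
proba[:5]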

[32]:
cv_result = cross_validation(gpipeline10,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   16.9s finished
[32]:
test_roc_auc         0.994333
test_accuracy        0.975201
test_neg_log_loss   -0.143788
dtype: float64

As we saw, the GraphPipeline allows flexibility in the creation of models, and several choices can easily be tested.

Again, these are not necessarily the best possible choices for this dataset; the examples are here to illustrate the capabilities.

Better scores could be obtained by adjusting the hyper-parameters and/or the models/transformers, and by creating new features.
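For instance, a hyper-parameter search could be sketched like this (assuming GraphPipeline exposes its nodes' parameters with the usual sklearn 'node__param' convention; the parameter names below are illustrative):

from sklearn.model_selection import RandomizedSearchCV

# Hypothetical parameter grid, assuming the 'node__param' convention
search = RandomizedSearchCV(
    gpipeline10,
    param_distributions={"svd__n_components": [5, 10, 20, 50],
                         "rf__n_estimators": [100, 200, 500]},
    n_iter=5, cv=cv, scoring="roc_auc", random_state=123)
# search.fit(Xtrain, y_train)  # commented out: refits the pipeline many times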
