GraphPipeline getting started

This notebook shows a few things that can be done with the package.

That doesn’t mean these are the things you should do on this particular dataset.

Let’s load the Titanic dataset to try a few things.

[1]:

import warnings
warnings.filterwarnings('ignore') # to remove gensim warning

[2]:

from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)

[3]:

Xtrain.head(20)

[3]:

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home_dest
0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S	NaN	175.0	Dorchester, MA
1	1	Fortune, Mr. Mark	male	64.0	1	4	19950	263.0000	C23 C25 C27	S	NaN	NaN	Winnipeg, MB
2	1	Sagesser, Mlle. Emma	female	24.0	0	0	PC 17477	69.3000	B35	C	9	NaN	NaN
3	3	Panula, Master. Urho Abraham	male	2.0	4	1	3101295	39.6875	NaN	S	NaN	NaN	NaN
4	1	Maioni, Miss. Roberta	female	16.0	0	0	110152	86.5000	B79	S	8	NaN	NaN
5	3	Waelens, Mr. Achille	male	22.0	0	0	345767	9.0000	NaN	S	NaN	NaN	Antwerp, Belgium / Stanton, OH
6	3	Reed, Mr. James George	male	NaN	0	0	362316	7.2500	NaN	S	NaN	NaN	NaN
7	1	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	0	0	17466	25.9292	D17	S	8	NaN	Brooklyn, NY
8	1	Smith, Mrs. Lucien Philip (Mary Eloise Hughes)	female	18.0	1	0	13695	60.0000	C31	S	6	NaN	Huntington, WV
9	1	Rowe, Mr. Alfred G	male	33.0	0	0	113790	26.5500	NaN	S	NaN	109.0	London
10	3	Meo, Mr. Alfonzo	male	55.5	0	0	A.5. 11206	8.0500	NaN	S	NaN	201.0	NaN
11	3	Abbott, Mr. Rossmore Edward	male	16.0	1	1	C.A. 2673	20.2500	NaN	S	NaN	190.0	East Providence, RI
12	3	Elias, Mr. Dibo	male	NaN	0	0	2674	7.2250	NaN	C	NaN	NaN	NaN
13	2	Reynaldo, Ms. Encarnacion	female	28.0	0	0	230434	13.0000	NaN	S	9	NaN	Spain
14	3	Khalil, Mr. Betros	male	NaN	1	0	2660	14.4542	NaN	C	NaN	NaN	NaN
15	1	Daniels, Miss. Sarah	female	33.0	0	0	113781	151.5500	NaN	S	8	NaN	NaN
16	3	Ford, Miss. Robina Maggie 'Ruby'	female	9.0	2	2	W./C. 6608	34.3750	NaN	S	NaN	NaN	Rotherfield, Sussex, England Essex Co, MA
17	3	Thorneycroft, Mrs. Percival (Florence Kate White)	female	NaN	1	0	376564	16.1000	NaN	S	10	NaN	NaN
18	3	Lennon, Mr. Denis	male	NaN	1	0	370371	15.5000	NaN	Q	NaN	NaN	NaN
19	3	de Pelsmaeker, Mr. Alfons	male	16.0	0	0	345778	9.5000	NaN	S	NaN	NaN	NaN

[4]:

y_train[0:20]

[4]:

array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
      dtype=int64)

For now, let’s ignore the name and ticket columns, which should probably be handled as text.

[5]:

import pandas as pd
from aikit.transformers import TruncatedSVDWrapper, NumImputer, CountVectorizerWrapper, NumericalEncoder
from aikit.pipeline import GraphPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

Matplotlib won't work

[6]:

non_text_cols = [c for c in Xtrain.columns if c not in ("ticket","name")] # everything that is not text
non_text_cols

[6]:

['pclass',
 'sex',
 'age',
 'sibsp',
 'parch',
 'fare',
 'cabin',
 'embarked',
 'boat',
 'body',
 'home_dest']

[7]:

gpipeline = GraphPipeline(models = { "enc":NumericalEncoder(),
                                     "imp":NumImputer(),
                                     "forest":RandomForestClassifier(n_estimators=100)
                                   },
                          edges = [("enc","imp","forest")])

gpipeline.fit(Xtrain.loc[:,non_text_cols],y_train)
gpipeline.graphviz

[7]:

../_images/notebooks_GraphPipeline_9_0.svg
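Each tuple in `edges` describes a chain of nodes. As a rough illustration in plain Python (this is not aikit's internal code, and the helper name `expand_chains` is hypothetical), the chain `("enc","imp","forest")` can be read as two directed links, enc → imp and imp → forest:

```python
def expand_chains(edges):
    """Turn chains such as ("a", "b", "c") into pairwise links [("a", "b"), ("b", "c")]."""
    links = []
    for chain in edges:
        # zip pairs each node with its successor in the chain
        links.extend(zip(chain[:-1], chain[1:]))
    return links


links = expand_chains([("enc", "imp", "forest")])
print(links)  # [('enc', 'imp'), ('imp', 'forest')]
```

Reading the edges this way makes it easier to follow the branching pipelines further down, where several chains end at the same model.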

Let’s do a cross-validation

[8]:

from aikit.cross_validation import cross_validation
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(10, random_state=123, shuffle=True)

cv_result = cross_validation(gpipeline, Xtrain.loc[:,non_text_cols], y_train,cv = cv,
                             scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    3.8s finished

[8]:

	test_roc_auc	test_accuracy	test_neg_log_loss	train_roc_auc	train_accuracy	train_neg_log_loss	fit_time	score_time	n_test_samples	fold_nb
0	0.997332	0.990476	-0.050391	0.999830	0.995758	-0.029559	0.200567	0.076803	105	0
1	0.968369	0.961905	-0.723250	0.999986	0.997879	-0.022651	0.192597	0.066856	105	1
2	0.983232	0.942857	-0.154483	0.999816	0.995758	-0.026256	0.200526	0.069852	105	2
3	1.000000	1.000000	-0.035742	0.999707	0.995758	-0.030825	0.210022	0.070714	105	3
4	0.996380	0.961905	-0.088300	0.999802	0.995758	-0.028642	0.199074	0.064868	105	4
5	0.991806	0.952381	-0.125793	0.999797	0.997879	-0.025816	0.193644	0.070361	105	5
6	1.000000	1.000000	-0.040940	0.999703	0.995758	-0.029609	0.215009	0.065831	105	6
7	0.996380	0.980952	-0.088508	0.999842	0.996819	-0.026614	0.184709	0.077791	105	7
8	0.992838	0.971154	-0.107017	0.999793	0.995763	-0.027610	0.187533	0.063796	104	8
9	0.999613	0.980769	-0.072026	0.999764	0.995763	-0.028887	0.199505	0.062847	104	9

This cross-validates the complete pipeline. Compared with sklearn’s cross_val_score function:

* you can score more than one metric at a time
* you retrieve both train and test scores
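To make the idea concrete, here is a minimal pure-Python sketch of a cross-validation loop that records several metrics on both the train and the test part of each fold. It is only an illustration under simplifying assumptions: the folds are shuffled but not stratified, the "model" is a toy majority-class predictor, and the names `k_fold_indices` and `SCORERS` are hypothetical, not part of aikit or sklearn.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=123):
    """Yield (train, test) index lists for a shuffled, non-stratified k-fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def error_rate(y_true, y_pred):
    return 1.0 - accuracy(y_true, y_pred)


SCORERS = {"accuracy": accuracy, "error_rate": error_rate}

# First 20 labels of y_train, as displayed earlier in the notebook
y = [0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0]

rows = []
for fold_nb, (train, test) in enumerate(k_fold_indices(len(y), n_splits=5)):
    y_tr = [y[i] for i in train]
    y_te = [y[i] for i in test]
    # Toy "model": always predict the majority class of the training fold
    majority = max(set(y_tr), key=y_tr.count)
    row = {"fold_nb": fold_nb, "n_test_samples": len(test)}
    for name, scorer in SCORERS.items():
        row["test_" + name] = scorer(y_te, [majority] * len(y_te))
        row["train_" + name] = scorer(y_tr, [majority] * len(y_tr))
    rows.append(row)

for row in rows:
    print(row)
```

Each row has the same shape as a line of the result table: one entry per metric for the test fold, one per metric for the train folds, plus the fold number and test size.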

[9]:

cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[9]:

test_roc_auc         0.992595
test_accuracy        0.974240
test_neg_log_loss   -0.148645
dtype: float64
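The mean hides how much the folds differ from one another. As a small sketch using only the standard library, the per-fold `test_accuracy` values copied from the table above give the same mean together with a measure of spread:

```python
from statistics import mean, stdev

# test_accuracy per fold, copied from the cross-validation table above
test_accuracy = [0.990476, 0.961905, 0.942857, 1.000000, 0.961905,
                 0.952381, 1.000000, 0.980952, 0.971154, 0.980769]

avg = mean(test_accuracy)
spread = stdev(test_accuracy)  # sample standard deviation across the 10 folds
print(f"test_accuracy: {avg:.6f} +/- {spread:.6f}")
```

On the DataFrame itself, calling `.std()` instead of `.mean()` on the same selection should give the equivalent information for all metrics at once.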

We can do the same, but select the columns directly in the pipeline:

[10]:

from aikit.transformers import ColumnsSelector
gpipeline2 = GraphPipeline(models = { "sel":ColumnsSelector(columns_to_use=non_text_cols),
                                      "enc":NumericalEncoder(columns_to_use="object"),
                                      "imp":NumImputer(),
                                      "forest":RandomForestClassifier(n_estimators=100, random_state=123)
                                    },
                         edges = [("sel","enc","imp","forest")])

gpipeline2.fit(Xtrain,y_train)
gpipeline2.graphviz

[10]:

../_images/notebooks_GraphPipeline_15_0.svg

Remark: columns_to_use="object" tells aikit to encode the columns of type object and leave the rest untouched.
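The following plain-Python sketch illustrates that behaviour on two rows taken from the table at the top of the notebook. It is an assumption-laden illustration, not aikit's implementation: the function name `encode_object_columns` is hypothetical, and the output names (`sex__female`, `boat____null__`) simply mimic the feature names that appear later in this notebook.

```python
def encode_object_columns(rows):
    """One-hot encode string-valued columns; pass every other column through."""
    columns = list(rows[0])
    # A column counts as "object" here if any of its values is a str
    object_cols = [c for c in columns if any(isinstance(r[c], str) for r in rows)]
    categories = {
        c: sorted({r[c] for r in rows if r[c] is not None}) for c in object_cols
    }
    has_null = {c: any(r[c] is None for r in rows) for c in object_cols}
    encoded = []
    for r in rows:
        out = {}
        for c in columns:
            if c not in object_cols:
                out[c] = r[c]  # numerical column left untouched
                continue
            for value in categories[c]:
                out[f"{c}__{value}"] = int(r[c] == value)
            if has_null[c]:
                # Missing values get their own indicator column
                out[f"{c}____null__"] = int(r[c] is None)
        encoded.append(out)
    return encoded


rows = [
    {"sex": "male", "age": 54.0, "boat": None},
    {"sex": "female", "age": 24.0, "boat": "9"},
]
encoded = encode_object_columns(rows)
print(encoded[0])
```

Note that `age` passes through unchanged, while `sex` and `boat` are each replaced by indicator columns.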

[11]:

cv_result = cross_validation(gpipeline2,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    4.0s finished

[11]:

test_roc_auc         0.991698
test_accuracy        0.972335
test_neg_log_loss   -0.178280
dtype: float64

Now let’s see what we can do with the columns we excluded. We could craft features from them, but let’s try to use them as text directly.

[12]:

text_cols = ["ticket","name"]
vect = CountVectorizerWrapper(analyzer="word", columns_to_use=text_cols)
vect.fit(Xtrain,y_train)

[12]:

CountVectorizerWrapper(analyzer='word', column_prefix='BAG',
                       columns_to_use=['ticket', 'name'],
                       desired_output_type='SparseArray',
                       drop_unused_columns=True, drop_used_columns=True,
                       max_df=1.0, max_features=None, min_df=1, ngram_range=1,
                       regex_match=False, tfidf=False, vocabulary=None)

Remark: the aikit CountVectorizerWrapper can work directly on 2 (or more) columns; there is no need for a FeatureUnion or something of the sort.

[13]:

features = vect.get_feature_names()
features[0:20] + ["..."] + features[-20:]

[13]:

['ticket__BAG__10482',
 'ticket__BAG__110152',
 'ticket__BAG__110413',
 'ticket__BAG__110465',
 'ticket__BAG__110469',
 'ticket__BAG__110489',
 'ticket__BAG__110564',
 'ticket__BAG__110813',
 'ticket__BAG__111163',
 'ticket__BAG__111240',
 'ticket__BAG__111320',
 'ticket__BAG__111361',
 'ticket__BAG__111369',
 'ticket__BAG__111426',
 'ticket__BAG__111427',
 'ticket__BAG__112050',
 'ticket__BAG__112052',
 'ticket__BAG__112053',
 'ticket__BAG__112058',
 'ticket__BAG__11206',
 '...',
 'name__BAG__woolf',
 'name__BAG__woolner',
 'name__BAG__worth',
 'name__BAG__wright',
 'name__BAG__wyckoff',
 'name__BAG__yarred',
 'name__BAG__yasbeck',
 'name__BAG__ylio',
 'name__BAG__yoto',
 'name__BAG__young',
 'name__BAG__youseff',
 'name__BAG__yousif',
 'name__BAG__youssef',
 'name__BAG__yousseff',
 'name__BAG__yrois',
 'name__BAG__zabour',
 'name__BAG__zakarian',
 'name__BAG__zebley',
 'name__BAG__zenni',
 'name__BAG__zillah']

The vectorizer encodes both columns directly.
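As a rough sketch of what a per-column bag of words does, the pure-Python code below builds one vocabulary per column and prefixes each token with the column name, in the same style as the feature names listed above. The tokenisation (lowercase, words of two or more characters) follows sklearn's default token pattern; whether aikit uses exactly the same rule is an assumption, and the helper names `fit_vocabulary` and `transform` are hypothetical.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of 2+ characters, like sklearn's default pattern


def fit_vocabulary(rows, columns):
    """Collect one sorted vocabulary per column and return prefixed feature names."""
    features = []
    for col in columns:
        tokens = set()
        for row in rows:
            tokens.update(TOKEN.findall(row[col].lower()))
        features.extend(f"{col}__BAG__{tok}" for tok in sorted(tokens))
    return features


def transform(rows, columns, features):
    """Count, for each row, how often each feature's token appears in its column."""
    matrix = []
    for row in rows:
        counts = Counter()
        for col in columns:
            for tok in TOKEN.findall(row[col].lower()):
                counts[f"{col}__BAG__{tok}"] += 1
        matrix.append([counts[f] for f in features])
    return matrix


# Two rows copied from the Xtrain preview (index 10 and 11)
rows = [
    {"ticket": "A.5. 11206", "name": "Meo, Mr. Alfonzo"},
    {"ticket": "C.A. 2673", "name": "Abbott, Mr. Rossmore Edward"},
]
features = fit_vocabulary(rows, ["ticket", "name"])
matrix = transform(rows, ["ticket", "name"], features)
print(features)
print(matrix)
```

The ticket "A.5. 11206" only contributes the token `11206`, because the single-character pieces `a` and `5` are dropped by the pattern.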

[14]:

xx_res = vect.transform(Xtrain)
xx_res

[14]:

<1048x2440 sparse matrix of type '<class 'numpy.int32'>'
        with 5414 stored elements in COOrdinate format>

Again let’s create a GraphPipeline to cross-validate

[15]:

gpipeline3 = GraphPipeline(models = {"vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
                                     "logit":LogisticRegression(solver="liblinear", random_state=123)},
                           edges=[("vect","logit")])
gpipeline3.fit(Xtrain,y_train)
gpipeline3.graphviz

[15]:

../_images/notebooks_GraphPipeline_25_0.svg

[16]:

cv_result = cross_validation(gpipeline3, Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.9s finished

[16]:

test_roc_auc         0.850918
test_accuracy        0.819679
test_neg_log_loss   -0.451681
dtype: float64

We can also try a “bag of char”.

[17]:

gpipeline4 = GraphPipeline(models = {
        "vect": CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
        "logit": LogisticRegression(solver="liblinear", random_state=123) }, edges=[("vect","logit")])
gpipeline4.fit(Xtrain,y_train)
gpipeline4.graphviz

[17]:

../_images/notebooks_GraphPipeline_28_0.svg
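A bag of char counts character n-grams instead of words. With `ngram_range=(1,4)`, every substring of 1 to 4 characters becomes a candidate feature. The sketch below is a plain-Python illustration (the helper `char_ngrams` is hypothetical and the whitespace handling is a simplifying assumption), applied to one name from the dataset:

```python
from collections import Counter


def char_ngrams(text, ngram_range=(1, 4)):
    """Count every character n-gram of the lowercased text for n in ngram_range."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


grams = char_ngrams("Meo, Mr. Alfonzo")
print(grams["mr."], grams[" mr."], grams["o"])
print(len(grams), "distinct n-grams")
```

This explains why the number of features grows quickly with character n-grams, and why fragments such as `mr.` or ` mr.` can show up as separate features.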

[18]:

cv_result = cross_validation(gpipeline4,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    5.9s finished

[18]:

test_roc_auc         0.849773
test_accuracy        0.813956
test_neg_log_loss   -0.559254
dtype: float64

Now let’s use all the columns

[19]:

gpipeline5 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect","rf")])
gpipeline5.fit(Xtrain,y_train)
gpipeline5.graphviz

[19]:

../_images/notebooks_GraphPipeline_31_0.svg

This model uses both sets of columns:

* bag of words
* categorical/numerical features

[20]:

cv_result = cross_validation(gpipeline5,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   11.0s finished

[20]:

test_roc_auc         0.992779
test_accuracy        0.968507
test_neg_log_loss   -0.173236
dtype: float64

We can also use both Bag of Char and Bag of Word

[21]:

gpipeline6 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect_char","rf"),("vect_word","rf")])
gpipeline6.fit(Xtrain,y_train)
gpipeline6.graphviz

[21]:

../_images/notebooks_GraphPipeline_35_0.svg

[22]:

cv_result = cross_validation(gpipeline6,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.9s finished

[22]:

test_roc_auc         0.947360
test_accuracy        0.843516
test_neg_log_loss   -0.325666
dtype: float64

Maybe we can try an SVD to reduce the dimension of the bag of char/word features.

[23]:

gpipeline7 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=100, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel", "enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf")])
gpipeline7.fit(Xtrain,y_train)
gpipeline7.graphviz

[23]:

../_images/notebooks_GraphPipeline_38_0.svg

[24]:

cv_result = cross_validation(gpipeline7,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   23.4s finished

[24]:

test_roc_auc         0.992953
test_accuracy        0.972326
test_neg_log_loss   -0.167037
dtype: float64

We can even add ‘SVD’ columns AND bag of word/char columns

[25]:

gpipeline8 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=100, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
            edges = [("sel","enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf"),("vect_word","rf"),("vect_char","rf")])

gpipeline8.graphviz

[25]:

../_images/notebooks_GraphPipeline_41_0.svg

[26]:

cv_result = cross_validation(gpipeline8,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   22.1s finished

[26]:

test_roc_auc         0.941329
test_accuracy        0.834011
test_neg_log_loss   -0.334545
dtype: float64

Instead of ‘SVD’ we can add a layer that filters columns…

[27]:

from aikit.transformers import FeaturesSelectorClassifier

[28]:

gpipeline9 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "selector":FeaturesSelectorClassifier(n_components=20),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect_word","selector","rf"),("vect_char","selector","rf")])

gpipeline9.graphviz

[28]:

../_images/notebooks_GraphPipeline_45_0.svg

Retrieve feature importance

Let’s use that complicated example to show how to retrieve the feature importance

[29]:

gpipeline9.fit(Xtrain, y_train)

df_imp = pd.Series(gpipeline9.models["rf"].feature_importances_,
                  index = gpipeline9.get_input_features_at_node("rf"))
df_imp.sort_values(ascending=False,inplace=True)
df_imp

[29]:

boat____null__             3.839758e-01
sex__female                3.816301e-02
name__BAG__mr              3.715979e-02
name__BAG__mr.             3.636483e-02
fare                       3.419880e-02
name__BAG__mr.             3.133609e-02
sex__male                  2.962421e-02
name__BAG__r.              2.910019e-02
name__BAG__s.              2.776609e-02
boat__15                   2.672268e-02
age                        2.643157e-02
name__BAG__s.              2.500470e-02
name__BAG__ mr.            2.249752e-02
boat__13                   1.863079e-02
boat____default__          1.711391e-02
pclass                     1.665125e-02
name__BAG__                1.597853e-02
sibsp                      1.524516e-02
home_dest____null__        1.015056e-02
boat__7                    9.817018e-03
home_dest____default__     9.534058e-03
boat__C                    9.453317e-03
cabin____null__            8.265959e-03
cabin____default__         7.290940e-03
parch                      7.138940e-03
embarked__S                6.643220e-03
boat__5                    6.206360e-03
name__BAG__iss.            6.139824e-03
embarked__C                6.040638e-03
boat__3                    5.547742e-03
name__BAG__(               5.352397e-03
name__BAG__mr              5.260205e-03
body_isnull                4.829877e-03
name__BAG__ (              4.360392e-03
boat__16                   4.245866e-03
boat__9                    4.224166e-03
boat__D                    4.194419e-03
name__BAG__ss              4.076246e-03
embarked__Q                4.047912e-03
name__BAG__mrs             3.602001e-03
body                       2.955222e-03
name__BAG__rs              2.899086e-03
name__BAG__rs.             2.869114e-03
age_isnull                 2.859144e-03
boat__14                   2.809765e-03
boat__10                   2.695927e-03
name__BAG__rs.             2.165917e-03
boat__12                   2.103210e-03
name__BAG__mrs.            2.064884e-03
home_dest__New York, NY    1.799501e-03
boat__11                   1.495248e-03
name__BAG__miss            1.054318e-03
name__BAG__mrs             9.616334e-04
boat__4                    9.420111e-04
boat__6                    8.184419e-04
home_dest__London          7.515602e-04
boat__8                    3.679950e-04
fare_isnull                2.438310e-09
dtype: float64
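With this many encoded features, it can help to aggregate the importances by source column. The sketch below does this in plain Python on a handful of values copied from the output above; it assumes the `<column>__<value>` naming seen in that output and would need adapting for column names that themselves contain a double underscore.

```python
from collections import defaultdict

# A few (feature, importance) pairs copied from the output above
importances = {
    "boat____null__": 3.839758e-01,
    "sex__female": 3.816301e-02,
    "fare": 3.419880e-02,
    "sex__male": 2.962421e-02,
    "boat__15": 2.672268e-02,
    "age": 2.643157e-02,
}

by_column = defaultdict(float)
for feature, value in importances.items():
    # Encoded features are named "<column>__<value>"; raw columns have no "__"
    column = feature.split("__", 1)[0]
    by_column[column] += value

ranked = sorted(by_column.items(), key=lambda kv: kv[1], reverse=True)
for column, value in ranked:
    print(f"{column:<6} {value:.4f}")
```

On this subset, the `boat` column clearly dominates once its encoded values are summed.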

[30]:

cv_result = cross_validation(gpipeline9,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   15.0s finished

[30]:

test_roc_auc         0.994108
test_accuracy        0.973288
test_neg_log_loss   -0.153255
dtype: float64

[31]:

gpipeline10 = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
    "svd":TruncatedSVDWrapper(n_components=10),
    "selector":FeaturesSelectorClassifier(n_components=10, random_state=123),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),
                       ("vect_word","selector","rf"),
                       ("vect_char","selector","rf"),
                       ("vect_word","svd","rf"),
                       ("vect_char","svd","rf")])

gpipeline10.fit(Xtrain,y_train)
gpipeline10.graphviz

[31]:

../_images/notebooks_GraphPipeline_49_0.svg

In this model, here is what is done:

* categorical columns are encoded (‘enc’)
* missing values are filled (‘imp’)
* bag of word and bag of char are created for the two text features
* an SVD is done on those
* a selector is called to select the most important bag of word/char features
* everything is given to a RandomForest
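The order in which these steps can run follows from the edges alone. As an illustration using only the standard library (this is not how aikit computes it internally), the edges of gpipeline10 can be turned into a predecessor map and sorted topologically:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

edges = [("sel", "enc", "imp", "rf"),
         ("vect_word", "selector", "rf"),
         ("vect_char", "selector", "rf"),
         ("vect_word", "svd", "rf"),
         ("vect_char", "svd", "rf")]

# Map each node to the set of nodes that feed it
predecessors = {}
for chain in edges:
    predecessors.setdefault(chain[0], set())
    for parent, child in zip(chain[:-1], chain[1:]):
        predecessors.setdefault(child, set()).add(parent)

# Any valid order lists a node only after all the nodes that feed it
order = list(TopologicalSorter(predecessors).static_order())
print(order)
print(sorted(predecessors["rf"]))  # ['imp', 'selector', 'svd']
```

The last line shows the three branches whose outputs are given together to the RandomForest.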

[32]:

cv_result = cross_validation(gpipeline10,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   16.9s finished

[32]:

test_roc_auc         0.994333
test_accuracy        0.975201
test_neg_log_loss   -0.143788
dtype: float64

As we saw, the GraphPipeline allows flexibility in the creation of models, and several choices can easily be tested.

Again, these are not the best possible choices for that dataset; the examples are here to illustrate the capabilities.

Better scores could be obtained by adjusting hyper-parameters and/or models/transformers and by creating some new features.
