GraphPipeline getting started
This notebook shows a few things that can be done with the package.
It doesn’t mean that these are the things you should do on this particular dataset.
Let’s load the Titanic dataset to test a few things.
[1]:
import warnings
warnings.filterwarnings('ignore') # to remove gensim warning
[2]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
[3]:
Xtrain.head(20)
[3]:
| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA |
1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB |
2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN |
3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN |
4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN |
5 | 3 | Waelens, Mr. Achille | male | 22.0 | 0 | 0 | 345767 | 9.0000 | NaN | S | NaN | NaN | Antwerp, Belgium / Stanton, OH |
6 | 3 | Reed, Mr. James George | male | NaN | 0 | 0 | 362316 | 7.2500 | NaN | S | NaN | NaN | NaN |
7 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S | 8 | NaN | Brooklyn, NY |
8 | 1 | Smith, Mrs. Lucien Philip (Mary Eloise Hughes) | female | 18.0 | 1 | 0 | 13695 | 60.0000 | C31 | S | 6 | NaN | Huntington, WV |
9 | 1 | Rowe, Mr. Alfred G | male | 33.0 | 0 | 0 | 113790 | 26.5500 | NaN | S | NaN | 109.0 | London |
10 | 3 | Meo, Mr. Alfonzo | male | 55.5 | 0 | 0 | A.5. 11206 | 8.0500 | NaN | S | NaN | 201.0 | NaN |
11 | 3 | Abbott, Mr. Rossmore Edward | male | 16.0 | 1 | 1 | C.A. 2673 | 20.2500 | NaN | S | NaN | 190.0 | East Providence, RI |
12 | 3 | Elias, Mr. Dibo | male | NaN | 0 | 0 | 2674 | 7.2250 | NaN | C | NaN | NaN | NaN |
13 | 2 | Reynaldo, Ms. Encarnacion | female | 28.0 | 0 | 0 | 230434 | 13.0000 | NaN | S | 9 | NaN | Spain |
14 | 3 | Khalil, Mr. Betros | male | NaN | 1 | 0 | 2660 | 14.4542 | NaN | C | NaN | NaN | NaN |
15 | 1 | Daniels, Miss. Sarah | female | 33.0 | 0 | 0 | 113781 | 151.5500 | NaN | S | 8 | NaN | NaN |
16 | 3 | Ford, Miss. Robina Maggie 'Ruby' | female | 9.0 | 2 | 2 | W./C. 6608 | 34.3750 | NaN | S | NaN | NaN | Rotherfield, Sussex, England Essex Co, MA |
17 | 3 | Thorneycroft, Mrs. Percival (Florence Kate White) | female | NaN | 1 | 0 | 376564 | 16.1000 | NaN | S | 10 | NaN | NaN |
18 | 3 | Lennon, Mr. Denis | male | NaN | 1 | 0 | 370371 | 15.5000 | NaN | Q | NaN | NaN | NaN |
19 | 3 | de Pelsmaeker, Mr. Alfons | male | 16.0 | 0 | 0 | 345778 | 9.5000 | NaN | S | NaN | NaN | NaN |
[4]:
y_train[0:20]
[4]:
array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
dtype=int64)
For now, let’s ignore the name and ticket columns, which should probably be handled as text.
[5]:
import pandas as pd
from aikit.transformers import TruncatedSVDWrapper, NumImputer, CountVectorizerWrapper, NumericalEncoder
from aikit.pipeline import GraphPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
Matplotlib won't work
[6]:
non_text_cols = [c for c in Xtrain.columns if c not in ("ticket","name")] # everything that is not text
non_text_cols
[6]:
['pclass',
'sex',
'age',
'sibsp',
'parch',
'fare',
'cabin',
'embarked',
'boat',
'body',
'home_dest']
[7]:
gpipeline = GraphPipeline(models = { "enc":NumericalEncoder(),
"imp":NumImputer(),
"forest":RandomForestClassifier(n_estimators=100)
},
edges = [("enc","imp","forest")])
gpipeline.fit(Xtrain.loc[:,non_text_cols],y_train)
gpipeline.graphviz
[7]:
Let’s do a cross-validation
[8]:
from aikit.cross_validation import cross_validation
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(10, random_state=123, shuffle=True)
cv_result = cross_validation(gpipeline, Xtrain.loc[:,non_text_cols], y_train,cv = cv,
scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 3.8s finished
[8]:
| | test_roc_auc | test_accuracy | test_neg_log_loss | train_roc_auc | train_accuracy | train_neg_log_loss | fit_time | score_time | n_test_samples | fold_nb |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.997332 | 0.990476 | -0.050391 | 0.999830 | 0.995758 | -0.029559 | 0.200567 | 0.076803 | 105 | 0 |
1 | 0.968369 | 0.961905 | -0.723250 | 0.999986 | 0.997879 | -0.022651 | 0.192597 | 0.066856 | 105 | 1 |
2 | 0.983232 | 0.942857 | -0.154483 | 0.999816 | 0.995758 | -0.026256 | 0.200526 | 0.069852 | 105 | 2 |
3 | 1.000000 | 1.000000 | -0.035742 | 0.999707 | 0.995758 | -0.030825 | 0.210022 | 0.070714 | 105 | 3 |
4 | 0.996380 | 0.961905 | -0.088300 | 0.999802 | 0.995758 | -0.028642 | 0.199074 | 0.064868 | 105 | 4 |
5 | 0.991806 | 0.952381 | -0.125793 | 0.999797 | 0.997879 | -0.025816 | 0.193644 | 0.070361 | 105 | 5 |
6 | 1.000000 | 1.000000 | -0.040940 | 0.999703 | 0.995758 | -0.029609 | 0.215009 | 0.065831 | 105 | 6 |
7 | 0.996380 | 0.980952 | -0.088508 | 0.999842 | 0.996819 | -0.026614 | 0.184709 | 0.077791 | 105 | 7 |
8 | 0.992838 | 0.971154 | -0.107017 | 0.999793 | 0.995763 | -0.027610 | 0.187533 | 0.063796 | 104 | 8 |
9 | 0.999613 | 0.980769 | -0.072026 | 0.999764 | 0.995763 | -0.028887 | 0.199505 | 0.062847 | 104 | 9 |
This cross-validates the complete pipeline. The differences with the sklearn function are that:
* you can score more than one metric at a time
* you retrieve both the train and the test scores
[9]:
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[9]:
test_roc_auc 0.992595
test_accuracy 0.974240
test_neg_log_loss -0.148645
dtype: float64
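To make explicit what the `.mean()` call summarises, here is a minimal sketch that recomputes the average test accuracy by hand. The per-fold values are copied from the table above; the size-weighted variant is an addition for comparison only and is not something the notebook computes.

```python
from statistics import mean

# Per-fold test_accuracy and n_test_samples, copied from the table above.
test_accuracy = [0.990476, 0.961905, 0.942857, 1.000000, 0.961905,
                 0.952381, 1.000000, 0.980952, 0.971154, 0.980769]
n_test_samples = [105, 105, 105, 105, 105, 105, 105, 105, 104, 104]

# Unweighted mean over the folds: every fold counts the same,
# which is what a column-wise mean of the result table gives.
unweighted = mean(test_accuracy)

# Mean weighted by fold size, shown only for comparison.
weighted = (sum(a * n for a, n in zip(test_accuracy, n_test_samples))
            / sum(n_test_samples))

print(round(unweighted, 6))
print(round(weighted, 6))
```

The unweighted value rounds to 0.974240, the figure reported above. Because the folds have almost the same size, the weighted value is very close to it.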
We can do the same thing, but select the columns directly in the pipeline:
[10]:
from aikit.transformers import ColumnsSelector
gpipeline2 = GraphPipeline(models = { "sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"forest":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel","enc","imp","forest")])
gpipeline2.fit(Xtrain,y_train)
gpipeline2.graphviz
[10]:
Remark: 'columns_to_use="object"' tells aikit to encode the columns of type object and to leave the rest untouched
[11]:
cv_result = cross_validation(gpipeline2,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 4.0s finished
[11]:
test_roc_auc 0.991698
test_accuracy 0.972335
test_neg_log_loss -0.178280
dtype: float64
Now let’s see what we can do with the columns we excluded. We could craft features from them, but let’s try to use them as text directly.
[12]:
text_cols = ["ticket","name"]
vect = CountVectorizerWrapper(analyzer="word", columns_to_use=text_cols)
vect.fit(Xtrain,y_train)
[12]:
CountVectorizerWrapper(analyzer='word', column_prefix='BAG',
columns_to_use=['ticket', 'name'],
desired_output_type='SparseArray',
drop_unused_columns=True, drop_used_columns=True,
max_df=1.0, max_features=None, min_df=1, ngram_range=1,
regex_match=False, tfidf=False, vocabulary=None)
Remark: aikit’s CountVectorizerWrapper can work directly on 2 (or more) columns, so there is no need for a FeatureUnion or something of the sort
[13]:
features = vect.get_feature_names()
features[0:20] + ["..."] + features[-20:]
[13]:
['ticket__BAG__10482',
'ticket__BAG__110152',
'ticket__BAG__110413',
'ticket__BAG__110465',
'ticket__BAG__110469',
'ticket__BAG__110489',
'ticket__BAG__110564',
'ticket__BAG__110813',
'ticket__BAG__111163',
'ticket__BAG__111240',
'ticket__BAG__111320',
'ticket__BAG__111361',
'ticket__BAG__111369',
'ticket__BAG__111426',
'ticket__BAG__111427',
'ticket__BAG__112050',
'ticket__BAG__112052',
'ticket__BAG__112053',
'ticket__BAG__112058',
'ticket__BAG__11206',
'...',
'name__BAG__woolf',
'name__BAG__woolner',
'name__BAG__worth',
'name__BAG__wright',
'name__BAG__wyckoff',
'name__BAG__yarred',
'name__BAG__yasbeck',
'name__BAG__ylio',
'name__BAG__yoto',
'name__BAG__young',
'name__BAG__youseff',
'name__BAG__yousif',
'name__BAG__youssef',
'name__BAG__yousseff',
'name__BAG__yrois',
'name__BAG__zabour',
'name__BAG__zakarian',
'name__BAG__zebley',
'name__BAG__zenni',
'name__BAG__zillah']
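The feature names above follow a `<column>__BAG__<token>` pattern, with one vocabulary per text column. The sketch below shows how such names could be built in plain Python. It is an illustration, not aikit's implementation: the helper `bag_of_words_features` is hypothetical, and the tokenisation (lowercasing, then scikit-learn's default `CountVectorizer` token pattern) is an assumption about what the wrapper does.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer:
# runs of at least two word characters. Assumed here, not confirmed for aikit.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words_features(rows, columns, prefix="BAG"):
    """Hypothetical helper: names of the form '<column>__<prefix>__<token>'."""
    names = []
    for column in columns:
        tokens = set()
        for row in rows:
            value = row.get(column)
            if value is None:
                continue  # missing values contribute no token
            tokens.update(TOKEN_PATTERN.findall(str(value).lower()))
        # One sorted vocabulary per column, columns kept in the given order.
        names.extend(f"{column}__{prefix}__{token}" for token in sorted(tokens))
    return names


rows = [
    {"ticket": "17463", "name": "McCarthy, Mr. Timothy J"},
    {"ticket": "PC 17477", "name": "Sagesser, Mlle. Emma"},
]
feature_names = bag_of_words_features(rows, ["ticket", "name"])
print(feature_names)
```

Each column gets its own prefix, so the same token appearing in `ticket` and in `name` would give two distinct features.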
The vectorizer directly encodes the 2 text columns
[14]:
xx_res = vect.transform(Xtrain)
xx_res
[14]:
<1048x2440 sparse matrix of type '<class 'numpy.int32'>'
with 5414 stored elements in COOrdinate format>
Again let’s create a GraphPipeline to cross-validate
[15]:
gpipeline3 = GraphPipeline(models = {"vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"logit":LogisticRegression(solver="liblinear", random_state=123)},
edges=[("vect","logit")])
gpipeline3.fit(Xtrain,y_train)
gpipeline3.graphviz
[15]:
[16]:
cv_result = cross_validation(gpipeline3, Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.9s finished
[16]:
test_roc_auc 0.850918
test_accuracy 0.819679
test_neg_log_loss -0.451681
dtype: float64
We can also try a “bag of char”
[17]:
gpipeline4 = GraphPipeline(models = {
"vect": CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
"logit": LogisticRegression(solver="liblinear", random_state=123) }, edges=[("vect","logit")])
gpipeline4.fit(Xtrain,y_train)
gpipeline4.graphviz
[17]:
[18]:
cv_result = cross_validation(gpipeline4,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 5.9s finished
[18]:
test_roc_auc 0.849773
test_accuracy 0.813956
test_neg_log_loss -0.559254
dtype: float64
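To see what a “bag of char” with `ngram_range=(1,4)` counts, here is a small sketch that enumerates the character n-grams of a single string. The helper `char_ngrams` is hypothetical and only approximates a character analyzer: it lowercases the text and does no whitespace normalisation.

```python
def char_ngrams(text, ngram_range=(1, 4)):
    """Return every character n-gram of `text`, for n in ngram_range (inclusive)."""
    low, high = ngram_range
    text = text.lower()
    grams = []
    for n in range(low, high + 1):
        # Slide a window of width n over the text.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


grams = char_ngrams("Mrs.")
print(grams)
# ['m', 'r', 's', '.', 'mr', 'rs', 's.', 'mrs', 'rs.', 'mrs.']
```

A four-character string already yields ten n-grams, which is why the bag of char produces many more columns than the bag of word and takes longer to cross-validate.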
Now let’s use all the columns
[19]:
gpipeline5 = GraphPipeline(models = {
"sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"rf":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel","enc","imp","rf"),("vect","rf")])
gpipeline5.fit(Xtrain,y_train)
gpipeline5.graphviz
[19]:
This model uses both sets of columns:
* the bag of word features
* the categorical/numerical features
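The `edges` argument above can be read as a list of chains, each chain linking consecutive nodes. The sketch below expands such chains into pairs and derives one valid fitting order with the standard library. It illustrates the notation only and makes no claim about how aikit stores or traverses the graph; `expand_edges` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def expand_edges(edges):
    """Turn chains like ("a", "b", "c") into pairs ("a", "b"), ("b", "c")."""
    pairs = []
    for chain in edges:
        pairs.extend(zip(chain[:-1], chain[1:]))
    return pairs


edges = [("sel", "enc", "imp", "rf"), ("vect", "rf")]
pairs = expand_edges(edges)

# Map each node to the set of nodes that feed into it.
predecessors = {}
for source, target in pairs:
    predecessors.setdefault(source, set())
    predecessors.setdefault(target, set()).add(source)

# One valid order in which every node comes after all of its inputs.
order = list(TopologicalSorter(predecessors).static_order())

print(sorted(predecessors["rf"]))  # ['imp', 'vect']
print(order)
```

The final node `rf` has two inputs, `imp` and `vect`, which is how the forest ends up seeing both the encoded/imputed columns and the bag of word features.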
[20]:
cv_result = cross_validation(gpipeline5,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 11.0s finished
[20]:
test_roc_auc 0.992779
test_accuracy 0.968507
test_neg_log_loss -0.173236
dtype: float64
We can also use both Bag of Char and Bag of Word
[21]:
gpipeline6 = GraphPipeline(models = {
"sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
"rf":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel","enc","imp","rf"),("vect_char","rf"),("vect_word","rf")])
gpipeline6.fit(Xtrain,y_train)
gpipeline6.graphviz
[21]:
[22]:
cv_result = cross_validation(gpipeline6,Xtrain,y_train,cv = cv,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 13.9s finished
[22]:
test_roc_auc 0.947360
test_accuracy 0.843516
test_neg_log_loss -0.325666
dtype: float64
Maybe we can try an SVD to reduce the dimension of the bag of char/word features
[23]:
gpipeline7 = GraphPipeline(models = {
"sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
"svd":TruncatedSVDWrapper(n_components=100, random_state=123),
"rf":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel", "enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf")])
gpipeline7.fit(Xtrain,y_train)
gpipeline7.graphviz
[23]:
[24]:
cv_result = cross_validation(gpipeline7,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 23.4s finished
[24]:
test_roc_auc 0.992953
test_accuracy 0.972326
test_neg_log_loss -0.167037
dtype: float64
We can even add ‘SVD’ columns AND bag of word/char columns
[25]:
gpipeline8 = GraphPipeline(models = {
"sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
"svd":TruncatedSVDWrapper(n_components=100, random_state=123),
"rf":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel","enc","imp","rf"),("vect_word","svd","rf"),("vect_char","svd","rf"),("vect_word","rf"),("vect_char","rf")])
gpipeline8.graphviz
[25]:
[26]:
cv_result = cross_validation(gpipeline8,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 22.1s finished
[26]:
test_roc_auc 0.941329
test_accuracy 0.834011
test_neg_log_loss -0.334545
dtype: float64
Instead of an SVD, we can add a layer that filters columns…
[27]:
from aikit.transformers import FeaturesSelectorClassifier
[28]:
gpipeline9 = GraphPipeline(models = {
"sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
"selector":FeaturesSelectorClassifier(n_components=20),
"rf":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel","enc","imp","rf"),("vect_word","selector","rf"),("vect_char","selector","rf")])
gpipeline9.graphviz
[28]:
Retrieve feature importance
Let’s use that complicated example to show how to retrieve the feature importance
[29]:
gpipeline9.fit(Xtrain, y_train)
df_imp = pd.Series(gpipeline9.models["rf"].feature_importances_,
index = gpipeline9.get_input_features_at_node("rf"))
df_imp.sort_values(ascending=False,inplace=True)
df_imp
[29]:
boat____null__ 3.839758e-01
sex__female 3.816301e-02
name__BAG__mr 3.715979e-02
name__BAG__mr. 3.636483e-02
fare 3.419880e-02
name__BAG__mr. 3.133609e-02
sex__male 2.962421e-02
name__BAG__r. 2.910019e-02
name__BAG__s. 2.776609e-02
boat__15 2.672268e-02
age 2.643157e-02
name__BAG__s. 2.500470e-02
name__BAG__ mr. 2.249752e-02
boat__13 1.863079e-02
boat____default__ 1.711391e-02
pclass 1.665125e-02
name__BAG__ 1.597853e-02
sibsp 1.524516e-02
home_dest____null__ 1.015056e-02
boat__7 9.817018e-03
home_dest____default__ 9.534058e-03
boat__C 9.453317e-03
cabin____null__ 8.265959e-03
cabin____default__ 7.290940e-03
parch 7.138940e-03
embarked__S 6.643220e-03
boat__5 6.206360e-03
name__BAG__iss. 6.139824e-03
embarked__C 6.040638e-03
boat__3 5.547742e-03
name__BAG__( 5.352397e-03
name__BAG__mr 5.260205e-03
body_isnull 4.829877e-03
name__BAG__ ( 4.360392e-03
boat__16 4.245866e-03
boat__9 4.224166e-03
boat__D 4.194419e-03
name__BAG__ss 4.076246e-03
embarked__Q 4.047912e-03
name__BAG__mrs 3.602001e-03
body 2.955222e-03
name__BAG__rs 2.899086e-03
name__BAG__rs. 2.869114e-03
age_isnull 2.859144e-03
boat__14 2.809765e-03
boat__10 2.695927e-03
name__BAG__rs. 2.165917e-03
boat__12 2.103210e-03
name__BAG__mrs. 2.064884e-03
home_dest__New York, NY 1.799501e-03
boat__11 1.495248e-03
name__BAG__miss 1.054318e-03
name__BAG__mrs 9.616334e-04
boat__4 9.420111e-04
boat__6 8.184419e-04
home_dest__London 7.515602e-04
boat__8 3.679950e-04
fare_isnull 2.438310e-09
dtype: float64
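Since one original column can produce many encoded features, it can help to sum the importances per source column. The sketch below does this for a few values copied from the output above. It assumes that the text before the first `__` in a feature name is the source column, which matches the names shown here but is not a documented guarantee.

```python
from collections import defaultdict

# A few (feature, importance) pairs copied from the output above.
importances = {
    "boat____null__": 3.839758e-01,
    "sex__female": 3.816301e-02,
    "fare": 3.419880e-02,
    "sex__male": 2.962421e-02,
    "boat__15": 2.672268e-02,
    "age": 2.643157e-02,
    "boat__13": 1.863079e-02,
    "pclass": 1.665125e-02,
}

# Assumption: everything before the first "__" names the source column.
by_column = defaultdict(float)
for feature, value in importances.items():
    by_column[feature.split("__", 1)[0]] += value

# Rank the source columns by their summed importance.
ranked = sorted(by_column.items(), key=lambda item: item[1], reverse=True)
for column, total in ranked:
    print(f"{column:<8}{total:.4f}")
```

On this subset, `boat` dominates, which is consistent with `boat____null__` being by far the largest single importance.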
[30]:
cv_result = cross_validation(gpipeline9,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 15.0s finished
[30]:
test_roc_auc 0.994108
test_accuracy 0.973288
test_neg_log_loss -0.153255
dtype: float64
[31]:
gpipeline10 = GraphPipeline(models = {
"sel":ColumnsSelector(columns_to_use=non_text_cols),
"enc":NumericalEncoder(columns_to_use="object"),
"imp":NumImputer(),
"vect_word":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
"vect_char":CountVectorizerWrapper(analyzer="char",ngram_range=(1,4),columns_to_use=text_cols),
"svd":TruncatedSVDWrapper(n_components=10),
"selector":FeaturesSelectorClassifier(n_components=10, random_state=123),
"rf":RandomForestClassifier(n_estimators=100, random_state=123)
},
edges = [("sel","enc","imp","rf"),
("vect_word","selector","rf"),
("vect_char","selector","rf"),
("vect_word","svd","rf"),
("vect_char","svd","rf")])
gpipeline10.fit(Xtrain,y_train)
gpipeline10.graphviz
[31]:
Here is what this model does:
* categorical columns are encoded (‘enc’)
* missing values are filled (‘imp’)
* bag of word and bag of char features are created for the two text columns
* an SVD is applied to those
* a selector is called to keep the most important bag of word/char features
* everything is given to a RandomForest
[32]:
cv_result = cross_validation(gpipeline10,Xtrain,y_train,cv = 10,scoring=["roc_auc","accuracy","neg_log_loss"])
cv_result.loc[:,("test_roc_auc","test_accuracy","test_neg_log_loss")].mean()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
cv 0 started
cv 1 started
cv 2 started
cv 3 started
cv 4 started
cv 5 started
cv 6 started
cv 7 started
cv 8 started
cv 9 started
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 16.9s finished
[32]:
test_roc_auc 0.994333
test_accuracy 0.975201
test_neg_log_loss -0.143788
dtype: float64
As we saw, the GraphPipeline allows flexibility in the creation of models, and several choices can easily be tested.
Again, these are not the best possible choices for this dataset; the examples are here to illustrate the capabilities.
Better scores could be obtained by adjusting hyper-parameters and/or models/transformers and by creating new features.