{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### GraphPipeline getting started ###\n",
    "This notebook shows a few of the things that can be done with the package.\n",
"\n",
    "It doesn't mean that these are the things you should do on that particular dataset.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Let's load the Titanic dataset to test a few things."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore') # to remove gensim warning"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from aikit.datasets.datasets import load_dataset, DatasetEnum\n",
    "Xtrain, y_train, _, _, _ = load_dataset(DatasetEnum.titanic)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" pclass name sex age \\\n",
"0 1 McCarthy, Mr. Timothy J male 54.0 \n",
"1 1 Fortune, Mr. Mark male 64.0 \n",
"2 1 Sagesser, Mlle. Emma female 24.0 \n",
"3 3 Panula, Master. Urho Abraham male 2.0 \n",
"4 1 Maioni, Miss. Roberta female 16.0 \n",
"5 3 Waelens, Mr. Achille male 22.0 \n",
"6 3 Reed, Mr. James George male NaN \n",
"7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 \n",
"8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 \n",
"9 1 Rowe, Mr. Alfred G male 33.0 \n",
"10 3 Meo, Mr. Alfonzo male 55.5 \n",
"11 3 Abbott, Mr. Rossmore Edward male 16.0 \n",
"12 3 Elias, Mr. Dibo male NaN \n",
"13 2 Reynaldo, Ms. Encarnacion female 28.0 \n",
"14 3 Khalil, Mr. Betros male NaN \n",
"15 1 Daniels, Miss. Sarah female 33.0 \n",
"16 3 Ford, Miss. Robina Maggie 'Ruby' female 9.0 \n",
"17 3 Thorneycroft, Mrs. Percival (Florence Kate White) female NaN \n",
"18 3 Lennon, Mr. Denis male NaN \n",
"19 3 de Pelsmaeker, Mr. Alfons male 16.0 \n",
"\n",
" sibsp parch ticket fare cabin embarked boat body \\\n",
"0 0 0 17463 51.8625 E46 S NaN 175.0 \n",
"1 1 4 19950 263.0000 C23 C25 C27 S NaN NaN \n",
"2 0 0 PC 17477 69.3000 B35 C 9 NaN \n",
"3 4 1 3101295 39.6875 NaN S NaN NaN \n",
"4 0 0 110152 86.5000 B79 S 8 NaN \n",
"5 0 0 345767 9.0000 NaN S NaN NaN \n",
"6 0 0 362316 7.2500 NaN S NaN NaN \n",
"7 0 0 17466 25.9292 D17 S 8 NaN \n",
"8 1 0 13695 60.0000 C31 S 6 NaN \n",
"9 0 0 113790 26.5500 NaN S NaN 109.0 \n",
"10 0 0 A.5. 11206 8.0500 NaN S NaN 201.0 \n",
"11 1 1 C.A. 2673 20.2500 NaN S NaN 190.0 \n",
"12 0 0 2674 7.2250 NaN C NaN NaN \n",
"13 0 0 230434 13.0000 NaN S 9 NaN \n",
"14 1 0 2660 14.4542 NaN C NaN NaN \n",
"15 0 0 113781 151.5500 NaN S 8 NaN \n",
"16 2 2 W./C. 6608 34.3750 NaN S NaN NaN \n",
"17 1 0 376564 16.1000 NaN S 10 NaN \n",
"18 1 0 370371 15.5000 NaN Q NaN NaN \n",
"19 0 0 345778 9.5000 NaN S NaN NaN \n",
"\n",
" home_dest \n",
"0 Dorchester, MA \n",
"1 Winnipeg, MB \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"5 Antwerp, Belgium / Stanton, OH \n",
"6 NaN \n",
"7 Brooklyn, NY \n",
"8 Huntington, WV \n",
"9 London \n",
"10 NaN \n",
"11 East Providence, RI \n",
"12 NaN \n",
"13 Spain \n",
"14 NaN \n",
"15 NaN \n",
"16 Rotherfield, Sussex, England Essex Co, MA \n",
"17 NaN \n",
"18 NaN \n",
"19 NaN "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Xtrain.head(20)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],\n",
" dtype=int64)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train[0:20]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "For now, let's ignore the `name` and `ticket` columns, which should probably be handled as text"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Matplotlib won't work\n"
]
}
],
"source": [
"import pandas as pd\n",
"from aikit.transformers import TruncatedSVDWrapper, NumImputer, CountVectorizerWrapper, NumericalEncoder\n",
"from aikit.pipeline import GraphPipeline\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['pclass',\n",
" 'sex',\n",
" 'age',\n",
" 'sibsp',\n",
" 'parch',\n",
" 'fare',\n",
" 'cabin',\n",
" 'embarked',\n",
" 'boat',\n",
" 'body',\n",
" 'home_dest']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"non_text_cols = [c for c in Xtrain.columns if c not in (\"ticket\",\"name\")] # everything that is not text\n",
"non_text_cols"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline = GraphPipeline(models = { \"enc\":NumericalEncoder(),\n",
" \"imp\":NumImputer(),\n",
" \"forest\":RandomForestClassifier(n_estimators=100)\n",
" },\n",
" edges = [(\"enc\",\"imp\",\"forest\")])\n",
"\n",
"gpipeline.fit(Xtrain.loc[:,non_text_cols],y_train)\n",
"gpipeline.graphviz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Let's do a cross-validation"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 3.8s finished\n"
]
},
{
"data": {
"text/plain": [
" test_roc_auc test_accuracy test_neg_log_loss train_roc_auc \\\n",
"0 0.997332 0.990476 -0.050391 0.999830 \n",
"1 0.968369 0.961905 -0.723250 0.999986 \n",
"2 0.983232 0.942857 -0.154483 0.999816 \n",
"3 1.000000 1.000000 -0.035742 0.999707 \n",
"4 0.996380 0.961905 -0.088300 0.999802 \n",
"5 0.991806 0.952381 -0.125793 0.999797 \n",
"6 1.000000 1.000000 -0.040940 0.999703 \n",
"7 0.996380 0.980952 -0.088508 0.999842 \n",
"8 0.992838 0.971154 -0.107017 0.999793 \n",
"9 0.999613 0.980769 -0.072026 0.999764 \n",
"\n",
" train_accuracy train_neg_log_loss fit_time score_time n_test_samples \\\n",
"0 0.995758 -0.029559 0.200567 0.076803 105 \n",
"1 0.997879 -0.022651 0.192597 0.066856 105 \n",
"2 0.995758 -0.026256 0.200526 0.069852 105 \n",
"3 0.995758 -0.030825 0.210022 0.070714 105 \n",
"4 0.995758 -0.028642 0.199074 0.064868 105 \n",
"5 0.997879 -0.025816 0.193644 0.070361 105 \n",
"6 0.995758 -0.029609 0.215009 0.065831 105 \n",
"7 0.996819 -0.026614 0.184709 0.077791 105 \n",
"8 0.995763 -0.027610 0.187533 0.063796 104 \n",
"9 0.995763 -0.028887 0.199505 0.062847 104 \n",
"\n",
" fold_nb \n",
"0 0 \n",
"1 1 \n",
"2 2 \n",
"3 3 \n",
"4 4 \n",
"5 5 \n",
"6 6 \n",
"7 7 \n",
"8 8 \n",
"9 9 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from aikit.cross_validation import cross_validation\n",
"from sklearn.model_selection import StratifiedKFold\n",
"cv = StratifiedKFold(10, random_state=123, shuffle=True)\n",
"\n",
"cv_result = cross_validation(gpipeline, Xtrain.loc[:,non_text_cols], y_train,cv = cv,\n",
" scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "This cross-validates the complete pipeline. The differences with the sklearn function are that:\n",
    "* you can score more than one metric at a time\n",
    "* you retrieve both the train and test scores"
]
},
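  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here is a sketch of how plain sklearn gets close to this with `cross_validate` (multi-metric scoring plus `return_train_score`); it is not run here:\n",
    "\n",
    "```python\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "scores = cross_validate(gpipeline, Xtrain.loc[:, non_text_cols], y_train, cv=cv,\n",
    "                        scoring=[\"roc_auc\", \"accuracy\", \"neg_log_loss\"],\n",
    "                        return_train_score=True)  # returns a dict of arrays, one entry per metric\n",
    "```\n",
    "\n",
    "aikit's `cross_validation` additionally returns a tidy DataFrame with one row per fold, as displayed above."
   ]
  },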
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"test_roc_auc 0.992595\n",
"test_accuracy 0.974240\n",
"test_neg_log_loss -0.148645\n",
"dtype: float64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "We can do the same, but select the columns directly in the pipeline:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from aikit.transformers import ColumnsSelector\n",
"gpipeline2 = GraphPipeline(models = { \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
" \"forest\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\",\"enc\",\"imp\",\"forest\")])\n",
"\n",
"gpipeline2.fit(Xtrain,y_train)\n",
"gpipeline2.graphviz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "##### Remark: `columns_to_use=\"object\"` tells aikit to encode the columns of type object and keep everything else untouched"
]
},
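  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concretely, object columns such as `sex`, `cabin` or `embarked` get encoded, while numeric columns such as `age` or `fare` pass through unchanged. A quick sketch to check this (not run here):\n",
    "\n",
    "```python\n",
    "enc = NumericalEncoder(columns_to_use=\"object\")\n",
    "xx_enc = enc.fit_transform(Xtrain.loc[:, non_text_cols], y_train)\n",
    "xx_enc.columns  # encoded object columns + untouched numeric columns\n",
    "```\n"
   ]
  },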
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 4.0s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.991698\n",
"test_accuracy 0.972335\n",
"test_neg_log_loss -0.178280\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline2,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see what we can do with the columns we excluded. We could craft features from them, but let's try to use them as text directly."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizerWrapper(analyzer='word', column_prefix='BAG',\n",
" columns_to_use=['ticket', 'name'],\n",
" desired_output_type='SparseArray',\n",
" drop_unused_columns=True, drop_used_columns=True,\n",
" max_df=1.0, max_features=None, min_df=1, ngram_range=1,\n",
" regex_match=False, tfidf=False, vocabulary=None)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text_cols = [\"ticket\",\"name\"]\n",
"vect = CountVectorizerWrapper(analyzer=\"word\", columns_to_use=text_cols)\n",
"vect.fit(Xtrain,y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "##### Remark: aikit's CountVectorizerWrapper can directly work on 2 (or more) columns, so there is no need for a FeatureUnion or anything of the sort"
]
},
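  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the plain-sklearn equivalent would need something like a `ColumnTransformer` with one `CountVectorizer` per text column (a sketch, not run here):\n",
    "\n",
    "```python\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "# one vectorizer per column, outputs concatenated side by side\n",
    "vect_sk = ColumnTransformer([(\"ticket\", CountVectorizer(analyzer=\"word\"), \"ticket\"),\n",
    "                             (\"name\", CountVectorizer(analyzer=\"word\"), \"name\")])\n",
    "```\n"
   ]
  },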
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ticket__BAG__10482',\n",
" 'ticket__BAG__110152',\n",
" 'ticket__BAG__110413',\n",
" 'ticket__BAG__110465',\n",
" 'ticket__BAG__110469',\n",
" 'ticket__BAG__110489',\n",
" 'ticket__BAG__110564',\n",
" 'ticket__BAG__110813',\n",
" 'ticket__BAG__111163',\n",
" 'ticket__BAG__111240',\n",
" 'ticket__BAG__111320',\n",
" 'ticket__BAG__111361',\n",
" 'ticket__BAG__111369',\n",
" 'ticket__BAG__111426',\n",
" 'ticket__BAG__111427',\n",
" 'ticket__BAG__112050',\n",
" 'ticket__BAG__112052',\n",
" 'ticket__BAG__112053',\n",
" 'ticket__BAG__112058',\n",
" 'ticket__BAG__11206',\n",
" '...',\n",
" 'name__BAG__woolf',\n",
" 'name__BAG__woolner',\n",
" 'name__BAG__worth',\n",
" 'name__BAG__wright',\n",
" 'name__BAG__wyckoff',\n",
" 'name__BAG__yarred',\n",
" 'name__BAG__yasbeck',\n",
" 'name__BAG__ylio',\n",
" 'name__BAG__yoto',\n",
" 'name__BAG__young',\n",
" 'name__BAG__youseff',\n",
" 'name__BAG__yousif',\n",
" 'name__BAG__youssef',\n",
" 'name__BAG__yousseff',\n",
" 'name__BAG__yrois',\n",
" 'name__BAG__zabour',\n",
" 'name__BAG__zakarian',\n",
" 'name__BAG__zebley',\n",
" 'name__BAG__zenni',\n",
" 'name__BAG__zillah']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"features = vect.get_feature_names()\n",
"features[0:20] + [\"...\"] + features[-20:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "The vectorizer directly transforms the 2 columns into a single sparse matrix"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<1048x2440 sparse matrix of type ''\n",
"\twith 5414 stored elements in COOrdinate format>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"xx_res = vect.transform(Xtrain)\n",
"xx_res"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Again, let's create a GraphPipeline to cross-validate"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline3 = GraphPipeline(models = {\"vect\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
" \"logit\":LogisticRegression(solver=\"liblinear\", random_state=123)},\n",
" edges=[(\"vect\",\"logit\")])\n",
"gpipeline3.fit(Xtrain,y_train)\n",
"gpipeline3.graphviz"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n",
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n",
"cv 3 started\n",
"\n",
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n",
"cv 6 started\n",
"\n",
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n",
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.9s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.850918\n",
"test_accuracy 0.819679\n",
"test_neg_log_loss -0.451681\n",
"dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline3, Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "We can also try a \"bag of characters\" approach"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline4 = GraphPipeline(models = {\n",
" \"vect\": CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n",
" \"logit\": LogisticRegression(solver=\"liblinear\", random_state=123) }, edges=[(\"vect\",\"logit\")])\n",
"gpipeline4.fit(Xtrain,y_train)\n",
"gpipeline4.graphviz"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 5.9s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.849773\n",
"test_accuracy 0.813956\n",
"test_neg_log_loss -0.559254\n",
"dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline4,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Now let's use all the columns"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline5 = GraphPipeline(models = {\n",
" \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
" \"vect\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
" \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect\",\"rf\")])\n",
"gpipeline5.fit(Xtrain,y_train)\n",
"gpipeline5.graphviz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "This model uses both sets of columns:\n",
    "* bag-of-words features\n",
    "* categorical/numerical features"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 11.0s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.992779\n",
"test_accuracy 0.968507\n",
"test_neg_log_loss -0.173236\n",
"dtype: float64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline5,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "We can also use both bag-of-characters and bag-of-words features"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline6 = GraphPipeline(models = {\n",
" \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
    "                    \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
    "                    \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n",
" \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect_char\",\"rf\"),(\"vect_word\",\"rf\")])\n",
"gpipeline6.fit(Xtrain,y_train)\n",
"gpipeline6.graphviz"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 13.9s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.947360\n",
"test_accuracy 0.843516\n",
"test_neg_log_loss -0.325666\n",
"dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline6,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also try an SVD to reduce the dimension of the bag-of-char/word features"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"%3 \r\n",
" \r\n",
"\r\n",
"enc \r\n",
"\r\n",
"enc \r\n",
" \r\n",
"\r\n",
"imp \r\n",
"\r\n",
"imp \r\n",
" \r\n",
"\r\n",
"enc->imp \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"rf \r\n",
"\r\n",
"rf \r\n",
" \r\n",
"\r\n",
"imp->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"sel \r\n",
"\r\n",
"sel \r\n",
" \r\n",
"\r\n",
"sel->enc \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"svd \r\n",
"\r\n",
"svd \r\n",
" \r\n",
"\r\n",
"svd->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_word \r\n",
"\r\n",
"vect_word \r\n",
" \r\n",
"\r\n",
"vect_word->svd \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_char \r\n",
"\r\n",
"vect_char \r\n",
" \r\n",
"\r\n",
"vect_char->svd \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n"
],
"text/plain": [
""
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline7 = GraphPipeline(models = {\n",
" \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
" \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
" \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n",
" \"svd\":TruncatedSVDWrapper(n_components=100, random_state=123),\n",
" \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\", \"enc\",\"imp\",\"rf\"),(\"vect_word\",\"svd\",\"rf\"),(\"vect_char\",\"svd\",\"rf\")])\n",
"gpipeline7.fit(Xtrain,y_train)\n",
"gpipeline7.graphviz"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 23.4s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.992953\n",
"test_accuracy 0.972326\n",
"test_neg_log_loss -0.167037\n",
"dtype: float64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline7,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can even feed the model both the 'SVD' columns AND the raw bag-of-word/char columns"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"%3 \r\n",
" \r\n",
"\r\n",
"enc \r\n",
"\r\n",
"enc \r\n",
" \r\n",
"\r\n",
"imp \r\n",
"\r\n",
"imp \r\n",
" \r\n",
"\r\n",
"enc->imp \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"rf \r\n",
"\r\n",
"rf \r\n",
" \r\n",
"\r\n",
"imp->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"sel \r\n",
"\r\n",
"sel \r\n",
" \r\n",
"\r\n",
"sel->enc \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"svd \r\n",
"\r\n",
"svd \r\n",
" \r\n",
"\r\n",
"svd->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_word \r\n",
"\r\n",
"vect_word \r\n",
" \r\n",
"\r\n",
"vect_word->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_word->svd \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_char \r\n",
"\r\n",
"vect_char \r\n",
" \r\n",
"\r\n",
"vect_char->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_char->svd \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n"
],
"text/plain": [
""
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline8 = GraphPipeline(models = {\n",
" \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
" \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
" \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n",
" \"svd\":TruncatedSVDWrapper(n_components=100, random_state=123),\n",
" \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect_word\",\"svd\",\"rf\"),(\"vect_char\",\"svd\",\"rf\"),(\"vect_word\",\"rf\"),(\"vect_char\",\"rf\")])\n",
"\n",
"gpipeline8.graphviz"
]
},
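{
"cell_type": "markdown",
"metadata": {},
"source": [
"When a node has several parents, like 'rf' here, the GraphPipeline concatenates the parents' outputs column-wise before passing them on. Conceptually (toy shapes, plain numpy, not aikit's internal code):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# toy outputs for three parents of a node\n",
"imp_out  = np.zeros((5, 3))   # imputed numeric features\n",
"svd_out  = np.zeros((5, 2))   # SVD components\n",
"vect_out = np.zeros((5, 4))   # raw bag-of-word counts\n",
"\n",
"rf_input = np.hstack([imp_out, svd_out, vect_out])\n",
"rf_input.shape  # → (5, 9)\n",
"```"
]
},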
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 22.1s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.941329\n",
"test_accuracy 0.834011\n",
"test_neg_log_loss -0.334545\n",
"dtype: float64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline8,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of an 'SVD', we can add a layer that filters columns, keeping only the most relevant ones... "
]
},
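{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind such a selector is to score each feature against the target and keep only the top ones. A self-contained sketch of the principle, using absolute correlation as the score (an illustration only, not FeaturesSelectorClassifier's actual criterion):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def select_top_k(X, y, k):\n",
"    \"\"\"Indices of the k columns of X most correlated (in absolute value) with y.\"\"\"\n",
"    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])\n",
"    return np.sort(np.argsort(scores)[::-1][:k])\n",
"\n",
"rng = np.random.RandomState(0)\n",
"y = rng.rand(100)\n",
"X = np.column_stack([y + 0.1 * rng.rand(100),   # informative\n",
"                     rng.rand(100),              # pure noise\n",
"                     -y + 0.1 * rng.rand(100)])  # informative\n",
"select_top_k(X, y, 2)  # → array([0, 2])\n",
"```"
]
},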
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"from aikit.transformers import FeaturesSelectorClassifier"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"%3 \r\n",
" \r\n",
"\r\n",
"enc \r\n",
"\r\n",
"enc \r\n",
" \r\n",
"\r\n",
"imp \r\n",
"\r\n",
"imp \r\n",
" \r\n",
"\r\n",
"enc->imp \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"rf \r\n",
"\r\n",
"rf \r\n",
" \r\n",
"\r\n",
"imp->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"sel \r\n",
"\r\n",
"sel \r\n",
" \r\n",
"\r\n",
"sel->enc \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"selector \r\n",
"\r\n",
"selector \r\n",
" \r\n",
"\r\n",
"selector->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_word \r\n",
"\r\n",
"vect_word \r\n",
" \r\n",
"\r\n",
"vect_word->selector \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_char \r\n",
"\r\n",
"vect_char \r\n",
" \r\n",
"\r\n",
"vect_char->selector \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n"
],
"text/plain": [
""
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline9 = GraphPipeline(models = {\n",
" \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
" \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
" \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n",
" \"selector\":FeaturesSelectorClassifier(n_components=20),\n",
" \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect_word\",\"selector\",\"rf\"),(\"vect_char\",\"selector\",\"rf\")])\n",
"\n",
"gpipeline9.graphviz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Retrieve feature importances\n",
"Let's use that more complex example to show how to retrieve the feature importances"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"boat____null__ 3.839758e-01\n",
"sex__female 3.816301e-02\n",
"name__BAG__mr 3.715979e-02\n",
"name__BAG__mr. 3.636483e-02\n",
"fare 3.419880e-02\n",
"name__BAG__mr. 3.133609e-02\n",
"sex__male 2.962421e-02\n",
"name__BAG__r. 2.910019e-02\n",
"name__BAG__s. 2.776609e-02\n",
"boat__15 2.672268e-02\n",
"age 2.643157e-02\n",
"name__BAG__s. 2.500470e-02\n",
"name__BAG__ mr. 2.249752e-02\n",
"boat__13 1.863079e-02\n",
"boat____default__ 1.711391e-02\n",
"pclass 1.665125e-02\n",
"name__BAG__ 1.597853e-02\n",
"sibsp 1.524516e-02\n",
"home_dest____null__ 1.015056e-02\n",
"boat__7 9.817018e-03\n",
"home_dest____default__ 9.534058e-03\n",
"boat__C 9.453317e-03\n",
"cabin____null__ 8.265959e-03\n",
"cabin____default__ 7.290940e-03\n",
"parch 7.138940e-03\n",
"embarked__S 6.643220e-03\n",
"boat__5 6.206360e-03\n",
"name__BAG__iss. 6.139824e-03\n",
"embarked__C 6.040638e-03\n",
"boat__3 5.547742e-03\n",
"name__BAG__( 5.352397e-03\n",
"name__BAG__mr 5.260205e-03\n",
"body_isnull 4.829877e-03\n",
"name__BAG__ ( 4.360392e-03\n",
"boat__16 4.245866e-03\n",
"boat__9 4.224166e-03\n",
"boat__D 4.194419e-03\n",
"name__BAG__ss 4.076246e-03\n",
"embarked__Q 4.047912e-03\n",
"name__BAG__mrs 3.602001e-03\n",
"body 2.955222e-03\n",
"name__BAG__rs 2.899086e-03\n",
"name__BAG__rs. 2.869114e-03\n",
"age_isnull 2.859144e-03\n",
"boat__14 2.809765e-03\n",
"boat__10 2.695927e-03\n",
"name__BAG__rs. 2.165917e-03\n",
"boat__12 2.103210e-03\n",
"name__BAG__mrs. 2.064884e-03\n",
"home_dest__New York, NY 1.799501e-03\n",
"boat__11 1.495248e-03\n",
"name__BAG__miss 1.054318e-03\n",
"name__BAG__mrs 9.616334e-04\n",
"boat__4 9.420111e-04\n",
"boat__6 8.184419e-04\n",
"home_dest__London 7.515602e-04\n",
"boat__8 3.679950e-04\n",
"fare_isnull 2.438310e-09\n",
"dtype: float64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline9.fit(Xtrain, y_train)\n",
"\n",
"df_imp = pd.Series(gpipeline9.models[\"rf\"].feature_importances_,\n",
" index = gpipeline9.get_input_features_at_node(\"rf\"))\n",
"df_imp.sort_values(ascending=False,inplace=True)\n",
"df_imp"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 15.0s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.994108\n",
"test_accuracy 0.973288\n",
"test_neg_log_loss -0.153255\n",
"dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline9,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"%3 \r\n",
" \r\n",
"\r\n",
"enc \r\n",
"\r\n",
"enc \r\n",
" \r\n",
"\r\n",
"imp \r\n",
"\r\n",
"imp \r\n",
" \r\n",
"\r\n",
"enc->imp \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"rf \r\n",
"\r\n",
"rf \r\n",
" \r\n",
"\r\n",
"imp->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"sel \r\n",
"\r\n",
"sel \r\n",
" \r\n",
"\r\n",
"sel->enc \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"selector \r\n",
"\r\n",
"selector \r\n",
" \r\n",
"\r\n",
"selector->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_word \r\n",
"\r\n",
"vect_word \r\n",
" \r\n",
"\r\n",
"vect_word->selector \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"svd \r\n",
"\r\n",
"svd \r\n",
" \r\n",
"\r\n",
"vect_word->svd \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"svd->rf \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_char \r\n",
"\r\n",
"vect_char \r\n",
" \r\n",
"\r\n",
"vect_char->selector \r\n",
" \r\n",
" \r\n",
" \r\n",
"\r\n",
"vect_char->svd \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n",
" \r\n"
],
"text/plain": [
""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gpipeline10 = GraphPipeline(models = {\n",
" \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n",
" \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n",
" \"imp\":NumImputer(),\n",
" \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n",
" \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n",
 "        \"svd\":TruncatedSVDWrapper(n_components=10, random_state=123),\n",
 "        \"selector\":FeaturesSelectorClassifier(n_components=10, random_state=123),\n",
" \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n",
" },\n",
" edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),\n",
" (\"vect_word\",\"selector\",\"rf\"),\n",
" (\"vect_char\",\"selector\",\"rf\"),\n",
" (\"vect_word\",\"svd\",\"rf\"),\n",
" (\"vect_char\",\"svd\",\"rf\")])\n",
"\n",
"gpipeline10.fit(Xtrain,y_train)\n",
"gpipeline10.graphviz"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this model, here is what happens:\n",
"* categorical columns are encoded ('enc')\n",
"* missing values are imputed ('imp')\n",
"* bag-of-word and bag-of-char features are created from the two text columns ('vect_word', 'vect_char')\n",
"* an SVD is applied to those features ('svd')\n",
"* a selector keeps the most important bag-of-word/char features ('selector')\n",
"* everything is fed to a RandomForest ('rf')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 0 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 1 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 2 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 3 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 4 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 5 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 6 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 7 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 8 started\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"cv 9 started\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 16.9s finished\n"
]
},
{
"data": {
"text/plain": [
"test_roc_auc 0.994333\n",
"test_accuracy 0.975201\n",
"test_neg_log_loss -0.143788\n",
"dtype: float64"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cv_result = cross_validation(gpipeline10,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n",
"cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw, the GraphPipeline allows a lot of flexibility when building models, and several choices can easily be tested.\n",
"\n",
"Again, these are not necessarily the best choices for this dataset; the examples are here to illustrate the capabilities.\n",
"\n",
"Better scores could be obtained by tuning hyper-parameters, changing the models/transformers, and creating new features.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}