{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### GraphPipeline getting started ###\n", "This notebook is here to show a few things that can be done by the package.\n", "\n", "It doesn't means that these are the things you should do on that particular dataset.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's load titanic dataset to test a few things" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore') # to remove gensim warning" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from aikit.datasets.datasets import load_dataset, DatasetEnum\n", "Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
pclassnamesexagesibspparchticketfarecabinembarkedboatbodyhome_dest
01McCarthy, Mr. Timothy Jmale54.0001746351.8625E46SNaN175.0Dorchester, MA
11Fortune, Mr. Markmale64.01419950263.0000C23 C25 C27SNaNNaNWinnipeg, MB
21Sagesser, Mlle. Emmafemale24.000PC 1747769.3000B35C9NaNNaN
33Panula, Master. Urho Abrahammale2.041310129539.6875NaNSNaNNaNNaN
41Maioni, Miss. Robertafemale16.00011015286.5000B79S8NaNNaN
53Waelens, Mr. Achillemale22.0003457679.0000NaNSNaNNaNAntwerp, Belgium / Stanton, OH
63Reed, Mr. James GeorgemaleNaN003623167.2500NaNSNaNNaNNaN
71Swift, Mrs. Frederick Joel (Margaret Welles Ba...female48.0001746625.9292D17S8NaNBrooklyn, NY
81Smith, Mrs. Lucien Philip (Mary Eloise Hughes)female18.0101369560.0000C31S6NaNHuntington, WV
91Rowe, Mr. Alfred Gmale33.00011379026.5500NaNSNaN109.0London
103Meo, Mr. Alfonzomale55.500A.5. 112068.0500NaNSNaN201.0NaN
113Abbott, Mr. Rossmore Edwardmale16.011C.A. 267320.2500NaNSNaN190.0East Providence, RI
123Elias, Mr. DibomaleNaN0026747.2250NaNCNaNNaNNaN
132Reynaldo, Ms. Encarnacionfemale28.00023043413.0000NaNS9NaNSpain
143Khalil, Mr. BetrosmaleNaN10266014.4542NaNCNaNNaNNaN
151Daniels, Miss. Sarahfemale33.000113781151.5500NaNS8NaNNaN
163Ford, Miss. Robina Maggie 'Ruby'female9.022W./C. 660834.3750NaNSNaNNaNRotherfield, Sussex, England Essex Co, MA
173Thorneycroft, Mrs. Percival (Florence Kate White)femaleNaN1037656416.1000NaNS10NaNNaN
183Lennon, Mr. DenismaleNaN1037037115.5000NaNQNaNNaNNaN
193de Pelsmaeker, Mr. Alfonsmale16.0003457789.5000NaNSNaNNaNNaN
\n", "
" ], "text/plain": [ " pclass name sex age \\\n", "0 1 McCarthy, Mr. Timothy J male 54.0 \n", "1 1 Fortune, Mr. Mark male 64.0 \n", "2 1 Sagesser, Mlle. Emma female 24.0 \n", "3 3 Panula, Master. Urho Abraham male 2.0 \n", "4 1 Maioni, Miss. Roberta female 16.0 \n", "5 3 Waelens, Mr. Achille male 22.0 \n", "6 3 Reed, Mr. James George male NaN \n", "7 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 \n", "8 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 18.0 \n", "9 1 Rowe, Mr. Alfred G male 33.0 \n", "10 3 Meo, Mr. Alfonzo male 55.5 \n", "11 3 Abbott, Mr. Rossmore Edward male 16.0 \n", "12 3 Elias, Mr. Dibo male NaN \n", "13 2 Reynaldo, Ms. Encarnacion female 28.0 \n", "14 3 Khalil, Mr. Betros male NaN \n", "15 1 Daniels, Miss. Sarah female 33.0 \n", "16 3 Ford, Miss. Robina Maggie 'Ruby' female 9.0 \n", "17 3 Thorneycroft, Mrs. Percival (Florence Kate White) female NaN \n", "18 3 Lennon, Mr. Denis male NaN \n", "19 3 de Pelsmaeker, Mr. Alfons male 16.0 \n", "\n", " sibsp parch ticket fare cabin embarked boat body \\\n", "0 0 0 17463 51.8625 E46 S NaN 175.0 \n", "1 1 4 19950 263.0000 C23 C25 C27 S NaN NaN \n", "2 0 0 PC 17477 69.3000 B35 C 9 NaN \n", "3 4 1 3101295 39.6875 NaN S NaN NaN \n", "4 0 0 110152 86.5000 B79 S 8 NaN \n", "5 0 0 345767 9.0000 NaN S NaN NaN \n", "6 0 0 362316 7.2500 NaN S NaN NaN \n", "7 0 0 17466 25.9292 D17 S 8 NaN \n", "8 1 0 13695 60.0000 C31 S 6 NaN \n", "9 0 0 113790 26.5500 NaN S NaN 109.0 \n", "10 0 0 A.5. 11206 8.0500 NaN S NaN 201.0 \n", "11 1 1 C.A. 2673 20.2500 NaN S NaN 190.0 \n", "12 0 0 2674 7.2250 NaN C NaN NaN \n", "13 0 0 230434 13.0000 NaN S 9 NaN \n", "14 1 0 2660 14.4542 NaN C NaN NaN \n", "15 0 0 113781 151.5500 NaN S 8 NaN \n", "16 2 2 W./C. 
6608 34.3750 NaN S NaN NaN \n", "17 1 0 376564 16.1000 NaN S 10 NaN \n", "18 1 0 370371 15.5000 NaN Q NaN NaN \n", "19 0 0 345778 9.5000 NaN S NaN NaN \n", "\n", " home_dest \n", "0 Dorchester, MA \n", "1 Winnipeg, MB \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 Antwerp, Belgium / Stanton, OH \n", "6 NaN \n", "7 Brooklyn, NY \n", "8 Huntington, WV \n", "9 London \n", "10 NaN \n", "11 East Providence, RI \n", "12 NaN \n", "13 Spain \n", "14 NaN \n", "15 NaN \n", "16 Rotherfield, Sussex, England Essex Co, MA \n", "17 NaN \n", "18 NaN \n", "19 NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xtrain.head(20)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0],\n", " dtype=int64)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train[0:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For now let's ignore the Name and Ticket column which should probably be handled as text" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Matplotlib won't work\n" ] } ], "source": [ "import pandas as pd\n", "from aikit.transformers import TruncatedSVDWrapper, NumImputer, CountVectorizerWrapper, NumericalEncoder\n", "from aikit.pipeline import GraphPipeline\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['pclass',\n", " 'sex',\n", " 'age',\n", " 'sibsp',\n", " 'parch',\n", " 'fare',\n", " 'cabin',\n", " 'embarked',\n", " 'boat',\n", " 'body',\n", " 'home_dest']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "non_text_cols = [c for c in Xtrain.columns if c 
not in (\"ticket\",\"name\")] # everything that is not text\n", "non_text_cols" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "forest\r\n", "\r\n", "forest\r\n", "\r\n", "\r\n", "imp->forest\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline = GraphPipeline(models = { \"enc\":NumericalEncoder(),\n", " \"imp\":NumImputer(),\n", " \"forest\":RandomForestClassifier(n_estimators=100)\n", " },\n", " edges = [(\"enc\",\"imp\",\"forest\")])\n", "\n", "gpipeline.fit(Xtrain.loc[:,non_text_cols],y_train)\n", "gpipeline.graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Let's do a cross-validation" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": 
"stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 3.8s finished\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
test_roc_auctest_accuracytest_neg_log_losstrain_roc_auctrain_accuracytrain_neg_log_lossfit_timescore_timen_test_samplesfold_nb
00.9973320.990476-0.0503910.9998300.995758-0.0295590.2005670.0768031050
10.9683690.961905-0.7232500.9999860.997879-0.0226510.1925970.0668561051
20.9832320.942857-0.1544830.9998160.995758-0.0262560.2005260.0698521052
31.0000001.000000-0.0357420.9997070.995758-0.0308250.2100220.0707141053
40.9963800.961905-0.0883000.9998020.995758-0.0286420.1990740.0648681054
50.9918060.952381-0.1257930.9997970.997879-0.0258160.1936440.0703611055
61.0000001.000000-0.0409400.9997030.995758-0.0296090.2150090.0658311056
70.9963800.980952-0.0885080.9998420.996819-0.0266140.1847090.0777911057
80.9928380.971154-0.1070170.9997930.995763-0.0276100.1875330.0637961048
90.9996130.980769-0.0720260.9997640.995763-0.0288870.1995050.0628471049
\n", "
" ], "text/plain": [ " test_roc_auc test_accuracy test_neg_log_loss train_roc_auc \\\n", "0 0.997332 0.990476 -0.050391 0.999830 \n", "1 0.968369 0.961905 -0.723250 0.999986 \n", "2 0.983232 0.942857 -0.154483 0.999816 \n", "3 1.000000 1.000000 -0.035742 0.999707 \n", "4 0.996380 0.961905 -0.088300 0.999802 \n", "5 0.991806 0.952381 -0.125793 0.999797 \n", "6 1.000000 1.000000 -0.040940 0.999703 \n", "7 0.996380 0.980952 -0.088508 0.999842 \n", "8 0.992838 0.971154 -0.107017 0.999793 \n", "9 0.999613 0.980769 -0.072026 0.999764 \n", "\n", " train_accuracy train_neg_log_loss fit_time score_time n_test_samples \\\n", "0 0.995758 -0.029559 0.200567 0.076803 105 \n", "1 0.997879 -0.022651 0.192597 0.066856 105 \n", "2 0.995758 -0.026256 0.200526 0.069852 105 \n", "3 0.995758 -0.030825 0.210022 0.070714 105 \n", "4 0.995758 -0.028642 0.199074 0.064868 105 \n", "5 0.997879 -0.025816 0.193644 0.070361 105 \n", "6 0.995758 -0.029609 0.215009 0.065831 105 \n", "7 0.996819 -0.026614 0.184709 0.077791 105 \n", "8 0.995763 -0.027610 0.187533 0.063796 104 \n", "9 0.995763 -0.028887 0.199505 0.062847 104 \n", "\n", " fold_nb \n", "0 0 \n", "1 1 \n", "2 2 \n", "3 3 \n", "4 4 \n", "5 5 \n", "6 6 \n", "7 7 \n", "8 8 \n", "9 9 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from aikit.cross_validation import cross_validation\n", "from sklearn.model_selection import StratifiedKFold\n", "cv = StratifiedKFold(10, random_state=123, shuffle=True)\n", "\n", "cv_result = cross_validation(gpipeline, Xtrain.loc[:,non_text_cols], y_train,cv = cv,\n", " scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This cross-validate the complete Pipeline. 
The difference with sklearn function is that :\n", "* you can score more than one metric at a time\n", "* you retrieve train and test score" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "test_roc_auc 0.992595\n", "test_accuracy 0.974240\n", "test_neg_log_loss -0.148645\n", "dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do the same but selecting the columns directly in the pipeline :" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "forest\r\n", "\r\n", "forest\r\n", "\r\n", "\r\n", "imp->forest\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from aikit.transformers import ColumnsSelector\n", "gpipeline2 = GraphPipeline(models = { \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"forest\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\",\"enc\",\"imp\",\"forest\")])\n", "\n", "gpipeline2.fit(Xtrain,y_train)\n", "gpipeline2.graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Remark : 'columns_to_use=\"object\"' tells aikit to encode the columns of type object, it will keep the rest untouched" ] }, { "cell_type": "code", "execution_count": 11, "metadata": 
{}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 4.0s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.991698\n", "test_accuracy 0.972335\n", "test_neg_log_loss -0.178280\n", "dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline2,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see what we can do with the columns we excluded. We could craft features from them, but let's try to use them as text directly." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CountVectorizerWrapper(analyzer='word', column_prefix='BAG',\n", " columns_to_use=['ticket', 'name'],\n", " desired_output_type='SparseArray',\n", " drop_unused_columns=True, drop_used_columns=True,\n", " max_df=1.0, max_features=None, min_df=1, ngram_range=1,\n", " regex_match=False, tfidf=False, vocabulary=None)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_cols = [\"ticket\",\"name\"]\n", "vect = CountVectorizerWrapper(analyzer=\"word\", columns_to_use=text_cols)\n", "vect.fit(Xtrain,y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Remark : aikit CountVectorizer can direcly work on 2 (or more) columns, no need to use a FeatureUnion or something of the sort" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ticket__BAG__10482',\n", " 'ticket__BAG__110152',\n", " 'ticket__BAG__110413',\n", " 'ticket__BAG__110465',\n", " 'ticket__BAG__110469',\n", " 'ticket__BAG__110489',\n", " 'ticket__BAG__110564',\n", " 'ticket__BAG__110813',\n", " 'ticket__BAG__111163',\n", " 'ticket__BAG__111240',\n", " 'ticket__BAG__111320',\n", " 'ticket__BAG__111361',\n", " 'ticket__BAG__111369',\n", " 'ticket__BAG__111426',\n", " 'ticket__BAG__111427',\n", " 'ticket__BAG__112050',\n", " 'ticket__BAG__112052',\n", " 'ticket__BAG__112053',\n", " 'ticket__BAG__112058',\n", " 'ticket__BAG__11206',\n", " '...',\n", " 'name__BAG__woolf',\n", " 'name__BAG__woolner',\n", " 'name__BAG__worth',\n", " 'name__BAG__wright',\n", " 'name__BAG__wyckoff',\n", " 'name__BAG__yarred',\n", " 'name__BAG__yasbeck',\n", " 'name__BAG__ylio',\n", " 'name__BAG__yoto',\n", " 'name__BAG__young',\n", " 'name__BAG__youseff',\n", " 'name__BAG__yousif',\n", " 'name__BAG__youssef',\n", " 'name__BAG__yousseff',\n", " 'name__BAG__yrois',\n", " 'name__BAG__zabour',\n", " 
'name__BAG__zakarian',\n", " 'name__BAG__zebley',\n", " 'name__BAG__zenni',\n", " 'name__BAG__zillah']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features = vect.get_feature_names()\n", "features[0:20] + [\"...\"] + features[-20:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The encoder directly encodes the 2 features" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<1048x2440 sparse matrix of type ''\n", "\twith 5414 stored elements in COOrdinate format>" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "xx_res = vect.transform(Xtrain)\n", "xx_res" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again let's create a GraphPipeline to cross-validate" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "vect\r\n", "\r\n", "vect\r\n", "\r\n", "\r\n", "logit\r\n", "\r\n", "logit\r\n", "\r\n", "\r\n", "vect->logit\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline3 = GraphPipeline(models = {\"vect\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"logit\":LogisticRegression(solver=\"liblinear\", random_state=123)},\n", " edges=[(\"vect\",\"logit\")])\n", "gpipeline3.fit(Xtrain,y_train)\n", "gpipeline3.graphviz" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n", "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", 
"\n", "cv 3 started\n", "\n", "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n", "cv 6 started\n", "\n", "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n", "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.9s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.850918\n", "test_accuracy 0.819679\n", "test_neg_log_loss -0.451681\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline3, Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also try we \"bag of char\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "vect\r\n", "\r\n", "vect\r\n", "\r\n", "\r\n", "logit\r\n", "\r\n", "logit\r\n", "\r\n", "\r\n", "vect->logit\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline4 = GraphPipeline(models = {\n", " \"vect\": CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n", " \"logit\": LogisticRegression(solver=\"liblinear\", random_state=123) }, edges=[(\"vect\",\"logit\")])\n", "gpipeline4.fit(Xtrain,y_train)\n", "gpipeline4.graphviz" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", 
"text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 5.9s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.849773\n", "test_accuracy 0.813956\n", "test_neg_log_loss -0.559254\n", "dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline4,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Now let's use all the columns" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "rf\r\n", "\r\n", "rf\r\n", "\r\n", "\r\n", "imp->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect\r\n", "\r\n", "vect\r\n", "\r\n", 
"\r\n", "vect->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline5 = GraphPipeline(models = {\n", " \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"vect\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect\",\"rf\")])\n", "gpipeline5.fit(Xtrain,y_train)\n", "gpipeline5.graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This model uses both set of columns:\n", "* bag of word\n", "* and categorical/numerical features" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 11.0s finished\n" 
] }, { "data": { "text/plain": [ "test_roc_auc 0.992779\n", "test_accuracy 0.968507\n", "test_neg_log_loss -0.173236\n", "dtype: float64" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline5,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use both Bag of Char and Bag of Word " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "rf\r\n", "\r\n", "rf\r\n", "\r\n", "\r\n", "imp->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char\r\n", "\r\n", "vect_char\r\n", "\r\n", "\r\n", "vect_char->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_word\r\n", "\r\n", "vect_word\r\n", "\r\n", "\r\n", "vect_word->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline6 = GraphPipeline(models = {\n", " \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"vect_char\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"vect_word\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n", " \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect_char\",\"rf\"),(\"vect_word\",\"rf\")])\n", 
"gpipeline6.fit(Xtrain,y_train)\n", "gpipeline6.graphviz" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 13.9s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.947360\n", "test_accuracy 0.843516\n", "test_neg_log_loss -0.325666\n", "dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline6,Xtrain,y_train,cv = cv,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Maybe we can try SVD to limit dimension of bag of char/word features" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", 
"enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "rf\r\n", "\r\n", "rf\r\n", "\r\n", "\r\n", "imp->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "svd\r\n", "\r\n", "svd\r\n", "\r\n", "\r\n", "svd->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_word\r\n", "\r\n", "vect_word\r\n", "\r\n", "\r\n", "vect_word->svd\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char\r\n", "\r\n", "vect_char\r\n", "\r\n", "\r\n", "vect_char->svd\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline7 = GraphPipeline(models = {\n", " \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n", " \"svd\":TruncatedSVDWrapper(n_components=100, random_state=123),\n", " \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\", \"enc\",\"imp\",\"rf\"),(\"vect_word\",\"svd\",\"rf\"),(\"vect_char\",\"svd\",\"rf\")])\n", "gpipeline7.fit(Xtrain,y_train)\n", "gpipeline7.graphviz" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": 
[ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 23.4s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.992953\n", "test_accuracy 0.972326\n", "test_neg_log_loss -0.167037\n", "dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline7,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can even add 'SVD' columns AND bag of word/char columns " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "rf\r\n", "\r\n", "rf\r\n", "\r\n", "\r\n", "imp->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "svd\r\n", "\r\n", "svd\r\n", "\r\n", "\r\n", "svd->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_word\r\n", "\r\n", "vect_word\r\n", "\r\n", "\r\n", "vect_word->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_word->svd\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char\r\n", 
"\r\n", "vect_char\r\n", "\r\n", "\r\n", "vect_char->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char->svd\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline8 = GraphPipeline(models = {\n", " \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n", " \"svd\":TruncatedSVDWrapper(n_components=100, random_state=123),\n", " \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect_word\",\"svd\",\"rf\"),(\"vect_char\",\"svd\",\"rf\"),(\"vect_word\",\"rf\"),(\"vect_char\",\"rf\")])\n", "\n", "gpipeline8.graphviz" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": 
"stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 22.1s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.941329\n", "test_accuracy 0.834011\n", "test_neg_log_loss -0.334545\n", "dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline8,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of 'SVD' we can add a layer that filters columns... " ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from aikit.transformers import FeaturesSelectorClassifier" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "rf\r\n", "\r\n", "rf\r\n", "\r\n", "\r\n", "imp->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "selector\r\n", "\r\n", "selector\r\n", "\r\n", "\r\n", "selector->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_word\r\n", "\r\n", "vect_word\r\n", "\r\n", "\r\n", "vect_word->selector\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char\r\n", "\r\n", "vect_char\r\n", "\r\n", "\r\n", "vect_char->selector\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline9 = GraphPipeline(models = {\n", " \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " 
\"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n", " \"selector\":FeaturesSelectorClassifier(n_components=20),\n", " \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),(\"vect_word\",\"selector\",\"rf\"),(\"vect_char\",\"selector\",\"rf\")])\n", "\n", "gpipeline9.graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Retrieve feature importance\n", "Let's use that complicated example to show how to retrieve the feature importance" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "boat____null__ 3.839758e-01\n", "sex__female 3.816301e-02\n", "name__BAG__mr 3.715979e-02\n", "name__BAG__mr. 3.636483e-02\n", "fare 3.419880e-02\n", "name__BAG__mr. 3.133609e-02\n", "sex__male 2.962421e-02\n", "name__BAG__r. 2.910019e-02\n", "name__BAG__s. 2.776609e-02\n", "boat__15 2.672268e-02\n", "age 2.643157e-02\n", "name__BAG__s. 2.500470e-02\n", "name__BAG__ mr. 2.249752e-02\n", "boat__13 1.863079e-02\n", "boat____default__ 1.711391e-02\n", "pclass 1.665125e-02\n", "name__BAG__ 1.597853e-02\n", "sibsp 1.524516e-02\n", "home_dest____null__ 1.015056e-02\n", "boat__7 9.817018e-03\n", "home_dest____default__ 9.534058e-03\n", "boat__C 9.453317e-03\n", "cabin____null__ 8.265959e-03\n", "cabin____default__ 7.290940e-03\n", "parch 7.138940e-03\n", "embarked__S 6.643220e-03\n", "boat__5 6.206360e-03\n", "name__BAG__iss. 
6.139824e-03\n", "embarked__C 6.040638e-03\n", "boat__3 5.547742e-03\n", "name__BAG__( 5.352397e-03\n", "name__BAG__mr 5.260205e-03\n", "body_isnull 4.829877e-03\n", "name__BAG__ ( 4.360392e-03\n", "boat__16 4.245866e-03\n", "boat__9 4.224166e-03\n", "boat__D 4.194419e-03\n", "name__BAG__ss 4.076246e-03\n", "embarked__Q 4.047912e-03\n", "name__BAG__mrs 3.602001e-03\n", "body 2.955222e-03\n", "name__BAG__rs 2.899086e-03\n", "name__BAG__rs. 2.869114e-03\n", "age_isnull 2.859144e-03\n", "boat__14 2.809765e-03\n", "boat__10 2.695927e-03\n", "name__BAG__rs. 2.165917e-03\n", "boat__12 2.103210e-03\n", "name__BAG__mrs. 2.064884e-03\n", "home_dest__New York, NY 1.799501e-03\n", "boat__11 1.495248e-03\n", "name__BAG__miss 1.054318e-03\n", "name__BAG__mrs 9.616334e-04\n", "boat__4 9.420111e-04\n", "boat__6 8.184419e-04\n", "home_dest__London 7.515602e-04\n", "boat__8 3.679950e-04\n", "fare_isnull 2.438310e-09\n", "dtype: float64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline9.fit(Xtrain, y_train)\n", "\n", "df_imp = pd.Series(gpipeline9.models[\"rf\"].feature_importances_,\n", " index = gpipeline9.get_input_features_at_node(\"rf\"))\n", "df_imp.sort_values(ascending=False,inplace=True)\n", "df_imp" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": 
"stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 15.0s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.994108\n", "test_accuracy 0.973288\n", "test_neg_log_loss -0.153255\n", "dtype: float64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline9,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "%3\r\n", "\r\n", "\r\n", "enc\r\n", "\r\n", "enc\r\n", "\r\n", "\r\n", "imp\r\n", "\r\n", "imp\r\n", "\r\n", "\r\n", "enc->imp\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "rf\r\n", "\r\n", "rf\r\n", "\r\n", "\r\n", "imp->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "sel\r\n", "\r\n", "sel\r\n", "\r\n", "\r\n", "sel->enc\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "selector\r\n", "\r\n", "selector\r\n", "\r\n", "\r\n", "selector->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_word\r\n", "\r\n", "vect_word\r\n", "\r\n", "\r\n", "vect_word->selector\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "svd\r\n", "\r\n", "svd\r\n", "\r\n", "\r\n", "vect_word->svd\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "svd->rf\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char\r\n", "\r\n", "vect_char\r\n", "\r\n", "\r\n", "vect_char->selector\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "vect_char->svd\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "\r\n" ], "text/plain": [ "" ] }, "execution_count": 31, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "gpipeline10 = GraphPipeline(models = {\n", " \"sel\":ColumnsSelector(columns_to_use=non_text_cols),\n", " \"enc\":NumericalEncoder(columns_to_use=\"object\"),\n", " \"imp\":NumImputer(),\n", " \"vect_word\":CountVectorizerWrapper(analyzer=\"word\",columns_to_use=text_cols),\n", " \"vect_char\":CountVectorizerWrapper(analyzer=\"char\",ngram_range=(1,4),columns_to_use=text_cols),\n", " \"svd\":TruncatedSVDWrapper(n_components=10),\n", " \"selector\":FeaturesSelectorClassifier(n_components=10, random_state=123),\n", " \"rf\":RandomForestClassifier(n_estimators=100, random_state=123)\n", " },\n", " edges = [(\"sel\",\"enc\",\"imp\",\"rf\"),\n", " (\"vect_word\",\"selector\",\"rf\"),\n", " (\"vect_char\",\"selector\",\"rf\"),\n", " (\"vect_word\",\"svd\",\"rf\"),\n", " (\"vect_char\",\"svd\",\"rf\")])\n", "\n", "gpipeline10.fit(Xtrain,y_train)\n", "gpipeline10.graphviz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is what happens in this model:\n", "* categorical columns are encoded ('enc')\n", "* missing values are filled ('imp')\n", "* bag-of-word and bag-of-char features are created for the two text columns ('vect_word', 'vect_char')\n", "* an SVD is applied to those features ('svd')\n", "* a selector keeps the most important bag-of-word/char features ('selector')\n", "* everything is fed to a RandomForest ('rf')" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 0 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 1 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 2 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 3 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 4 started\n", "\n" ] 
}, { "name": "stdout", "output_type": "stream", "text": [ "cv 5 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 6 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 7 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 8 started\n", "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "cv 9 started\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 16.9s finished\n" ] }, { "data": { "text/plain": [ "test_roc_auc 0.994333\n", "test_accuracy 0.975201\n", "test_neg_log_loss -0.143788\n", "dtype: float64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cv_result = cross_validation(gpipeline10,Xtrain,y_train,cv = 10,scoring=[\"roc_auc\",\"accuracy\",\"neg_log_loss\"])\n", "cv_result.loc[:,(\"test_roc_auc\",\"test_accuracy\",\"test_neg_log_loss\")].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw, GraphPipeline allows flexibility in the creation of models, and several choices can easily be tested.\n", "\n", "Again, these are not the best possible choices for this dataset; the examples are here to illustrate the capabilities.\n", "\n", "Better scores could be obtained by adjusting hyper-parameters and/or models/transformers and by creating new features.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" } }, "nbformat": 4, "nbformat_minor": 2 }