Getting Started¶

This notebook will show you how to built a complexe pipeline using aikit and how to crossvalidated it

[3]:

from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
Xtrain.head(10)

[3]:

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home_dest
0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S	NaN	175.0	Dorchester, MA
1	1	Fortune, Mr. Mark	male	64.0	1	4	19950	263.0000	C23 C25 C27	S	NaN	NaN	Winnipeg, MB
2	1	Sagesser, Mlle. Emma	female	24.0	0	0	PC 17477	69.3000	B35	C	9	NaN	NaN
3	3	Panula, Master. Urho Abraham	male	2.0	4	1	3101295	39.6875	NaN	S	NaN	NaN	NaN
4	1	Maioni, Miss. Roberta	female	16.0	0	0	110152	86.5000	B79	S	8	NaN	NaN
5	3	Waelens, Mr. Achille	male	22.0	0	0	345767	9.0000	NaN	S	NaN	NaN	Antwerp, Belgium / Stanton, OH
6	3	Reed, Mr. James George	male	NaN	0	0	362316	7.2500	NaN	S	NaN	NaN	NaN
7	1	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	0	0	17466	25.9292	D17	S	8	NaN	Brooklyn, NY
8	1	Smith, Mrs. Lucien Philip (Mary Eloise Hughes)	female	18.0	1	0	13695	60.0000	C31	S	6	NaN	Huntington, WV
9	1	Rowe, Mr. Alfred G	male	33.0	0	0	113790	26.5500	NaN	S	NaN	109.0	London

[4]:

y_train[0:10]

[4]:

array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0], dtype=int64)

[13]:

from aikit.pipeline import GraphPipeline
from aikit.transformers import ColumnsSelector, NumericalEncoder, NumImputer, CountVectorizerWrapper
from sklearn.ensemble import RandomForestClassifier

text_cols     = ["name","ticket"]
non_text_cols = [c for c in Xtrain.columns if c not in text_cols]

gpipeline = GraphPipeline(models = {
    "sel":ColumnsSelector(columns_to_use=non_text_cols),
    "enc":NumericalEncoder(columns_to_use="object"),
    "imp":NumImputer(),
    "vect":CountVectorizerWrapper(analyzer="word",columns_to_use=text_cols),
    "rf":RandomForestClassifier(n_estimators=100, random_state=123)
                       },
              edges = [("sel","enc","imp","rf"),("vect","rf")])

gpipeline.fit(Xtrain,y_train)
gpipeline.graphviz

[13]:

../_images/notebooks_GettingStarted_3_0.svg

[17]:

from aikit.cross_validation import cross_validation
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(10, shuffle=True, random_state=123)

cv_res, yhat_proba = cross_validation(gpipeline, Xtrain, y_train,cv=cv, scoring=["accuracy", "roc_auc", "neg_log_loss"], return_predict=True, method="predict_proba")

cv_res

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

cv 0 started

cv 1 started

cv 2 started

cv 3 started

cv 4 started

cv 5 started

cv 6 started

cv 7 started

cv 8 started

cv 9 started

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:   13.3s finished

[17]:

	test_accuracy	test_roc_auc	test_neg_log_loss	train_accuracy	train_roc_auc	train_neg_log_loss	fit_time	score_time	n_test_samples	fold_nb
0	0.980952	0.998095	-0.116852	1.0	1.0	-0.043928	0.841272	0.133642	105	0
1	0.961905	0.969512	-0.484764	1.0	1.0	-0.042055	0.903584	0.126663	105	1
2	0.961905	0.994474	-0.139215	1.0	1.0	-0.041864	0.747745	0.117702	105	2
3	0.990476	1.000000	-0.102579	1.0	1.0	-0.042316	0.735497	0.120638	105	3
4	0.952381	0.994284	-0.130034	1.0	1.0	-0.044032	0.772531	0.119271	105	4
5	0.961905	0.996570	-0.134116	1.0	1.0	-0.041499	0.725558	0.126695	105	5
6	0.971429	0.998476	-0.140661	1.0	1.0	-0.047236	0.761099	0.116698	105	6
7	0.961905	0.995617	-0.155353	1.0	1.0	-0.040947	0.734543	0.111288	105	7
8	0.961538	0.994386	-0.132630	1.0	1.0	-0.041335	0.740471	0.111782	104	8
9	0.980769	0.996903	-0.150282	1.0	1.0	-0.044830	0.749113	0.112777	104	9

[18]:

yhat_proba.head(10)

[18]:

	0	1
0	1.00	0.00
1	0.88	0.12
2	0.05	0.95
3	0.93	0.07
4	0.07	0.93
5	1.00	0.00
6	1.00	0.00
7	0.03	0.97
8	0.06	0.94
9	0.98	0.02

Using cross_validation you get in one call :¶

both train and test score
all the metrics
the probabilities predicted for each observation