Auto ML

This notebook will explain the auto-ml capabilities of aikit.

It walks through the different objects involved. If you just want to run the auto-ml, you should use the automl launcher instead.

Let’s start by loading a small dataset

[1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from aikit.datasets.datasets import load_dataset, DatasetEnum
dfX, y, _, _, _ = load_dataset(DatasetEnum.titanic)
dfX.head()
[1]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
[2]:
y[0:5]
[2]:
array([0, 0, 1, 0, 1])

Now let’s load what is needed

[4]:
from aikit.ml_machine import AutoMlConfig, JobConfig,  MlJobManager, MlJobRunner, AutoMlResultReader
from aikit.ml_machine import FolderDataPersister, SavingType, AutoMlModelGuider

AutoML configuration object

This object will contain all the relevant information about the problem at hand:

  • its type : REGRESSION or CLASSIFICATION
  • the information about the columns in the data
  • the steps that are needed in the processing pipeline (see explanation after)
  • the models that are to be tested
  • …

By default everything is guessed, but anything can be changed manually if needed

[5]:
auto_ml_config = AutoMlConfig(dfX = dfX, y = y, name = "titanic")
auto_ml_config.guess_everything()
auto_ml_config
[5]:
<aikit.ml_machine.ml_machine.AutoMlConfig object at 0x10c90fa90>
type of problem : CLASSIFICATION

type of problem

[6]:
auto_ml_config.type_of_problem
[6]:
'CLASSIFICATION'

The config guessed that this is a classification problem

information about columns

[7]:
auto_ml_config.columns_informations
[7]:
OrderedDict([('pclass',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('name',
              {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
             ('sex',
              {'TypeOfVariable': 'CAT', 'HasMissing': False, 'ToKeep': True}),
             ('age',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('sibsp',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('parch',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('ticket',
              {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
             ('fare',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('cabin',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('embarked',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('boat',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('body',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('home_dest',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True})])
[8]:
pd.DataFrame(auto_ml_config.columns_informations).T
[8]:
HasMissing ToKeep TypeOfVariable
pclass False True NUM
name False True TEXT
sex False True CAT
age True True NUM
sibsp False True NUM
parch False True NUM
ticket False True TEXT
fare True True NUM
cabin True True CAT
embarked True True CAT
boat True True CAT
body True True NUM
home_dest True True CAT

For each column in the DataFrame, its type was guessed among three possible values:

  • NUM : for numerical columns
  • TEXT : for columns that contain text
  • CAT : for categorical columns

Remarks:

  • The difference between TEXT and CAT is based on the number of different modalities
  • Be careful with categorical values that are encoded as integers (the algorithm won’t know that the feature is really categorical); see the sketch below
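For example, ‘pclass’ is really a categorical feature encoded as an integer. As a minimal sketch, assuming the entries guessed by guess_everything() can simply be edited in place before the search is launched, the type can be forced (and a column dropped) like this:

# assumption: the OrderedDict built by guess_everything() can be modified in place
auto_ml_config.columns_informations["pclass"]["TypeOfVariable"] = "CAT"  # force the integer-coded column to categorical
auto_ml_config.columns_informations["boat"]["ToKeep"] = False            # exclude a column from the search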

columns block

[9]:
auto_ml_config.columns_block
[9]:
OrderedDict([('NUM', ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']),
             ('TEXT', ['name', 'ticket']),
             ('CAT', ['sex', 'cabin', 'embarked', 'boat', 'home_dest'])])

The ml machine has the notion of blocks of columns. For some use cases, features naturally fall into blocks. By default the tool uses the type of feature as blocks, but other groupings can be used.

The ml machine will sometimes try to create a model without one of the blocks
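As a sketch, assuming the attribute can simply be reassigned, custom blocks could be declared like this (the grouping below is purely illustrative):

from collections import OrderedDict

# illustrative, hand-made blocks (assumption: columns_block accepts a custom mapping)
auto_ml_config.columns_block = OrderedDict([
    ("passenger", ["pclass", "sex", "age", "sibsp", "parch"]),
    ("trip", ["ticket", "fare", "cabin", "embarked"]),
    ("other", ["name", "boat", "body", "home_dest"]),
])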

needed steps

[10]:
auto_ml_config.needed_steps
[10]:
[{'step': 'TextPreprocessing', 'optional': True},
 {'step': 'TextEncoder', 'optional': False},
 {'step': 'TextDimensionReduction', 'optional': True},
 {'step': 'CategoryEncoder', 'optional': False},
 {'step': 'MissingValueImputer', 'optional': False},
 {'step': 'Scaling', 'optional': True},
 {'step': 'DimensionReduction', 'optional': True},
 {'step': 'FeatureExtraction', 'optional': True},
 {'step': 'FeatureSelection', 'optional': True},
 {'step': 'Model', 'optional': False}]

The ml machine will create processing pipelines by assembling different steps. Here are the steps it will use for this use case:

  • TextPreprocessing
  • TextEncoder : encoding of text into numerical values
  • TextDimensionReduction : specific dimension reduction for text based features
  • CategoryEncoder : encoder of categorical data
  • MissingValueImputer : since there are missing values they need to be filled
  • Scaling : step to re-scale features
  • DimensionReduction : generic dimension reduction
  • FeatureExtraction : create new features
  • FeatureSelection : feature selection
  • Model : the final classification/regression model

models to keep

[11]:
auto_ml_config.models_to_keep
[11]:
[('Model', 'LogisticRegression'),
 ('Model', 'RandomForestClassifier'),
 ('Model', 'ExtraTreesClassifier'),
 ('Model', 'LGBMClassifier'),
 ('FeatureSelection', 'FeaturesSelectorClassifier'),
 ('TextEncoder', 'CountVectorizerWrapper'),
 ('TextEncoder', 'Word2VecVectorizer'),
 ('TextEncoder', 'Char2VecVectorizer'),
 ('TextPreprocessing', 'TextNltkProcessing'),
 ('TextPreprocessing', 'TextDefaultProcessing'),
 ('TextPreprocessing', 'TextDigitAnonymizer'),
 ('CategoryEncoder', 'NumericalEncoder'),
 ('CategoryEncoder', 'TargetEncoderClassifier'),
 ('MissingValueImputer', 'NumImputer'),
 ('DimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'PCAWrapper'),
 ('TextDimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'KMeansTransformer'),
 ('Scaling', 'CdfScaler')]

This gives us the list of models/transformers to test at each step.

Remark: some steps are removed because they have no transformer yet.
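If some models should not be tried at all, a minimal sketch (assuming models_to_keep can simply be reassigned) is to filter the list before starting the search:

# assumption: models_to_keep is a plain list of (step, model) tuples that can be reassigned
auto_ml_config.models_to_keep = [
    (step, model) for step, model in auto_ml_config.models_to_keep
    if model != "ExtraTreesClassifier"
]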

job configuration

[12]:
job_config = JobConfig()
job_config.guess_cv(auto_ml_config = auto_ml_config, n_splits = 10)
job_config.guess_scoring(auto_ml_config = auto_ml_config)
job_config.score_base_line = None
[13]:
job_config.scoring
[13]:
['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']
[14]:
job_config.cv
[14]:
StratifiedKFold(n_splits=10, random_state=123, shuffle=True)
[15]:
job_config.main_scorer
[15]:
'accuracy'
[16]:
job_config.score_base_line

The baseline can be set if we know what a good performance is. It is used as the threshold below which cross-validation is stopped after the first fold.
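For example, if we already know that an accuracy around 0.75 is achievable (the value here is purely illustrative), the baseline can be set directly:

job_config.score_base_line = 0.75  # illustrative value: first-fold results below this stop the cross-validation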

This object holds the configuration specific to the job to do:

  • how to cross-validate
  • what scoring/benchmark to use

Data Persister

To synchronize the processes and save values, we need an object to take care of that.

This object is a DataPersister, which saves everything on disk (other persisters using a database might be created).

[ ]:
base_folder = # INSERT PATH HERE
data_persister = FolderDataPersister(base_folder = base_folder)

controller

[ ]:
result_reader = AutoMlResultReader(data_persister)
auto_ml_guider = AutoMlModelGuider(result_reader = result_reader,
                                       job_config = job_config,
                                       metric_transformation="default",
                                       avg_metric=True
                                       )

job_controller = MlJobManager(auto_ml_config = auto_ml_config,
                                job_config = job_config,
                                auto_ml_guider = auto_ml_guider,
                                data_persister = data_persister)

The search is driven by a controller process. This process won’t actually train models, but it decides which models should be tried.

Here three objects are actually created:

  • result_reader : its job is to read the results of the auto-ml process and aggregate them
  • auto_ml_guider : its job is to help the controller guide the search (using a Bayesian technique)
  • job_controller : the controller itself

All those objects need the ‘data_persister’ object to write/read data

Now the controller can be started using:

job_controller.run()

You need to launch it in a subprocess or a thread.
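A minimal sketch, using a plain Python thread so that the notebook (or script) stays responsive while the controller loops:

import threading

# run the controller loop in the background; a separate process or console works as well
controller_thread = threading.Thread(target=job_controller.run, daemon=True)
controller_thread.start()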

Worker(s)

The last thing needed is to create worker(s) that will do the actual cross-validation. Those workers will:

  • listen to the controller
  • cross-validate the models they are told to
  • save the results

[ ]:
job_runner = MlJobRunner(dfX = dfX ,
                       y = y,
                       groups = None,
                       auto_ml_config = auto_ml_config,
                       job_config = job_config,
                       data_persister = data_persister)

As before, the runner can be started using job_runner.run().

You need to launch it in a subprocess or a thread, following the same pattern as the controller above.

Result Reader

After a few models have been tested you can look at the results. For that you need the ‘result_reader’ (re-created here for simplicity).

[ ]:
base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

result_reader = AutoMlResultReader(data_persister)

[ ]:
df_results = result_reader.load_all_results()
df_params  = result_reader.load_all_params()
df_errors  = result_reader.load_all_errors()

  • df_results : DataFrame with the scoring results
  • df_params : DataFrame with the parameters of the complete processing pipeline
  • df_errors : DataFrame with the errors

All those DataFrames can be joined using the common ‘job_id’ column

[ ]:
df_merged_result = pd.merge(df_params, df_results, how="inner", on="job_id")
df_merged_error  = pd.merge(df_params, df_errors, how="inner", on="job_id")

And the results can be written to an Excel file (for example):

[ ]:
try:
    df_merged_result.to_excel(base_folder + "/result.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

try:
    df_merged_error.to_excel(base_folder + "/result_error.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

Load a given model

[ ]:
from aikit.ml_machine import FolderDataPersister, SavingType
from aikit.model_definition import sklearn_model_from_param

base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

[ ]:
job_id    = # INSERT job_id here
job_param = data_persister.read(job_id, path = "job_param", write_type = SavingType.json)
job_param
[ ]:
model = sklearn_model_from_param(job_param["model_json"])
model
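As a sketch, assuming the reconstructed model behaves like any scikit-learn estimator, it can then be fitted and used on the data loaded at the top of the notebook:

model.fit(dfX, y)                  # train on the full dataset
predictions = model.predict(dfX)   # predictions on the training data, just to illustrate the API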