Auto ML

This notebook will explain the auto-ml capabilities of aikit.

It walks through the different objects involved. If you just want to run the auto-ml, you should use the automl launcher instead.

Let’s start by loading a small dataset

[1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

from aikit.datasets.datasets import load_dataset, DatasetEnum
dfX, y, _, _, _ = load_dataset(DatasetEnum.titanic)
dfX.head()
[1]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
[2]:
y[0:5]
[2]:
array([0, 0, 1, 0, 1])

Now let’s load what is needed

[4]:
from aikit.ml_machine import AutoMlConfig, JobConfig,  MlJobManager, MlJobRunner, AutoMlResultReader
from aikit.ml_machine import FolderDataPersister, SavingType, AutoMlModelGuider

AutoML configuration object

This object will contain all the relevant information about the problem at hand:

  • its type : REGRESSION or CLASSIFICATION
  • the information about the columns in the data
  • the steps that are needed in the processing pipeline (see explanation after)
  • the models that are to be tested
  • …

By default everything is guessed, but anything can be changed manually if needed

[5]:
auto_ml_config = AutoMlConfig(dfX = dfX, y = y, name = "titanic")
auto_ml_config.guess_everything()
auto_ml_config
[5]:
<aikit.ml_machine.ml_machine.AutoMlConfig object at 0x10c90fa90>
type of problem : CLASSIFICATION

type of problem

[6]:
auto_ml_config.type_of_problem
[6]:
'CLASSIFICATION'

The config guessed that this is a classification problem

information about columns

[7]:
auto_ml_config.columns_informations
[7]:
OrderedDict([('pclass',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('name',
              {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
             ('sex',
              {'TypeOfVariable': 'CAT', 'HasMissing': False, 'ToKeep': True}),
             ('age',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('sibsp',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('parch',
              {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
             ('ticket',
              {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
             ('fare',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('cabin',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('embarked',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('boat',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
             ('body',
              {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
             ('home_dest',
              {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True})])
[8]:
pd.DataFrame(auto_ml_config.columns_informations).T
[8]:
HasMissing ToKeep TypeOfVariable
pclass False True NUM
name False True TEXT
sex False True CAT
age True True NUM
sibsp False True NUM
parch False True NUM
ticket False True TEXT
fare True True NUM
cabin True True CAT
embarked True True CAT
boat True True CAT
body True True NUM
home_dest True True CAT

For each column in the DataFrame, its type was guessed among three possible values:

  • NUM : for numerical columns
  • TEXT : for columns that contain text
  • CAT : for categorical columns

Remarks:

  • The difference between TEXT and CAT is based on the number of different modalities
  • Be careful with categorical values that are encoded as integers (the algorithm won’t know that the feature is really categorical); see the sketch below
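For example, ‘pclass’ is really a categorical feature encoded as an integer. As a minimal sketch, assuming the entries guessed by guess_everything() can simply be edited in place before the search is launched, the type can be forced (and a column dropped) like this:

# assumption: the OrderedDict built by guess_everything() can be modified in place
auto_ml_config.columns_informations["pclass"]["TypeOfVariable"] = "CAT"  # force the integer-coded column to categorical
auto_ml_config.columns_informations["boat"]["ToKeep"] = False            # exclude a column from the search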

columns block

[9]:
auto_ml_config.columns_block
[9]:
OrderedDict([('NUM', ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']),
             ('TEXT', ['name', 'ticket']),
             ('CAT', ['sex', 'cabin', 'embarked', 'boat', 'home_dest'])])

The ml machine has the notion of blocks of columns. For some use cases, features naturally fall into blocks. By default the tool uses the type of feature as blocks, but other groupings can be used.

The ml machine will sometimes try to create a model without one of the blocks
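As a sketch, assuming the attribute can simply be reassigned, custom blocks could be declared like this (the grouping below is purely illustrative):

from collections import OrderedDict

# illustrative, hand-made blocks (assumption: columns_block accepts a custom mapping)
auto_ml_config.columns_block = OrderedDict([
    ("passenger", ["pclass", "sex", "age", "sibsp", "parch"]),
    ("trip", ["ticket", "fare", "cabin", "embarked"]),
    ("other", ["name", "boat", "body", "home_dest"]),
])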

needed steps

[10]:
auto_ml_config.needed_steps
[10]:
[{'step': 'TextPreprocessing', 'optional': True},
 {'step': 'TextEncoder', 'optional': False},
 {'step': 'TextDimensionReduction', 'optional': True},
 {'step': 'CategoryEncoder', 'optional': False},
 {'step': 'MissingValueImputer', 'optional': False},
 {'step': 'Scaling', 'optional': True},
 {'step': 'DimensionReduction', 'optional': True},
 {'step': 'FeatureExtraction', 'optional': True},
 {'step': 'FeatureSelection', 'optional': True},
 {'step': 'Model', 'optional': False}]

The ml machine will create processing pipelines by assembling different steps. Here are the steps it will use for this use case:

  • TextPreprocessing
  • TextEncoder : encoding of text into numerical values
  • TextDimensionReduction : specific dimension reduction for text based features
  • CategoryEncoder : encoder of categorical data
  • MissingValueImputer : since there are missing values they need to be filled
  • Scaling : step to re-scale features
  • DimensionReduction : generic dimension reduction
  • FeatureExtraction : create new features
  • FeatureSelection : feature selection
  • Model : the final classification/regression model

models to keep

[11]:
auto_ml_config.models_to_keep
[11]:
[('Model', 'LogisticRegression'),
 ('Model', 'RandomForestClassifier'),
 ('Model', 'ExtraTreesClassifier'),
 ('Model', 'LGBMClassifier'),
 ('FeatureSelection', 'FeaturesSelectorClassifier'),
 ('TextEncoder', 'CountVectorizerWrapper'),
 ('TextEncoder', 'Word2VecVectorizer'),
 ('TextEncoder', 'Char2VecVectorizer'),
 ('TextPreprocessing', 'TextNltkProcessing'),
 ('TextPreprocessing', 'TextDefaultProcessing'),
 ('TextPreprocessing', 'TextDigitAnonymizer'),
 ('CategoryEncoder', 'NumericalEncoder'),
 ('CategoryEncoder', 'TargetEncoderClassifier'),
 ('MissingValueImputer', 'NumImputer'),
 ('DimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'PCAWrapper'),
 ('TextDimensionReduction', 'TruncatedSVDWrapper'),
 ('DimensionReduction', 'KMeansTransformer'),
 ('Scaling', 'CdfScaler')]

This gives us the list of models/transformers to test at each step.

Remark: some steps are removed because they have no transformer yet.
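If some models should not be tried at all, a minimal sketch (assuming models_to_keep can simply be reassigned) is to filter the list before starting the search:

# assumption: models_to_keep is a plain list of (step, model) tuples that can be reassigned
auto_ml_config.models_to_keep = [
    (step, model) for step, model in auto_ml_config.models_to_keep
    if model != "ExtraTreesClassifier"
]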

job configuration

[12]:
job_config = JobConfig()
job_config.guess_cv(auto_ml_config = auto_ml_config, n_splits = 10)
job_config.guess_scoring(auto_ml_config = auto_ml_config)
job_config.score_base_line = None
[13]:
job_config.scoring
[13]:
['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']
[14]:
job_config.cv
[14]:
StratifiedKFold(n_splits=10, random_state=123, shuffle=True)
[15]:
job_config.main_scorer
[15]:
'accuracy'
[16]:
job_config.score_base_line

The baseline can be set if we know what a good performance is. It is used as the threshold below which cross-validation is stopped after the first fold.
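For example, if we already know that an accuracy around 0.75 is achievable (the value here is purely illustrative), the baseline can be set directly:

job_config.score_base_line = 0.75  # illustrative value: first-fold results below this stop the cross-validation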

This object holds the configuration specific to the job to do:

  • how to cross-validate
  • what scoring/benchmark to use

Data Persister

To synchronize the processes and save values, we need an object to take care of that.

This object is a DataPersister, which saves everything on disk (other persisters using a database might be created).

[ ]:
base_folder = # INSERT PATH HERE
data_persister = FolderDataPersister(base_folder = base_folder)

controller

[ ]:
result_reader = AutoMlResultReader(data_persister)
auto_ml_guider = AutoMlModelGuider(result_reader = result_reader,
                                       job_config = job_config,
                                       metric_transformation="default",
                                       avg_metric=True
                                       )

job_controller = MlJobManager(auto_ml_config = auto_ml_config,
                                job_config = job_config,
                                auto_ml_guider = auto_ml_guider,
                                data_persister = data_persister)

The search is driven by a controller process. This process won’t actually train models, but it decides which models should be tried.

Here three objects are actually created:

  • result_reader : its job is to read the results of the auto-ml process and aggregate them
  • auto_ml_guider : its job is to help the controller guide the search (using a Bayesian technique)
  • job_controller : the controller itself

All those objects need the ‘data_persister’ object to write/read data

Now the controller can be started using:

job_controller.run()

You need to launch it in a subprocess or a thread.
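A minimal sketch, using a plain Python thread so that the notebook (or script) stays responsive while the controller loops:

import threading

# run the controller loop in the background; a separate process or console works as well
controller_thread = threading.Thread(target=job_controller.run, daemon=True)
controller_thread.start()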

Worker(s)

The last thing needed is to create worker(s) that will do the actual cross-validation. Those workers will:

  • listen to the controller
  • cross-validate the models they are told to
  • save the results

[ ]:
job_runner = MlJobRunner(dfX = dfX ,
                       y = y,
                       groups = None,
                       auto_ml_config = auto_ml_config,
                       job_config = job_config,
                       data_persister = data_persister)

As before, the runner can be started using job_runner.run().

You need to launch it in a subprocess or a thread, following the same pattern as the controller above.

Result Reader

After a few models have been tested you can look at the results. For that you need the ‘result_reader’ (re-created here for simplicity).

[ ]:
base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

result_reader = AutoMlResultReader(data_persister)

[ ]:
df_results = result_reader.load_all_results()
df_params  = result_reader.load_all_params()
df_errors  = result_reader.load_all_errors()

  • df_results : DataFrame with the scoring results
  • df_params : DataFrame with the parameters of the complete processing pipeline
  • df_errors : DataFrame with the errors

All those DataFrames can be joined using the common ‘job_id’ column

[ ]:
df_merged_result = pd.merge(df_params, df_results, how="inner", on="job_id")
df_merged_error  = pd.merge(df_params, df_errors, how="inner", on="job_id")

And the results can be written to an Excel file (for example):

[ ]:
try:
    df_merged_result.to_excel(base_folder + "/result.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

try:
    df_merged_error.to_excel(base_folder + "/result_error.xlsx",index=False)
except OSError:
    print("I couldn't save excel file")

Load a given model

[ ]:
from aikit.ml_machine import FolderDataPersister, SavingType
from aikit.model_definition import sklearn_model_from_param

base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)

[ ]:
job_id    = # INSERT job_id here
job_param = data_persister.read(job_id, path = "job_param", write_type = SavingType.json)
job_param
[ ]:
model = sklearn_model_from_param(job_param["model_json"])
model
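As a sketch, assuming the reconstructed model behaves like any scikit-learn estimator, it can then be fitted and used on the data loaded at the top of the notebook:

model.fit(dfX, y)                  # train on the full dataset
predictions = model.predict(dfX)   # predictions on the training data, just to illustrate the API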