Auto ML¶
This notebook explains the auto-ml capabilities of aikit and walks through the different objects involved. If you just want to run the auto-ml, you should use the automl launcher instead.
Let’s start by loading a small dataset:
[1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from aikit.datasets.datasets import load_dataset,DatasetEnum
dfX, y, _ ,_ , _ = load_dataset(DatasetEnum.titanic)
dfX.head()
[1]:
| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA |
| 1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB |
| 2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN |
| 3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN |
| 4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN |
[2]:
y[0:5]
[2]:
array([0, 0, 1, 0, 1])
Now let’s load what is needed
[4]:
from aikit.ml_machine import AutoMlConfig, JobConfig, MlJobManager, MlJobRunner, AutoMlResultReader
from aikit.ml_machine import FolderDataPersister, SavingType, AutoMlModelGuider
AutoML configuration object¶
This object contains all the relevant information about the problem at hand:

- its type : REGRESSION or CLASSIFICATION
- the information about the columns in the data
- the steps needed in the processing pipeline (see the explanation below)
- the models to be tested
- …

By default everything is guessed from the data, but anything can be changed if needed.
[5]:
auto_ml_config = AutoMlConfig(dfX = dfX, y = y, name = "titanic")
auto_ml_config.guess_everything()
auto_ml_config
[5]:
<aikit.ml_machine.ml_machine.AutoMlConfig object at 0x10c90fa90>
type of problem : CLASSIFICATION
type of problem¶
[6]:
auto_ml_config.type_of_problem
[6]:
'CLASSIFICATION'
The configuration guessed that this is a classification problem.
information about columns¶
[7]:
auto_ml_config.columns_informations
[7]:
OrderedDict([('pclass',
{'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
('name',
{'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
('sex',
{'TypeOfVariable': 'CAT', 'HasMissing': False, 'ToKeep': True}),
('age',
{'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
('sibsp',
{'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
('parch',
{'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),
('ticket',
{'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),
('fare',
{'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
('cabin',
{'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
('embarked',
{'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
('boat',
{'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),
('body',
{'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),
('home_dest',
{'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True})])
[8]:
pd.DataFrame(auto_ml_config.columns_informations).T
[8]:
HasMissing | ToKeep | TypeOfVariable | |
---|---|---|---|
pclass | False | True | NUM |
name | False | True | TEXT |
sex | False | True | CAT |
age | True | True | NUM |
sibsp | False | True | NUM |
parch | False | True | NUM |
ticket | False | True | TEXT |
fare | True | True | NUM |
cabin | True | True | CAT |
embarked | True | True | CAT |
boat | True | True | CAT |
body | True | True | NUM |
home_dest | True | True | CAT |
For each column in the DataFrame, its type was guessed among three possible values:

- NUM : for numerical columns
- TEXT : for columns that contain free text
- CAT : for categorical columns

Remarks:

- The difference between TEXT and CAT is based on the number of distinct modalities
- Be careful with categorical values that are encoded as integers: the algorithm won’t know that they are really categorical features
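To make the remarks concrete, here is a simplified sketch of how such a guess might work (this is an illustration, not aikit’s actual heuristic): classify a column from its dtype and its number of distinct values, which also shows why integer-encoded categories slip through as NUM.

```python
import pandas as pd

def guess_type_of_variable(s: pd.Series, cat_threshold: int = 30) -> str:
    """Toy heuristic: NUM for numeric dtypes, CAT for low-cardinality
    object columns, TEXT otherwise. The threshold is arbitrary."""
    if pd.api.types.is_numeric_dtype(s):
        return "NUM"  # integer-encoded categories also land here!
    if s.nunique() <= cat_threshold:
        return "CAT"
    return "TEXT"

df = pd.DataFrame({
    "fare": [51.8, 263.0, 69.3],
    "sex": ["male", "male", "female"],
    "name": ["McCarthy, Mr. Timothy J", "Fortune, Mr. Mark", "Sagesser, Mlle. Emma"],
    "pclass_code": [1, 1, 3],  # really categorical, but encoded as int
})

guessed = {c: guess_type_of_variable(df[c], cat_threshold=2) for c in df.columns}
print(guessed)
# 'pclass_code' comes out as NUM even though it is conceptually categorical
```

This is why integer-coded categoricals should be declared explicitly rather than left to the guess.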
columns block¶
[9]:
auto_ml_config.columns_block
[9]:
OrderedDict([('NUM', ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']),
('TEXT', ['name', 'ticket']),
('CAT', ['sex', 'cabin', 'embarked', 'boat', 'home_dest'])])
The ml machine has the notion of blocks of columns. For some use cases, features naturally fall into blocks. By default the tool uses the type of each feature as its block, but other groupings can be used.
The ml machine will sometimes try to create a model without one of the blocks.
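The default blocks can be reproduced from the column information: group the column names by their guessed type. A sketch assuming the `columns_informations` structure shown above (truncated here):

```python
from collections import OrderedDict

# structure mirroring auto_ml_config.columns_informations above (truncated)
columns_informations = OrderedDict([
    ("pclass", {"TypeOfVariable": "NUM"}),
    ("name",   {"TypeOfVariable": "TEXT"}),
    ("sex",    {"TypeOfVariable": "CAT"}),
    ("age",    {"TypeOfVariable": "NUM"}),
])

def columns_block(infos):
    """Group the columns by their guessed type of variable."""
    blocks = OrderedDict()
    for col, info in infos.items():
        blocks.setdefault(info["TypeOfVariable"], []).append(col)
    return blocks

print(columns_block(columns_informations))
# OrderedDict([('NUM', ['pclass', 'age']), ('TEXT', ['name']), ('CAT', ['sex'])])
```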
needed steps¶
[10]:
auto_ml_config.needed_steps
[10]:
[{'step': 'TextPreprocessing', 'optional': True},
{'step': 'TextEncoder', 'optional': False},
{'step': 'TextDimensionReduction', 'optional': True},
{'step': 'CategoryEncoder', 'optional': False},
{'step': 'MissingValueImputer', 'optional': False},
{'step': 'Scaling', 'optional': True},
{'step': 'DimensionReduction', 'optional': True},
{'step': 'FeatureExtraction', 'optional': True},
{'step': 'FeatureSelection', 'optional': True},
{'step': 'Model', 'optional': False}]
The ml machine creates processing pipelines by assembling different steps. Here are the steps it will use for this use case:
- TextPreprocessing
- TextEncoder : encoding of text into numerical values
- TextDimensionReduction : specific dimension reduction for text based features
- CategoryEncoder : encoder of categorical data
- MissingValueImputer : since there are missing values, they need to be imputed
- Scaling : step to re-scale features
- DimensionReduction : generic dimension reduction
- FeatureExtraction : create new features
- FeatureSelection : select features
- Model : the final classification/regression model
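To give an idea of the search space these steps induce, here is a hypothetical enumeration (not aikit’s actual search strategy): every mandatory step is always kept, and each optional step is either included or skipped.

```python
from itertools import product

steps = [
    {"step": "TextPreprocessing", "optional": True},
    {"step": "TextEncoder", "optional": False},
    {"step": "TextDimensionReduction", "optional": True},
    {"step": "CategoryEncoder", "optional": False},
    {"step": "MissingValueImputer", "optional": False},
    {"step": "Scaling", "optional": True},
    {"step": "DimensionReduction", "optional": True},
    {"step": "FeatureExtraction", "optional": True},
    {"step": "FeatureSelection", "optional": True},
    {"step": "Model", "optional": False},
]

def pipeline_skeletons(steps):
    """Yield every step sequence where mandatory steps are always kept
    and each optional step is either included or skipped."""
    choices = [(True, False) if s["optional"] else (True,) for s in steps]
    for mask in product(*choices):
        yield [s["step"] for s, keep in zip(steps, mask) if keep]

skeletons = list(pipeline_skeletons(steps))
print(len(skeletons))  # 2**6 = 64 skeletons for the 6 optional steps
```

And that is before even choosing a transformer/model for each step, which multiplies the search space further.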
models to keep¶
[11]:
auto_ml_config.models_to_keep
[11]:
[('Model', 'LogisticRegression'),
('Model', 'RandomForestClassifier'),
('Model', 'ExtraTreesClassifier'),
('Model', 'LGBMClassifier'),
('FeatureSelection', 'FeaturesSelectorClassifier'),
('TextEncoder', 'CountVectorizerWrapper'),
('TextEncoder', 'Word2VecVectorizer'),
('TextEncoder', 'Char2VecVectorizer'),
('TextPreprocessing', 'TextNltkProcessing'),
('TextPreprocessing', 'TextDefaultProcessing'),
('TextPreprocessing', 'TextDigitAnonymizer'),
('CategoryEncoder', 'NumericalEncoder'),
('CategoryEncoder', 'TargetEncoderClassifier'),
('MissingValueImputer', 'NumImputer'),
('DimensionReduction', 'TruncatedSVDWrapper'),
('DimensionReduction', 'PCAWrapper'),
('TextDimensionReduction', 'TruncatedSVDWrapper'),
('DimensionReduction', 'KMeansTransformer'),
('Scaling', 'CdfScaler')]
This gives us the list of models/transformers to test at each step.

Remark: some steps are removed because they have no transformer registered yet.
job configuration¶
[12]:
job_config = JobConfig()
job_config.guess_cv(auto_ml_config = auto_ml_config, n_splits = 10)
job_config.guess_scoring(auto_ml_config = auto_ml_config)
job_config.score_base_line = None
[13]:
job_config.scoring
[13]:
['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']
[14]:
job_config.cv
[14]:
StratifiedKFold(n_splits=10, random_state=123, shuffle=True)
[15]:
job_config.main_scorer
[15]:
'accuracy'
[16]:
job_config.score_base_line
The baseline can be set if we know what a good performance is. It is used as a threshold: if the score on the first cross-validation fold falls below it, the cross-validation is stopped early.
This object holds the configuration specific to the job to do:

- how to cross-validate
- what scoring/benchmark to use
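The baseline early-stopping logic described above can be sketched like this (a toy version with illustrative names, not aikit’s API):

```python
def cross_validate_with_baseline(fold_scores, score_base_line=None):
    """Consume per-fold scores, but stop after the first fold
    if it falls below the baseline (when a baseline is set)."""
    scores = []
    for i, score in enumerate(fold_scores):
        scores.append(score)
        if i == 0 and score_base_line is not None and score < score_base_line:
            return scores, True  # stopped early: not worth a full CV
    return scores, False

# first fold at 0.60 is below a 0.75 baseline -> stop after one fold
print(cross_validate_with_baseline([0.60, 0.80, 0.82], score_base_line=0.75))
# without a baseline, all folds are evaluated
print(cross_validate_with_baseline([0.60, 0.80, 0.82]))
```

This saves most of the cross-validation budget on models that are clearly not competitive.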
Data Persister¶
To synchronize the processes and to save values, we need an object to take care of that.
This object is a DataPersister, which saves everything on disk (other persisters using a database might be created later).
[ ]:
base_folder = # INSERT PATH HERE
data_persister = FolderDataPersister(base_folder = base_folder)
controller¶
[ ]:
result_reader = AutoMlResultReader(data_persister)
auto_ml_guider = AutoMlModelGuider(result_reader = result_reader,
job_config = job_config,
metric_transformation="default",
avg_metric=True
)
job_controller = MlJobManager(auto_ml_config = auto_ml_config,
job_config = job_config,
auto_ml_guider = auto_ml_guider,
data_persister = data_persister)
The search is driven by a controller process. This process won’t actually train models; it decides which models should be tried.

Here three objects are actually created:

- result_reader : reads the results of the auto-ml process and aggregates them
- auto_ml_guider : helps the controller guide the search (using a Bayesian technique)
- job_controller : the controller itself

All those objects need the ‘data_persister’ object to write/read data.
Now the controller can be started using job_controller.run(). You need to launch it in a subprocess.
Worker(s)¶
The last thing needed is to create the worker(s) that will do the actual cross validation. Those workers will:

- listen to the controller
- perform the cross validation of the models they are told to test
- save the results
[ ]:
job_runner = MlJobRunner(dfX = dfX ,
y = y,
groups = None,
auto_ml_config = auto_ml_config,
job_config = job_config,
data_persister = data_persister)
As before, the runner can be started using job_runner.run().
You need to launch it in a subprocess or a thread.
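Launching the runner in a thread can be done with the standard library. A sketch with a dummy target standing in for job_runner.run (which blocks until stopped):

```python
import threading

def run_forever_stub(stop_event):
    """Stand-in for job_runner.run(): loop until asked to stop."""
    while not stop_event.is_set():
        stop_event.wait(0.01)

stop_event = threading.Event()
worker_thread = threading.Thread(target=run_forever_stub,
                                 args=(stop_event,), daemon=True)
worker_thread.start()

# ... let the auto-ml search run, then shut the worker down:
stop_event.set()
worker_thread.join(timeout=1.0)
print(worker_thread.is_alive())  # False once the thread has stopped
```

The daemon flag ensures a forgotten worker does not keep the interpreter alive at exit.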
Result Reader¶
After a few models have been tested you can look at the results. For that you need the ‘result_reader’ (re-created here for simplicity).
[ ]:
base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)
result_reader = AutoMlResultReader(data_persister)
[ ]:
df_results = result_reader.load_all_results()
df_params = result_reader.load_all_params()
df_errors = result_reader.load_all_errors()
- df_results : DataFrame with the scoring results
- df_params : DataFrame with the parameters of the complete processing pipeline
- df_errors : DataFrame with the errors
All those DataFrames can be joined using the common ‘job_id’ column
[ ]:
df_merged_result = pd.merge( df_params, df_results, how = "inner",on = "job_id")
df_merged_error = pd.merge( df_params, df_errors , how = "inner",on = "job_id")
And the results can be written to an Excel file (for example):
[ ]:
try:
df_merged_result.to_excel(base_folder + "/result.xlsx",index=False)
except OSError:
print("I couldn't save excel file")
try:
df_merged_error.to_excel(base_folder + "/result_error.xlsx",index=False)
except OSError:
print("I couldn't save excel file")
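Rather than only printing a message when Excel export fails (no engine installed, locked file, ...), a CSV fallback keeps the results accessible. A sketch with a throw-away DataFrame standing in for the merged results:

```python
import os
import tempfile
import pandas as pd

# stand-in for df_merged_result
df_merged_result = pd.DataFrame({"job_id": ["job_1", "job_2"],
                                 "accuracy": [0.81, 0.79]})

base_folder = tempfile.mkdtemp()
try:
    df_merged_result.to_excel(os.path.join(base_folder, "result.xlsx"), index=False)
except (OSError, ImportError):
    # to_excel needs an optional engine such as openpyxl;
    # fall back to a plain CSV that always works
    df_merged_result.to_csv(os.path.join(base_folder, "result.csv"), index=False)
```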
Load a given model¶
[ ]:
from aikit.ml_machine import FolderDataPersister, SavingType
from aikit.model_definition import sklearn_model_from_param
base_folder = # INSERT path here
data_persister = FolderDataPersister(base_folder = base_folder)
[ ]:
job_id = # INSERT job_id here
job_param = data_persister.read(job_id, path = "job_param", write_type = SavingType.json)
job_param
[ ]:
model = sklearn_model_from_param(job_param["model_json"])
model
[ ]: