Auto ML Manual Launch

Simple Launch

Here are the steps to launch a test manually. For simplicity you can also use an MlMachineLauncher object (see _ml_machine_launcher).
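
For reference, here is a minimal sketch of that alternative, following the launcher documentation; the folder, the name and the loader body are placeholders:

from aikit.ml_machine import MlMachineLauncher

def loader():
    # return the features DataFrame and the target,
    # e.g. as in the loading sketch below
    ...

launcher = MlMachineLauncher(base_folder="automl_folder_experiment_1",
                             name="experiment_1",
                             loader=loader)
launcher.execute_processed_command_argument()  # processes the command passed on the command line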

Load a dataset ‘dfX’ and its target ‘y’, and choose a ‘name’ for the experiment.
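
For example, a minimal loading sketch (the CSV file and the ‘target’ column are hypothetical):

import pandas as pd

df = pd.read_csv("my_data.csv")    # hypothetical dataset
y = df["target"].values            # hypothetical target column
dfX = df.drop(columns=["target"])
name = "experiment_1"

Then create an AutoMlConfig object: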

from aikit.ml_machine import AutoMlConfig, JobConfig
auto_ml_config = AutoMlConfig(dfX=dfX, y=y, name=name)
auto_ml_config.guess_everything()

This object holds the whole configuration of the problem. Then create a JobConfig object:

job_config = JobConfig()
job_config.guess_cv(auto_ml_config=auto_ml_config, n_splits=10)
job_config.guess_scoring(auto_ml_config=auto_ml_config)
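
Once guessed, the chosen settings can be inspected; this assumes guess_cv and guess_scoring store their results on the cv and scoring attributes:

print(job_config.cv)       # the guessed cross-validation strategy
print(job_config.scoring)  # the guessed list of scorers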

The data (dfX and y) can be deleted from the AutoMlConfig after everything is guessed to save memory:

auto_ml_config.dfX = None
auto_ml_config.y = None

If you have an idea of a baseline score you can set it:

job_config.score_base_line = 0.95

Create a DataPersister object (for now only FolderDataPersister is available, but other persisters backed by a database should be possible):

from aikit.ml_machine import FolderDataPersister
base_folder = "automl_folder_experiment_1"
data_persister = FolderDataPersister(base_folder=base_folder)

Now everything is ready: a job controller can be created and started (this controller can use an AutoMlModelGuider to help drive the random search):

from aikit.ml_machine import AutoMlResultReader, AutoMlModelGuider, MlJobManager, MlJobRunner
result_reader = AutoMlResultReader(data_persister)

auto_ml_guider = AutoMlModelGuider(result_reader=result_reader,
                                   job_config=job_config,
                                   metric_transformation="default",
                                   avg_metric=True)

job_controller = MlJobManager(auto_ml_config=auto_ml_config,
                              job_config=job_config,
                              auto_ml_guider=auto_ml_guider,
                              data_persister=data_persister)

In the same way, one (or more) workers can be created:

job_runner = MlJobRunner(dfX=dfX,
                         y=y,
                         auto_ml_config=auto_ml_config,
                         job_config=job_config,
                         data_persister=data_persister)

To start the controller just do:

job_controller.run()

To start the worker just do:

job_runner.run()

Remarks:
  • the controller and the worker(s) should be started separately (for example the controller in a dedicated thread or in a separate process); see the sketch below.
  • the controller doesn’t need the data (dfX and y): they can be deleted from the AutoMlConfig after everything is guessed.
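
Here is a minimal sketch of such a setup, running the controller in a background thread and the worker in the main thread (this assumes both run() calls loop until stopped):

import threading

controller_thread = threading.Thread(target=job_controller.run, daemon=True)
controller_thread.start()  # the controller schedules jobs in the background

job_runner.run()           # the worker fits the scheduled models in this thread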

Result Aggregation

After a while (whenever you want, actually) you can aggregate the results and look at them. The simplest way is to aggregate everything into an Excel file. Create an AutoMlResultReader:

from aikit.ml_machine import AutoMlResultReader, FolderDataPersister

base_folder = "automl_folder_experiment_1"
data_persister = FolderDataPersister(base_folder=base_folder)

result_reader = AutoMlResultReader(data_persister)

Then retrieve everything:

df_results = result_reader.load_all_results()
df_params  = result_reader.load_all_params()
df_errors  = result_reader.load_all_errors()

  • df_results is a DataFrame with all the results (cv scores)
  • df_params is a DataFrame with all the parameters of each model
  • df_errors is a DataFrame with all the errors that arose when fitting the models

Everything is linked by the ‘job_id’ column and can be merged:

import pandas as pd

df_merged_result = pd.merge(df_params, df_results, how="inner", on="job_id")
df_merged_error  = pd.merge(df_params, df_errors, how="inner", on="job_id")

And saved into an Excel file:

df_merged_result.to_excel(base_folder + "/result.xlsx", index=False)
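
The merged errors can be saved the same way, which helps debug failing models (the file name is just a suggestion):

df_merged_error.to_excel(base_folder + "/errors.xlsx", index=False)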

Result File

[screenshot: automl result file]

Here are the columns of the file:

  • job_id : id of the model (this id is used everywhere)
  • hasblock_** columns : indicate whether or not a given block of columns was used
  • steps columns : indicate which transformer/model was used for each step (example: "CategoricalEncoder": "NumericalEncoder")
  • hyper-parameter columns : indicate the hyperparameters of each model/transformer
  • test_** : scoring metrics on testing data (i.e. out of fold)
  • train_** : scoring metrics on training data (i.e. in fold)
  • time, time_score : time to fit and to score each model, respectively
  • nb : number of folds completed (not always the maximum, because the full cross-validation is sometimes skipped when performance is poor)
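
For a quick look at the best models, the merged results can be sorted by one of the test_** columns (a sketch assuming a hypothetical "test_accuracy" column; the actual names depend on the scorers used):

top10 = df_merged_result.sort_values("test_accuracy", ascending=False).head(10)
print(top10[["job_id", "test_accuracy", "nb"]])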