{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Auto ML\n",
"This notebook will explain the auto-ml capabilities of aikit.\n",
"\n",
"It shows the several things involved. If you just want to run it you should use the automl launcher"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by loading some small data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" name | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" ticket | \n",
" fare | \n",
" cabin | \n",
" embarked | \n",
" boat | \n",
" body | \n",
" home_dest | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" McCarthy, Mr. Timothy J | \n",
" male | \n",
" 54.0 | \n",
" 0 | \n",
" 0 | \n",
" 17463 | \n",
" 51.8625 | \n",
" E46 | \n",
" S | \n",
" NaN | \n",
" 175.0 | \n",
" Dorchester, MA | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" Fortune, Mr. Mark | \n",
" male | \n",
" 64.0 | \n",
" 1 | \n",
" 4 | \n",
" 19950 | \n",
" 263.0000 | \n",
" C23 C25 C27 | \n",
" S | \n",
" NaN | \n",
" NaN | \n",
" Winnipeg, MB | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" Sagesser, Mlle. Emma | \n",
" female | \n",
" 24.0 | \n",
" 0 | \n",
" 0 | \n",
" PC 17477 | \n",
" 69.3000 | \n",
" B35 | \n",
" C | \n",
" 9 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" Panula, Master. Urho Abraham | \n",
" male | \n",
" 2.0 | \n",
" 4 | \n",
" 1 | \n",
" 3101295 | \n",
" 39.6875 | \n",
" NaN | \n",
" S | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" Maioni, Miss. Roberta | \n",
" female | \n",
" 16.0 | \n",
" 0 | \n",
" 0 | \n",
" 110152 | \n",
" 86.5000 | \n",
" B79 | \n",
" S | \n",
" 8 | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass name sex age sibsp parch ticket \\\n",
"0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 \n",
"1 1 Fortune, Mr. Mark male 64.0 1 4 19950 \n",
"2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 \n",
"3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 \n",
"4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 \n",
"\n",
" fare cabin embarked boat body home_dest \n",
"0 51.8625 E46 S NaN 175.0 Dorchester, MA \n",
"1 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB \n",
"2 69.3000 B35 C 9 NaN NaN \n",
"3 39.6875 NaN S NaN NaN NaN \n",
"4 86.5000 B79 S 8 NaN NaN "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"import pandas as pd\n",
"\n",
"from aikit.datasets.datasets import load_dataset,DatasetEnum\n",
"dfX, y, _ ,_ , _ = load_dataset(DatasetEnum.titanic)\n",
"dfX.head()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 1, 0, 1])"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's load what is needed "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from aikit.ml_machine import AutoMlConfig, JobConfig, MlJobManager, MlJobRunner, AutoMlResultReader\n",
"from aikit.ml_machine import FolderDataPersister, SavingType, AutoMlModelGuider"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### AutoML configuration object\n",
"This object will contain all the relevant information about the problem at hand :\n",
" * it's type : REGRESSION or CLASSIFICATION\n",
" * the information about the column in the data\n",
" * the steps that are needed in the processing pipeline (see explanation after)\n",
" * the models that are to be tested\n",
" * ...\n",
" \n",
" By default the model will guess everything but everything can be changed if needed"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"type of problem : CLASSIFICATION"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_ml_config = AutoMlConfig(dfX = dfX, y = y, name = \"titanic\")\n",
"auto_ml_config.guess_everything()\n",
"auto_ml_config"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### type of problem"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'CLASSIFICATION'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_ml_config.type_of_problem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The config guess that it was a Classification problem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### information about columns"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OrderedDict([('pclass',\n",
" {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),\n",
" ('name',\n",
" {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),\n",
" ('sex',\n",
" {'TypeOfVariable': 'CAT', 'HasMissing': False, 'ToKeep': True}),\n",
" ('age',\n",
" {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),\n",
" ('sibsp',\n",
" {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),\n",
" ('parch',\n",
" {'TypeOfVariable': 'NUM', 'HasMissing': False, 'ToKeep': True}),\n",
" ('ticket',\n",
" {'TypeOfVariable': 'TEXT', 'HasMissing': False, 'ToKeep': True}),\n",
" ('fare',\n",
" {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),\n",
" ('cabin',\n",
" {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),\n",
" ('embarked',\n",
" {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),\n",
" ('boat',\n",
" {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True}),\n",
" ('body',\n",
" {'TypeOfVariable': 'NUM', 'HasMissing': True, 'ToKeep': True}),\n",
" ('home_dest',\n",
" {'TypeOfVariable': 'CAT', 'HasMissing': True, 'ToKeep': True})])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_ml_config.columns_informations"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" HasMissing | \n",
" ToKeep | \n",
" TypeOfVariable | \n",
"
\n",
" \n",
" \n",
" \n",
" pclass | \n",
" False | \n",
" True | \n",
" NUM | \n",
"
\n",
" \n",
" name | \n",
" False | \n",
" True | \n",
" TEXT | \n",
"
\n",
" \n",
" sex | \n",
" False | \n",
" True | \n",
" CAT | \n",
"
\n",
" \n",
" age | \n",
" True | \n",
" True | \n",
" NUM | \n",
"
\n",
" \n",
" sibsp | \n",
" False | \n",
" True | \n",
" NUM | \n",
"
\n",
" \n",
" parch | \n",
" False | \n",
" True | \n",
" NUM | \n",
"
\n",
" \n",
" ticket | \n",
" False | \n",
" True | \n",
" TEXT | \n",
"
\n",
" \n",
" fare | \n",
" True | \n",
" True | \n",
" NUM | \n",
"
\n",
" \n",
" cabin | \n",
" True | \n",
" True | \n",
" CAT | \n",
"
\n",
" \n",
" embarked | \n",
" True | \n",
" True | \n",
" CAT | \n",
"
\n",
" \n",
" boat | \n",
" True | \n",
" True | \n",
" CAT | \n",
"
\n",
" \n",
" body | \n",
" True | \n",
" True | \n",
" NUM | \n",
"
\n",
" \n",
" home_dest | \n",
" True | \n",
" True | \n",
" CAT | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" HasMissing ToKeep TypeOfVariable\n",
"pclass False True NUM\n",
"name False True TEXT\n",
"sex False True CAT\n",
"age True True NUM\n",
"sibsp False True NUM\n",
"parch False True NUM\n",
"ticket False True TEXT\n",
"fare True True NUM\n",
"cabin True True CAT\n",
"embarked True True CAT\n",
"boat True True CAT\n",
"body True True NUM\n",
"home_dest True True CAT"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(auto_ml_config.columns_informations).T"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each column in the DataFrame, its type were guess among the three possible values :\n",
" * NUM : for numerical columns\n",
" * TEXT : for columns that contains text\n",
" * CAT : for categorical columns\n",
"\n",
"Remarks:\n",
" * The difference between TEXT and CAT is based on the number of different modalities\n",
" * Be careful with categorical value that are encoded into integers (algorithm won't know that it is really a categorical feature)\n"
]
},
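{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a guess is wrong (for instance a categorical column encoded as integers), the guessed type can be overridden. A minimal sketch, assuming only the 'columns_informations' dictionary shown above (forcing 'pclass' to CAT is purely illustrative):\n",
"\n",
"```python\n",
"# treat 'pclass' as a categorical feature instead of a numerical one\n",
"auto_ml_config.columns_informations[\"pclass\"][\"TypeOfVariable\"] = \"CAT\"\n",
"```"
]
},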
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### columns block"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OrderedDict([('NUM', ['pclass', 'age', 'sibsp', 'parch', 'fare', 'body']),\n",
" ('TEXT', ['name', 'ticket']),\n",
" ('CAT', ['sex', 'cabin', 'embarked', 'boat', 'home_dest'])])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_ml_config.columns_block"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the ml machine has the notion of block of columns.\n",
"For some use-case features naturally falls into blocks. By default the tool will use the type of feature has blocks. But other things can be used.\n",
"\n",
"The ml machine will sometimes try to create a model without a block"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### needed steps"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'step': 'TextPreprocessing', 'optional': True},\n",
" {'step': 'TextEncoder', 'optional': False},\n",
" {'step': 'TextDimensionReduction', 'optional': True},\n",
" {'step': 'CategoryEncoder', 'optional': False},\n",
" {'step': 'MissingValueImputer', 'optional': False},\n",
" {'step': 'Scaling', 'optional': True},\n",
" {'step': 'DimensionReduction', 'optional': True},\n",
" {'step': 'FeatureExtraction', 'optional': True},\n",
" {'step': 'FeatureSelection', 'optional': True},\n",
" {'step': 'Model', 'optional': False}]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_ml_config.needed_steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ml machine will create processing pipeline by assembling different steps.\n",
"Here are the steps it will use for that use case :\n",
" \n",
" * TextPreprocessing\n",
" * TextEncoder : encoding of text into numerical values\n",
" * TextDimensionReduction : specific dimension reduction for text based features\n",
" \n",
" * CategoryEncoder : encoder of categorical data\n",
" * MissingValueImputer : since there are missing value they need to be filled\n",
" * Scaling : step to re-scale features\n",
" * DimensionReduction : generic dimension reduction\n",
" \n",
" * FeatureExtraction : create new features\n",
" * FeatureSelction : select feature\n",
" \n",
" * Model : the final classification/regression model\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### models to keep"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Model', 'LogisticRegression'),\n",
" ('Model', 'RandomForestClassifier'),\n",
" ('Model', 'ExtraTreesClassifier'),\n",
" ('Model', 'LGBMClassifier'),\n",
" ('FeatureSelection', 'FeaturesSelectorClassifier'),\n",
" ('TextEncoder', 'CountVectorizerWrapper'),\n",
" ('TextEncoder', 'Word2VecVectorizer'),\n",
" ('TextEncoder', 'Char2VecVectorizer'),\n",
" ('TextPreprocessing', 'TextNltkProcessing'),\n",
" ('TextPreprocessing', 'TextDefaultProcessing'),\n",
" ('TextPreprocessing', 'TextDigitAnonymizer'),\n",
" ('CategoryEncoder', 'NumericalEncoder'),\n",
" ('CategoryEncoder', 'TargetEncoderClassifier'),\n",
" ('MissingValueImputer', 'NumImputer'),\n",
" ('DimensionReduction', 'TruncatedSVDWrapper'),\n",
" ('DimensionReduction', 'PCAWrapper'),\n",
" ('TextDimensionReduction', 'TruncatedSVDWrapper'),\n",
" ('DimensionReduction', 'KMeansTransformer'),\n",
" ('Scaling', 'CdfScaler')]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_ml_config.models_to_keep"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This give us the list of models/transformers to test at each steps.\n",
"\n",
"Remarks:\n",
"* some steps are removed because they have no transformer yet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### job configuration"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"job_config = JobConfig()\n",
"job_config.guess_cv(auto_ml_config = auto_ml_config, n_splits = 10)\n",
"job_config.guess_scoring(auto_ml_config = auto_ml_config)\n",
"job_config.score_base_line = None"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['accuracy', 'log_loss_patched', 'avg_roc_auc', 'f1_macro']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"job_config.scoring"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"StratifiedKFold(n_splits=10, random_state=123, shuffle=True)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"job_config.cv"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'accuracy'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"job_config.main_scorer"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"job_config.score_base_line"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The baseline can be setted if we know what a good performance is.\n",
"It will be used to specify the threshold bellow which we stop crossvalidation in the first fold"
]
},
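{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, if we knew that around 80% accuracy is a good performance on this dataset, we could set (the value is purely illustrative):\n",
"\n",
"```python\n",
"job_config.score_base_line = 0.80\n",
"```"
]
},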
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This object has the specific configuration for the job to do :\n",
"* how to cross validate\n",
"* what scoring/benchmark to use"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Persister\n",
"To synchronize processes and to save values, we need an object to take of that.\n",
"\n",
"This object is a DataPersister, which save everything on disk\n",
"(Other persister using database might be created)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"base_folder = # INSERT PATH HERE\n",
"data_persister = FolderDataPersister(base_folder = base_folder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### controller"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result_reader = AutoMlResultReader(data_persister)\n",
"auto_ml_guider = AutoMlModelGuider(result_reader = result_reader, \n",
" job_config = job_config,\n",
" metric_transformation=\"default\",\n",
" avg_metric=True\n",
" )\n",
" \n",
"job_controller = MlJobManager(auto_ml_config = auto_ml_config,\n",
" job_config = job_config,\n",
" auto_ml_guider = auto_ml_guider,\n",
" data_persister = data_persister)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"the search will be driven by a controller process. This process won't actually train models but it will decide what models should be tried.\n",
"\n",
"Here three object are actually created :\n",
" * result reader : its job is to read the result of the auto-ml process and aggregate them\n",
" \n",
" * auto_ml_guider : its job is to help the controller guide the seach (using a bayesian technic)\n",
" \n",
" * job_controller : the controller\n",
" \n",
"All those objects need the 'data_persister' object to write/read data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the controller can be started using:\n",
"\n",
"job_controller.run()\n",
"You need to launch in a subprocess\n"
]
},
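{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of launching it without blocking the notebook, using a daemon thread (a separate process works the same way; this only assumes the 'job_controller' object created above):\n",
"\n",
"```python\n",
"import threading\n",
"\n",
"# run the controller in the background so the notebook stays usable\n",
"controller_thread = threading.Thread(target=job_controller.run, daemon=True)\n",
"controller_thread.start()\n",
"```"
]
},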
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Worker(s)\n",
"\n",
"The last things needed is to create worker(s) that will do the actual cross validation.\n",
"Those worker will :\n",
" * listen to the controller\n",
" * does the cross validation of the models they are told\n",
" * save result"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"job_runner = MlJobRunner(dfX = dfX , \n",
" y = y, \n",
" groups = None,\n",
" auto_ml_config = auto_ml_config, \n",
" job_config = job_config,\n",
" data_persister = data_persister)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"as before the controller can be started using :\n",
"job_runner.run()\n",
"\n",
"You need to launcher that in a Subprocess or a Thread"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Result Reader\n",
"After a few models were tested you can see the result, for that you need the 'result_reader' (which I re-create here for simplicity)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"base_folder = # INSERT path here\n",
"data_persister = FolderDataPersister(base_folder = base_folder)\n",
"\n",
"result_reader = AutoMlResultReader(data_persister)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_results = result_reader.load_all_results()\n",
"df_params = result_reader.load_all_params()\n",
"df_errors = result_reader.load_all_errors()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* df_results : DataFrame with the scoring results\n",
"* df_params : DataFrame with the parameters of the complete processing pipeline\n",
"* df_errors : DataFrame with the errors\n",
"\n",
"All those DataFrames can be joined using the common 'job_id' column"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_merged_result = pd.merge( df_params, df_results, how = \"inner\",on = \"job_id\")\n",
"df_merged_error = pd.merge( df_params, df_errors , how = \"inner\",on = \"job_id\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And result can be writted in an Excel file (for example)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" df_merged_result.to_excel(base_folder + \"/result.xlsx\",index=False)\n",
"except OSError:\n",
" print(\"I couldn't save excel file\")\n",
"\n",
"try:\n",
" df_merged_error.to_excel(base_folder + \"/result_error.xlsx\",index=False)\n",
"except OSError:\n",
" print(\"I couldn't save excel file\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load a given model ####"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from aikit.ml_machine import FolderDataPersister, SavingType\n",
"from aikit.model_definition import sklearn_model_from_param\n",
"\n",
"base_folder = # INSERT path here\n",
"data_persister = FolderDataPersister(base_folder = base_folder)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"job_id = # INSERT job_id here \n",
"job_param = data_persister.read(job_id, path = \"job_param\", write_type = SavingType.json)\n",
"job_param"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = sklearn_model_from_param(job_param[\"model_json\"])\n",
"model"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}