How to Wrap a transformer

To wrap a new model you should:
  1. Create a new class that inherits from ModelWrapper
  2. In the __init__ of that class, specify the rules of the wrapper (see just after)
  3. Create a _get_model method to specify the underlying transformer

class aikit.transformers.model_wrapper.ModelWrapper(columns_to_use, work_on_one_column_only, all_columns_at_once, accepted_input_types, remove_sparse_serie, column_prefix, desired_output_type, must_transform_to_get_features_name, dont_change_columns, drop_used_columns=True, drop_unused_columns=True, regex_match=False)

This is a generic class to help wrap existing transformers and make them more robust.

Parameters:
  • columns_to_use (None or list of strings) – this parameter allows the wrapped transformer to select the columns it works on
  • work_on_one_column_only (boolean) – if True, tells that the underlying transformer works on 1-dimensional data (a pd.Series for example)
  • all_columns_at_once (boolean) – if False, tells that the underlying transformer only knows how to work on a single column. This is the case for sklearn CountVectorizer for example. In that case the wrapped model will still work on several columns (a clone of the underlying model will be created for each column)
  • accepted_input_types (list of DataType) – tells what input types are accepted by the underlying transformer; a conversion will be made if the input type is not among that list. If None, nothing is done
  • remove_sparse_serie (bool) – if True, will remove sparse Series from the DataFrame
  • column_prefix (str or None) – if we want the features names to be prefixed by something like ‘SVD_’ or ‘BAG_’ (for TruncatedSVD or CountVectorizer)
  • desired_output_type (None or DataType) – specifies the desired output type of the transformer; a conversion will be made if necessary
  • must_transform_to_get_features_name (boolean) – specifies whether the transformer should transform its data in order to get its features names. Ideally the underlying transformer should implement a ‘get_feature_names’ method, but sometimes the features names can only be retrieved from the columns of the transformed DataFrame
  • dont_change_columns (boolean) – indicates that the transformer doesn’t change the columns (for example a StandardScaler); if that is the case you know that the resulting features are the input features
  • drop_used_columns (boolean, default=True) – what to do with the ORIGINAL columns that were transformed. If False, they will be kept (un-transformed) in the result; if True, only the transformed columns are in the result
  • drop_unused_columns (boolean, default=True) – what to do with the columns that were not used. If True, they will be dropped; if False, they will be kept in the result
  • regex_match (boolean, default=False) – if True, will use a regex to match columns; otherwise an exact match is used
A few notes:
  • must_transform_to_get_features_name and dont_change_columns are here to help the wrapped transformer implement a correct ‘get_feature_names’ method
  • the wrapped model has a ‘model’ attribute that gives access to the underlying transformer(s)
  • the wrapped model will raise a NotFittedError when called without being fitted first (a behavior that is not consistent across all transformers)

Here is an example of how to wrap sklearn CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

from aikit.enums import DataTypes
from aikit.transformers.model_wrapper import ModelWrapper


class CountVectorizerWrapper(ModelWrapper):
    """ wrapper around sklearn CountVectorizer with additional capabilities

     * can select its columns to keep/drop
     * works on more than one column
     * can return a DataFrame
     * can add a prefix to the name of columns

    """
    def __init__(self,
                 columns_to_use = "all",
                 analyzer = "word",
                 max_df = 1.0,
                 min_df = 1,
                 ngram_range = 1,
                 max_features = None,
                 regex_match = False,
                 desired_output_type = DataTypes.SparseArray
                 ):

        self.analyzer = analyzer
        self.max_df = max_df
        self.min_df = min_df
        self.ngram_range = ngram_range
        self.max_features = max_features
        self.columns_to_use = columns_to_use
        self.regex_match = regex_match
        self.desired_output_type = desired_output_type

        super(CountVectorizerWrapper, self).__init__(
            columns_to_use = columns_to_use,
            regex_match = regex_match,

            work_on_one_column_only = True,
            all_columns_at_once = False,
            accepted_input_types = (DataTypes.DataFrame, DataTypes.NumpyArray),
            remove_sparse_serie = False,
            column_prefix = "BAG",
            desired_output_type = desired_output_type,
            must_transform_to_get_features_name = False,
            dont_change_columns = False)


    def _get_model(self, X, y = None):

        # ngram_range can be given either as an integer (upper bound) or as an explicit (min, max) pair
        if not isinstance(self.ngram_range, (tuple, list)):
            ngram_range = (1, self.ngram_range)
        else:
            ngram_range = self.ngram_range

        ngram_range = tuple(ngram_range)

        return CountVectorizer(analyzer = self.analyzer,
                               max_df = self.max_df,
                               min_df = self.min_df,
                               max_features = self.max_features,
                               ngram_range = ngram_range)
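
Once wrapped, the transformer behaves like any other sklearn transformer. Below is a minimal usage sketch (the DataFrame and its column names are invented for illustration; the exact feature names produced depend on the prefixing logic):

import pandas as pd

df = pd.DataFrame({
    "text1": ["the cat sat on the mat", "the dog barked"],
    "text2": ["hello world", "goodbye world"],
})

vect = CountVectorizerWrapper(columns_to_use=["text1", "text2"])
vect.fit(df)

Xres = vect.transform(df)           # sparse result (desired_output_type = SparseArray)
print(vect.get_feature_names())     # features names, prefixed with 'BAG'
print(vect.model)                   # the underlying CountVectorizer(s)

Since all_columns_at_once is False, a separate CountVectorizer is fitted for each text column.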

And here is an example of how to wrap TruncatedSVD to make it work with DataFrames and create column features:

from sklearn.decomposition import TruncatedSVD

class TruncatedSVDWrapper(ModelWrapper):
    """ wrapper around sklearn TruncatedSVD

    * can select its columns to keep/drop
    * works on more than one column
    * can return a DataFrame
    * can add a prefix to the name of columns

    n_components can be a float; if that is the case it is considered to be a percentage of the total number of columns

    """
    def __init__(self,
                 n_components = 2,
                 columns_to_use = "all",
                 regex_match = False
                 ):
        self.n_components = n_components
        self.columns_to_use = columns_to_use
        self.regex_match = regex_match

        super(TruncatedSVDWrapper, self).__init__(
            columns_to_use = columns_to_use,
            regex_match = regex_match,

            work_on_one_column_only = False,
            all_columns_at_once = True,
            accepted_input_types = None,
            remove_sparse_serie = False,
            column_prefix = "SVD",
            desired_output_type = DataTypes.DataFrame,
            must_transform_to_get_features_name = True,
            dont_change_columns = False)


    def _get_model(self, X, y = None):

        # _nbcols and int_n_components are aikit helpers: they compute the number of
        # columns of X and resolve a float n_components into an integer number of components
        nbcolumns = _nbcols(X)
        n_components = int_n_components(nbcolumns, self.n_components)

        return TruncatedSVD(n_components = n_components)
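
Again a short usage sketch (random numeric data, invented for illustration); since desired_output_type is DataTypes.DataFrame, the result is a DataFrame whose columns carry the ‘SVD’ prefix:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 10),
                  columns=["col_%d" % i for i in range(10)])

# n_components = 0.5 is interpreted as 50% of the 10 input columns, i.e. 5 components
svd = TruncatedSVDWrapper(n_components=0.5, columns_to_use="all")
Xres = svd.fit_transform(df)

print(type(Xres))      # a pandas DataFrame
print(Xres.columns)    # columns prefixed with 'SVD'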

What happens during the fit

To help understand a little more what goes on, here is a brief summary of the fit method (a simplified sketch follows the list):

  1. if ‘columns_to_use’ is set, an aikit.transformers.model_wrapper.ColumnsSelector is created and fitted to subset the columns
  2. the type and shape of the input are stored
  3. the input is converted if its type is not among the list of accepted input types
  4. the input is converted to 1 or 2 dimensions (also depending on what is accepted by the underlying transformer)
  5. the underlying transformer is created (using ‘_get_model’) and fitted
  6. logic is applied to try to figure out the features names
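
To make these steps a bit more concrete, here is a heavily simplified, self-contained sketch that mimics the flow above for the easiest case (this is illustrative code, not the actual aikit implementation; the class, the model_factory argument and the attribute names are invented, and the clone-per-column logic, type conversions and sparse handling are left out):

import numpy as np


class SimplifiedWrapper:
    """ heavily simplified illustration of the fit flow described above:
    it only handles DataFrame input, one accepted input type (numpy array)
    and the all_columns_at_once=True case """

    def __init__(self, model_factory, columns_to_use=None, column_prefix=None):
        self.model_factory = model_factory      # callable returning the underlying transformer
        self.columns_to_use = columns_to_use
        self.column_prefix = column_prefix

    def fit(self, X, y=None):
        # 1. subset the columns if 'columns_to_use' is set
        if self.columns_to_use is not None:
            X = X.loc[:, self.columns_to_use]

        # 2. store the type and shape of the input
        self._input_type = type(X)
        self._input_shape = X.shape

        # 3. and 4. convert to the (single) accepted input type: a 2d numpy array
        Xc = np.asarray(X)

        # 5. create the underlying transformer and fit it
        self.model = self.model_factory()
        self.model.fit(Xc, y)

        # 6. figure out the output feature names, here simply using the prefix
        n_out = self.model.transform(Xc).shape[1]
        self._feature_names = ["%s_%d" % (self.column_prefix, i) for i in range(n_out)]

        return self

The real ModelWrapper additionally handles the clone-per-column case (all_columns_at_once = False), conversions between the supported input/output types, sparse Series removal and regex column matching.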