How to Wrap a transformer¶

To wrap a new model you should

Create a new class that inherit from ModelWrapper
In the __init__ of that class specify the rules of the wrapper (see just after)
create a _get_model method to specify the underlying transformers

A few notes:

must_transform_to_get_features_name and dont_change_columns are here to help the wrapped transformers to implement a correct ‘get_feature_names’
the wrapped model has a ‘model’ attribute that retrieves the underlying transformer(s)
the wrapped model will generate a NotFittedError error when called without being fit first (this behavior is not consistent across all transformers)

Here is an example of how to wrap sklearn CountVectorizer:

class CountVectorizerWrapper(ModelWrapper):
    """ wrapper around sklearn CountVectorizer with additionnal capabilities

     * can select its columns to keep/drop
     * work on more than one columns
     * can return a DataFrame
     * can add a prefix to the name of columns

    """
    def __init__(self,
                 columns_to_use = "all",
                 analyzer = "word",
                 max_df = 1.0,
                 min_df = 1,
                 ngram_range = 1,
                 max_features = None,
                 columns_to_use = None,
                 regex_match    = False,
                 desired_output_type = DataTypes.SparseArray
                 ):

        self.analyzer = analyzer
        self.max_df = max_df
        self.min_df = min_df
        self.ngram_range = ngram_range
        self.columns_to_use = columns_to_use
        self.regex_match    = regex_match
        self.desired_output_type = desired_output_type

        super(CountVectorizerWrapper,self).__init__(
            columns_to_use = columns_to_use,
            regex_match    = regex_match,

            work_on_one_column_only = True,
            all_columns_at_once = False,
            accepted_input_types = (DataTypes.DataFrame,DataTypes.NumpyArray),
            column_prefix = "BAG",
            desired_output_type = desired_output_type,
            must_transform_to_get_features_name = False,
            dont_change_columns = False)


    def _get_model(self,X,y = None):

        if not isinstance(self.ngram_range,(tuple,list)):
            ngram_range = (1,self.ngram_range)
        else:
            ngram_range = self.ngram_range

        ngram_range = tuple(ngram_range)

        return CountVectorizer(analyzer = self.analyzer,
                               max_df = self.max_df,
                               min_df = self.min_df,
                               ngram_range = ngram_range)

And here is an example of how to wrap TruncatedSVD to make it work with DataFrame and create columns features:

class TruncatedSVDWrapper(ModelWrapper):
    """ wrapper around sklearn TruncatedSVD

    * can select its columns to keep/drop
    * work on more than one columns
    * can return a DataFrame
    * can add a prefix to the name of columns

    n_components can be a float, if that is the case it is considered to be a percentage of the total number of columns

    """
    def __init__(self,
                 n_components = 2,
                 columns_to_use = "all",
                 regex_match  = False
                 ):
        self.n_components = n_components
        self.columns_to_use = columns_to_use
        self.regex_match    = regex_match

        super(TruncatedSVDWrapper,self).__init__(
            columns_to_use = columns_to_use,
            regex_match    = regex_match,

            work_on_one_column_only = False,
            all_columns_at_once = True,
            accepted_input_types = None,
            column_prefix = "SVD",
            desired_output_type = DataTypes.DataFrame,
            must_transform_to_get_features_name = True,
            dont_change_columns = False)


    def _get_model(self,X,y = None):

        nbcolumns = _nbcols(X)
        n_components = int_n_components(nbcolumns, self.n_components)

        return TruncatedSVD(n_components = n_components)

What append during the fit¶

To help understand a little more what goes on, here is a brief summary the fit method

if ‘columns_to_use’ is set, creation and fit of a aikit.transformers.model_wrapper.ColumnsSelector to subset the column

type and shape of input are stored

input is converted if it is not among the list of accepted input types

input is converted to be 1 or 2 dimensions (also depending on what is accepted by the underlying transformer)

underlying transformer is created (using ‘_get_model’) and fitted

logic is applied to try to figure out the features names