Data Structure Helper

Data Types

Aikit helps deal with the multiple types of data that coexist within the scikit-learn, pandas, numpy and scipy environments, mainly:

  • pandas DataFrame
  • pandas Sparse DataFrame
  • numpy array
  • scipy sparse array (csc, csr, coo, …)

The library offers tools to easily convert between each type.

Within aikit.enums there is a DataTypes enumeration with the following values :

  • ‘DataFrame’
  • ‘SparseDataFrame’
  • ‘Series’
  • ‘NumpyArray’
  • ‘SparseArray’

It is best used as an enumeration, but the values are actual strings, so you can use the strings directly if needed.
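
As a minimal illustration (assuming, as stated above, that the enumeration members are plain strings), the member and the raw string can be compared directly:

from aikit.tools.data_structure_helper import DataTypes

# the enumeration member and the corresponding raw string are interchangeable
assert DataTypes.DataFrame == "DataFrame"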

The function aikit.tools.data_structure_helper.get_type retrieves the type of an object (one element of the enumeration).

Example of use:

import pandas as pd

from aikit.tools.data_structure_helper import get_type, DataTypes

df = pd.DataFrame({"a": [0, 1, 2], "b": ["aaa", "bbb", "ccc"]})
mapped_type = get_type(df)

if mapped_type == DataTypes.DataFrame:
    first_column = df.loc[:, "a"]   # label-based access for a DataFrame
else:
    first_column = df[:, 0]         # positional access otherwise (e.g. numpy array)

Generic Conversion

You can also convert each type to the desired type. This can be useful if a transformer only accepts DataFrames, doesn’t work with sparse arrays, etc. For that, use the function aikit.tools.data_structure_helper.convert_generic

aikit.tools.data_structure_helper.convert_generic(xx, mapped_type=None, output_type=None)

generic conversion function

Parameters:
  • xx (array, DataFrame, ..) – the object to convert
  • mapped_type (enumeration from enums.DataTypes or None) – if not None, the type enumeration of xx
  • output_type (enumeration from enums.DataTypes or None) – if not None, the desired output type

Example:

import pandas as pd

from aikit.tools.data_structure_helper import convert_generic, DataTypes

df = pd.DataFrame({"a": [0, 1, 2], "b": ["aaa", "bbb", "ccc"]})

arr = convert_generic(df, output_type=DataTypes.NumpyArray)
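
If the type of the input has already been retrieved with get_type, the mapped_type argument lets you pass it along so it does not need to be detected again (a small sketch based on the parameter description above):

from aikit.tools.data_structure_helper import get_type

mapped = get_type(df)   # DataTypes.DataFrame here
arr = convert_generic(df, mapped_type=mapped, output_type=DataTypes.NumpyArray)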

Generic Horizontal Stacking

You can also horizontally concatenate several data objects (assuming they have the same number of rows). You can specify the output type you want; if you don’t, the function will guess it :

  • if all inputs have the same type, that type is used
  • if there is a mix of DataFrames and arrays, a DataFrame is used (see the sketch after the example below)
  • if there is a mix of sparse and non-sparse data, the result is converted to dense if it is not too big, otherwise it stays sparse

(See aikit.tools.data_structure_helper.guess_output_type)

The function to concatenate everything is aikit.tools.data_structure_helper.generic_hstack

Example:

import pandas as pd

from aikit.tools.data_structure_helper import generic_hstack

df1 = pd.DataFrame({"a": list(range(10)), "b": ["aaaa", "bbbbb", "cccc"] * 3 + ["ezzzz"]})
df2 = pd.DataFrame({"c": list(range(10)), "d": ["aaaa", "bbbbb", "cccc"] * 3 + ["ezzzz"]})

df12 = generic_hstack((df1, df2))
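
When the inputs do not all share the same type, the guessing rules listed above apply. The sketch below reuses df1 and mixes it with a numpy array; per the rule “DataFrame and Array gives DataFrame”, the result should be a DataFrame (this is a sketch, not an official example):

import numpy as np

from aikit.tools.data_structure_helper import generic_hstack, get_type, DataTypes

arr = np.arange(20).reshape(10, 2)       # same number of rows as df1

mixed = generic_hstack((df1, arr))       # no output_type given: the type is guessed
assert get_type(mixed) == DataTypes.DataFrame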

aikit.tools.data_structure_helper.generic_hstack(all_datas, output_type=None, all_columns_names=None, max_number_of_cells_for_non_sparse=10000000)

generic function to horizontally concatenate data objects

All data objects should have the same number of rows

Parameters:
  • all_datas (list of data objects) – the objects to concatenate
  • output_type (None or type of data) – if None the type will be guessed (see ‘guess_output_type’), otherwise the concatenation uses that format
  • all_columns_names (None or list of names) – if not None, the list of column names of each sub data object
Returns:
  the aggregated object
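
As an illustration of the optional arguments, here is a small sketch that forces a DataFrame output and names the columns of each input block; the exact shape expected for all_columns_names (one list of names per input) is assumed from the parameter description above:

import numpy as np

from aikit.tools.data_structure_helper import generic_hstack, DataTypes

arr1 = np.arange(10).reshape(5, 2)
arr2 = np.arange(10, 20).reshape(5, 2)

# force a DataFrame output and supply column names for each input block
result = generic_hstack(
    (arr1, arr2),
    output_type=DataTypes.DataFrame,
    all_columns_names=[["a", "b"], ["c", "d"]],
)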

Other

Two other functions that can be useful are aikit.tools.data_structure_helper.make1dimension and aikit.tools.data_structure_helper.make2dimensions. They convert an object to a one-dimensional or (at least) two-dimensional object whenever possible.

aikit.tools.data_structure_helper.make1dimension(X)

generic function to make an object one-dimensional

aikit.tools.data_structure_helper.make2dimensions(X)

generic function to make a data object at least two-dimensional

Example

>>> import numpy as np
>>> import pandas as pd
>>> from aikit.tools.data_structure_helper import make2dimensions
>>> df = pd.DataFrame({"a": np.arange(10), "b": ["aa", "bb", "cc"] * 3 + ["dd"]})
>>> assert make2dimensions(df).shape == (10, 2)
>>> assert make2dimensions(df["a"]).shape == (10, 1)
>>> assert make2dimensions(df.values).shape == (10, 2)
>>> assert make2dimensions(df["a"].values).shape == (10, 1)