Choice of columns

[1]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _, _, _ = load_dataset(DatasetEnum.titanic)

from aikit.transformers import NumericalEncoder

[2]:
Xtrain
[2]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
1043 2 Sobey, Mr. Samuel James Hayden male 25.0 0 0 C.A. 29178 13.0000 NaN S NaN NaN Cornwall / Houghton, MI
1044 1 Ryerson, Master. John Borie male 13.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C 4 NaN Haverford, PA / Cooperstown, NY
1045 2 Lahtinen, Rev. William male 30.0 1 1 250651 26.0000 NaN S NaN NaN Minneapolis, MN
1046 3 Drazenoic, Mr. Jozef male 33.0 0 0 349241 7.8958 NaN C NaN 51.0 Austria Niagara Falls, NY
1047 2 Hosono, Mr. Masabumi male 42.0 0 0 237798 13.0000 NaN S 10 NaN Tokyo, Japan

1048 rows × 13 columns

[3]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"])
Xencoded = encoder.fit_transform(Xtrain)
Xencoded.head()

[3]:
pclass name age sibsp parch ticket fare cabin embarked boat body sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest____default__
0 1 McCarthy, Mr. Timothy J 54.0 0 0 17463 51.8625 E46 S NaN 175.0 1 0 0 0 0 1
1 1 Fortune, Mr. Mark 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN 1 0 0 0 0 1
2 1 Sagesser, Mlle. Emma 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN 0 1 1 0 0 0
3 3 Panula, Master. Urho Abraham 2.0 4 1 3101295 39.6875 NaN S NaN NaN 1 0 1 0 0 0
4 1 Maioni, Miss. Roberta 16.0 0 0 110152 86.5000 B79 S 8 NaN 0 1 1 0 0 0

Called like this, the transformer encodes “sex” and “home_dest” and keeps the other columns untouched.
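This keep-the-rest behaviour can be sketched with plain pandas (`pd.get_dummies` is a stand-in here, not aikit's implementation, and the toy frame is hypothetical data, not the Titanic set):

```python
import pandas as pd

# Toy frame standing in for Xtrain (hypothetical values)
X = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "home_dest": ["New York, NY", "London", "New York, NY"],
    "age": [54.0, 24.0, 2.0],
})

# One-hot encode the chosen columns, keep the untouched ones alongside
encoded = pd.get_dummies(X[["sex", "home_dest"]], prefix_sep="__")
result = pd.concat([X.drop(columns=["sex", "home_dest"]), encoded], axis=1)
print(result.columns.tolist())
```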

[4]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_unused_columns=True)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()

(1048, 6)
[4]:
sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest____default__
0 1 0 0 0 0 1
1 1 0 0 0 0 1
2 0 1 1 0 0 0
3 1 0 1 0 0 0
4 0 1 1 0 0 0

Called like this, the transformer encodes “sex” and “home_dest” and drops the other columns.

[5]:
Xtrain["home_dest"].value_counts()
[5]:
New York, NY                                47
London                                      11
Cornwall / Akron, OH                         9
Winnipeg, MB                                 7
Montreal, PQ                                 7
                                            ..
London / Birmingham                          1
Folkstone, Kent / New York, NY               1
Treherbert, Cardiff, Wales                   1
Devonport, England                           1
Buenos Aires, Argentina / New Jersey, NJ     1
Name: home_dest, Length: 333, dtype: int64

Only the most frequent modalities are kept; the rest are grouped into a default bucket (this behaviour can be changed).
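The grouping of rare modalities can be sketched in plain pandas (a minimal illustration, not aikit's internal logic; the `min_count` threshold and `__default__` label are assumptions for the sketch):

```python
import pandas as pd

# Toy series standing in for Xtrain["home_dest"]
s = pd.Series(["New York, NY"] * 5 + ["London"] * 3 + ["Oslo, Norway", "Tokyo, Japan"])

# Keep only modalities seen at least `min_count` times; map the rest to a default bucket
min_count = 2
counts = s.value_counts()
kept = counts[counts >= min_count].index
grouped = s.where(s.isin(kept), "__default__")
print(grouped.value_counts())
```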

[6]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"],
                           drop_unused_columns=True,
                           min_modalities_number=400)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()

(1048, 336)
[6]:
sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest__Cornwall / Akron, OH home_dest__Winnipeg, MB home_dest__Montreal, PQ home_dest__Philadelphia, PA home_dest__Paris, France ... home_dest__Deer Lodge, MT home_dest__Bristol, England / New Britain, CT home_dest__Holley, NY home_dest__Bryn Mawr, PA, USA home_dest__Tokyo, Japan home_dest__Oslo, Norway Cameron, WI home_dest__Cambridge, MA home_dest__Ireland Brooklyn, NY home_dest__England home_dest__Aughnacliff, Co Longford, Ireland New York, NY
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 336 columns

If I specify `min_modalities_number=400`, all the modalities are kept: filtering kicks in only when a column has more than 400 modalities.

[7]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_used_columns=False)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()

(1048, 19)
[7]:
pclass name sex age sibsp parch ticket fare cabin embarked boat body home_dest sex__male sex__female home_dest____null__ home_dest__New York, NY home_dest__London home_dest____default__
0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S NaN 175.0 Dorchester, MA 1 0 0 0 0 1
1 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S NaN NaN Winnipeg, MB 1 0 0 0 0 1
2 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C 9 NaN NaN 0 1 1 0 0 0
3 3 Panula, Master. Urho Abraham male 2.0 4 1 3101295 39.6875 NaN S NaN NaN NaN 1 0 1 0 0 0
4 1 Maioni, Miss. Roberta female 16.0 0 0 110152 86.5000 B79 S 8 NaN NaN 0 1 1 0 0 0

Called like this, the transformer encodes “sex” and “home_dest” but also keeps the original columns in the final result.

[8]:
import numpy as np
import pandas as pd

from aikit.transformers import TruncatedSVDWrapper

X = pd.DataFrame(np.random.randn(100,20), columns=[f"COL_{j}" for j in range(20)])

svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=True)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[8]:
SVD__0 SVD__1
0 2.075079 -0.858158
1 0.332307 -0.970121
2 2.279417 1.340435
3 -0.563442 0.551599
4 -1.640313 -1.569441
[9]:
svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=False)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[9]:
COL_0 COL_1 COL_2 COL_3 COL_4 COL_5 COL_6 COL_7 COL_8 COL_9 ... COL_12 COL_13 COL_14 COL_15 COL_16 COL_17 COL_18 COL_19 SVD__0 SVD__1
0 0.858982 -0.655989 -0.028417 -0.357398 0.569531 0.145816 0.552368 1.983438 1.092890 -0.453562 ... 0.285189 -0.604234 -1.053623 -0.291745 -1.646335 -0.215531 0.008500 1.100297 2.076530 -0.854836
1 -0.187936 0.041684 0.941944 1.898925 0.179125 0.636418 2.050173 0.229349 -1.910368 0.702720 ... -0.533445 -0.371779 -0.401205 0.231492 -1.043176 1.842388 0.329271 0.882017 0.346758 -0.951460
2 1.097298 -0.136058 -0.323606 -1.096158 -0.009371 -0.945267 1.455854 -0.108160 1.141867 -1.407562 ... 2.310153 2.414735 -0.184708 -1.486121 -0.676003 -0.686621 -0.836830 0.972978 2.330389 1.407472
3 0.928934 0.269935 -1.274605 -0.287077 0.279328 -0.320871 0.802277 -0.713909 -1.039250 1.227245 ... 0.020298 0.259960 -0.885320 0.014820 0.268819 -0.432435 1.254164 0.031453 -0.572056 0.539412
4 -0.714467 1.637883 -0.451313 0.409956 0.565926 0.448906 -0.128214 -0.845320 0.433473 -0.416148 ... 0.758863 -1.702709 -0.000005 -0.293631 -0.859405 -0.167067 0.400996 -1.095900 -1.603850 -1.532443

5 rows × 22 columns

Another example of the usage of ‘drop_used_columns’ and ‘drop_unused_columns’:

* in the first case (drop_used_columns=True): only the SVD columns are returned
* in the second case (drop_used_columns=False): I retrieve the original columns AND the SVD columns
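The two modes above can be mimicked by hand with a plain truncated SVD and `pd.concat` (a sketch of the idea, not aikit's implementation):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(100, 20), columns=[f"COL_{j}" for j in range(20)])

# Truncated SVD by hand: keep the first two left singular vectors scaled by their singular values
U, S, Vt = np.linalg.svd(X.values, full_matrices=False)
components = pd.DataFrame(U[:, :2] * S[:2], columns=["SVD__0", "SVD__1"], index=X.index)

only_svd = components                               # like drop_used_columns=True
with_original = pd.concat([X, components], axis=1)  # like drop_used_columns=False
print(only_svd.shape, with_original.shape)
```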
