Choice of columns¶
[1]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _, _, _ = load_dataset(DatasetEnum.titanic)
from aikit.transformers import NumericalEncoder
[2]:
Xtrain
[2]:
pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA |
1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB |
2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN |
3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN |
4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1043 | 2 | Sobey, Mr. Samuel James Hayden | male | 25.0 | 0 | 0 | C.A. 29178 | 13.0000 | NaN | S | NaN | NaN | Cornwall / Houghton, MI |
1044 | 1 | Ryerson, Master. John Borie | male | 13.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | 4 | NaN | Haverford, PA / Cooperstown, NY |
1045 | 2 | Lahtinen, Rev. William | male | 30.0 | 1 | 1 | 250651 | 26.0000 | NaN | S | NaN | NaN | Minneapolis, MN |
1046 | 3 | Drazenoic, Mr. Jozef | male | 33.0 | 0 | 0 | 349241 | 7.8958 | NaN | C | NaN | 51.0 | Austria Niagara Falls, NY |
1047 | 2 | Hosono, Mr. Masabumi | male | 42.0 | 0 | 0 | 237798 | 13.0000 | NaN | S | 10 | NaN | Tokyo, Japan |
1048 rows × 13 columns
[3]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"])
Xencoded = encoder.fit_transform(Xtrain)
Xencoded.head()
[3]:
pclass | name | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest____default__ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | Fortune, Mr. Mark | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 1 | Sagesser, Mlle. Emma | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
3 | 3 | Panula, Master. Urho Abraham | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | Maioni, Miss. Roberta | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
Called like this, the transformer encodes “sex” and “home_dest” and keeps the other columns untouched¶
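The same "encode only the listed columns, pass the rest through" behaviour can be sketched with plain pandas. This is only an illustration of the idea, not aikit's implementation: `pd.get_dummies` does not reproduce NumericalEncoder's `__null__`/`__default__` handling or its modality filtering.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "home_dest": ["London", "New York, NY", "London"],
    "age": [54.0, 24.0, 2.0],
})

# With an explicit `columns` argument, get_dummies dummifies only the
# listed columns and leaves the others (here "age") untouched.
encoded = pd.get_dummies(df, columns=["sex", "home_dest"], prefix_sep="__")
print(sorted(encoded.columns))
```

The untouched "age" column survives alongside the `sex__*` and `home_dest__*` dummy columns, mirroring the default behaviour shown above.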
[4]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_unused_columns=True)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()
(1048, 6)
[4]:
sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest____default__ | |
---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 1 | 0 | 0 | 0 |
Called like this, the transformer encodes “sex” and “home_dest” and drops the other columns¶
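The effect of `drop_unused_columns=True` can likewise be mimicked in plain pandas by restricting the frame to the encoded columns before dummifying; again, this is a sketch of the idea rather than aikit's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female"],
    "home_dest": ["London", "New York, NY"],
    "age": [54.0, 24.0],
})

# Selecting only the columns to encode first means the unused column
# ("age") never reaches the output -- similar to drop_unused_columns=True.
encoded = pd.get_dummies(df[["sex", "home_dest"]], prefix_sep="__")
print(list(encoded.columns))
```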
[5]:
Xtrain["home_dest"].value_counts()
[5]:
New York, NY 47
London 11
Cornwall / Akron, OH 9
Winnipeg, MB 7
Montreal, PQ 7
..
London / Birmingham 1
Folkstone, Kent / New York, NY 1
Treherbert, Cardiff, Wales 1
Devonport, England 1
Buenos Aires, Argentina / New Jersey, NJ 1
Name: home_dest, Length: 333, dtype: int64
Only the most frequent modalities are kept (this can be changed)¶
[6]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"],
drop_unused_columns=True,
min_modalities_number=400)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()
(1048, 336)
[6]:
sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest__Cornwall / Akron, OH | home_dest__Winnipeg, MB | home_dest__Montreal, PQ | home_dest__Philadelphia, PA | home_dest__Paris, France | ... | home_dest__Deer Lodge, MT | home_dest__Bristol, England / New Britain, CT | home_dest__Holley, NY | home_dest__Bryn Mawr, PA, USA | home_dest__Tokyo, Japan | home_dest__Oslo, Norway Cameron, WI | home_dest__Cambridge, MA | home_dest__Ireland Brooklyn, NY | home_dest__England | home_dest__Aughnacliff, Co Longford, Ireland New York, NY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 336 columns
If I specify ‘min_modalities_number’ = 400, all the modalities are kept¶
Filtering only starts when a column has more than 400 modalities.
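The frequency-based filtering idea can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not aikit's actual implementation (the function name and the exact tie-breaking are assumptions):

```python
from collections import Counter

def filter_modalities(values, min_modalities_number=2):
    """Keep the most frequent modalities and fold the rest into
    '__default__'. Filtering only triggers when the column has more
    distinct modalities than the threshold (a sketch of the idea,
    not aikit's code)."""
    counts = Counter(values)
    if len(counts) <= min_modalities_number:
        return list(values)  # few enough modalities: keep everything
    kept = {m for m, _ in counts.most_common(min_modalities_number)}
    return [v if v in kept else "__default__" for v in values]

vals = ["NY", "NY", "NY", "London", "London", "Paris", "Oslo"]
print(filter_modalities(vals, min_modalities_number=2))
# -> ['NY', 'NY', 'NY', 'London', 'London', '__default__', '__default__']
```

With a high threshold (like `min_modalities_number=400` above) every modality survives, which is why the encoded frame grows to 336 columns.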
[7]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_used_columns=False)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()
(1048, 19)
[7]:
pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest | sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest____default__ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
Called like this, the transformer encodes “sex” and “home_dest” but also keeps the original columns in the final result¶
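In plain pandas, keeping the encoded source columns alongside their dummies amounts to concatenating the original frame with the dummy columns; a minimal sketch of what `drop_used_columns=False` produces (illustration only, not aikit's internals):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female"], "age": [54.0, 24.0]})

# Dummify "sex", then keep the original column next to its dummies,
# similar in spirit to drop_used_columns=False.
dummies = pd.get_dummies(df[["sex"]], prefix_sep="__")
out = pd.concat([df, dummies], axis=1)
print(list(out.columns))
```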
[8]:
import numpy as np
import pandas as pd

from aikit.transformers import TruncatedSVDWrapper
X = pd.DataFrame(np.random.randn(100, 20), columns=[f"COL_{j}" for j in range(20)])
svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=True)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[8]:
SVD__0 | SVD__1 | |
---|---|---|
0 | 2.075079 | -0.858158 |
1 | 0.332307 | -0.970121 |
2 | 2.279417 | 1.340435 |
3 | -0.563442 | 0.551599 |
4 | -1.640313 | -1.569441 |
[9]:
svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=False)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[9]:
COL_0 | COL_1 | COL_2 | COL_3 | COL_4 | COL_5 | COL_6 | COL_7 | COL_8 | COL_9 | ... | COL_12 | COL_13 | COL_14 | COL_15 | COL_16 | COL_17 | COL_18 | COL_19 | SVD__0 | SVD__1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.858982 | -0.655989 | -0.028417 | -0.357398 | 0.569531 | 0.145816 | 0.552368 | 1.983438 | 1.092890 | -0.453562 | ... | 0.285189 | -0.604234 | -1.053623 | -0.291745 | -1.646335 | -0.215531 | 0.008500 | 1.100297 | 2.076530 | -0.854836 |
1 | -0.187936 | 0.041684 | 0.941944 | 1.898925 | 0.179125 | 0.636418 | 2.050173 | 0.229349 | -1.910368 | 0.702720 | ... | -0.533445 | -0.371779 | -0.401205 | 0.231492 | -1.043176 | 1.842388 | 0.329271 | 0.882017 | 0.346758 | -0.951460 |
2 | 1.097298 | -0.136058 | -0.323606 | -1.096158 | -0.009371 | -0.945267 | 1.455854 | -0.108160 | 1.141867 | -1.407562 | ... | 2.310153 | 2.414735 | -0.184708 | -1.486121 | -0.676003 | -0.686621 | -0.836830 | 0.972978 | 2.330389 | 1.407472 |
3 | 0.928934 | 0.269935 | -1.274605 | -0.287077 | 0.279328 | -0.320871 | 0.802277 | -0.713909 | -1.039250 | 1.227245 | ... | 0.020298 | 0.259960 | -0.885320 | 0.014820 | 0.268819 | -0.432435 | 1.254164 | 0.031453 | -0.572056 | 0.539412 |
4 | -0.714467 | 1.637883 | -0.451313 | 0.409956 | 0.565926 | 0.448906 | -0.128214 | -0.845320 | 0.433473 | -0.416148 | ... | 0.758863 | -1.702709 | -0.000005 | -0.293631 | -0.859405 | -0.167067 | 0.400996 | -1.095900 | -1.603850 | -1.532443 |
5 rows × 22 columns
Another example of the usage of ‘drop_used_columns’ and ‘drop_unused_columns’:

* in the first case (drop_used_columns=True): only the SVD columns are retrieved
* in the second case (drop_used_columns=False): the original columns AND the SVD columns are retrieved
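The two behaviours can be reproduced with plain numpy and pandas: project onto the top-2 right singular vectors (which is what a truncated SVD computes) and either return only the projection or concatenate it with the original frame. A sketch, not the TruncatedSVDWrapper implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.standard_normal((100, 20)),
                 columns=[f"COL_{j}" for j in range(20)])

# Top-2 SVD projection via numpy
U, s, Vt = np.linalg.svd(X.values, full_matrices=False)
svd_cols = pd.DataFrame(X.values @ Vt[:2].T, columns=["SVD__0", "SVD__1"])

dropped = svd_cols                       # like drop_used_columns=True
kept = pd.concat([X, svd_cols], axis=1)  # like drop_used_columns=False
print(dropped.shape, kept.shape)
```

`dropped` has only the 2 SVD columns; `kept` has the 20 original columns plus the 2 SVD columns, matching the (100, 2) and (100, 22) shapes in the cells above.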
[ ]: