Choice of columns¶
[1]:
from aikit.datasets.datasets import load_dataset, DatasetEnum
Xtrain, y_train, _, _, _ = load_dataset(DatasetEnum.titanic)
from aikit.transformers import NumericalEncoder
[2]:
Xtrain
[2]:
pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA |
1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB |
2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN |
3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN |
4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1043 | 2 | Sobey, Mr. Samuel James Hayden | male | 25.0 | 0 | 0 | C.A. 29178 | 13.0000 | NaN | S | NaN | NaN | Cornwall / Houghton, MI |
1044 | 1 | Ryerson, Master. John Borie | male | 13.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | 4 | NaN | Haverford, PA / Cooperstown, NY |
1045 | 2 | Lahtinen, Rev. William | male | 30.0 | 1 | 1 | 250651 | 26.0000 | NaN | S | NaN | NaN | Minneapolis, MN |
1046 | 3 | Drazenoic, Mr. Jozef | male | 33.0 | 0 | 0 | 349241 | 7.8958 | NaN | C | NaN | 51.0 | Austria Niagara Falls, NY |
1047 | 2 | Hosono, Mr. Masabumi | male | 42.0 | 0 | 0 | 237798 | 13.0000 | NaN | S | 10 | NaN | Tokyo, Japan |
1048 rows × 13 columns
[3]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"])
Xencoded = encoder.fit_transform(Xtrain)
Xencoded.head()
[3]:
pclass | name | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest____default__ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | Fortune, Mr. Mark | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 1 | Sagesser, Mlle. Emma | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
3 | 3 | Panula, Master. Urho Abraham | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | Maioni, Miss. Roberta | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
Called like this, the transformer encodes “sex” and “home_dest” and keeps the other columns untouched¶
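The same "encode only the listed columns, pass the rest through" behaviour can be sketched with plain pandas. This is only an illustration of the idea, not aikit's implementation: `pd.get_dummies` does not reproduce NumericalEncoder's `__null__`/`__default__` handling or its modality filtering.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column
df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "home_dest": ["London", "New York, NY", "London"],
    "age": [54.0, 24.0, 2.0],
})

# With an explicit `columns` argument, get_dummies dummifies only the
# listed columns and leaves the others (here "age") untouched.
encoded = pd.get_dummies(df, columns=["sex", "home_dest"], prefix_sep="__")
print(sorted(encoded.columns))
```

The untouched "age" column survives alongside the `sex__*` and `home_dest__*` dummy columns, mirroring the default behaviour shown above.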
[4]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_unused_columns=True)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()
(1048, 6)
[4]:
sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest____default__ | |
---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 1 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 1 | 0 | 0 | 0 |
Called like this, the transformer encodes “sex” and “home_dest” and drops the other columns¶
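The effect of `drop_unused_columns=True` can likewise be mimicked in plain pandas by restricting the frame to the encoded columns before dummifying; again, this is a sketch of the idea rather than aikit's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female"],
    "home_dest": ["London", "New York, NY"],
    "age": [54.0, 24.0],
})

# Selecting only the columns to encode first means the unused column
# ("age") never reaches the output -- similar to drop_unused_columns=True.
encoded = pd.get_dummies(df[["sex", "home_dest"]], prefix_sep="__")
print(list(encoded.columns))
```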
[5]:
Xtrain["home_dest"].value_counts()
[5]:
New York, NY 47
London 11
Cornwall / Akron, OH 9
Winnipeg, MB 7
Montreal, PQ 7
..
London / Birmingham 1
Folkstone, Kent / New York, NY 1
Treherbert, Cardiff, Wales 1
Devonport, England 1
Buenos Aires, Argentina / New Jersey, NJ 1
Name: home_dest, Length: 333, dtype: int64
Only the most frequent modalities are kept (this can be changed)¶
[6]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"],
drop_unused_columns=True,
min_modalities_number=400)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()
(1048, 336)
[6]:
sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest__Cornwall / Akron, OH | home_dest__Winnipeg, MB | home_dest__Montreal, PQ | home_dest__Philadelphia, PA | home_dest__Paris, France | ... | home_dest__Deer Lodge, MT | home_dest__Bristol, England / New Britain, CT | home_dest__Holley, NY | home_dest__Bryn Mawr, PA, USA | home_dest__Tokyo, Japan | home_dest__Oslo, Norway Cameron, WI | home_dest__Cambridge, MA | home_dest__Ireland Brooklyn, NY | home_dest__England | home_dest__Aughnacliff, Co Longford, Ireland New York, NY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 336 columns
If I specify ‘min_modalities_number’ = 400, all the modalities are kept¶
Filtering only starts when a column has more than 400 modalities.
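The frequency-based filtering idea can be sketched in a few lines of plain Python. This is a simplified illustration of the concept, not aikit's actual implementation (the function name and the exact tie-breaking are assumptions):

```python
from collections import Counter

def filter_modalities(values, min_modalities_number=2):
    """Keep the most frequent modalities and fold the rest into
    '__default__'. Filtering only triggers when the column has more
    distinct modalities than the threshold (a sketch of the idea,
    not aikit's code)."""
    counts = Counter(values)
    if len(counts) <= min_modalities_number:
        return list(values)  # few enough modalities: keep everything
    kept = {m for m, _ in counts.most_common(min_modalities_number)}
    return [v if v in kept else "__default__" for v in values]

vals = ["NY", "NY", "NY", "London", "London", "Paris", "Oslo"]
print(filter_modalities(vals, min_modalities_number=2))
# -> ['NY', 'NY', 'NY', 'London', 'London', '__default__', '__default__']
```

With a high threshold (like `min_modalities_number=400` above) every modality survives, which is why the encoded frame grows to 336 columns.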
[7]:
encoder = NumericalEncoder(columns_to_use=["sex","home_dest"], drop_used_columns=False)
Xencoded = encoder.fit_transform(Xtrain)
print(Xencoded.shape)
Xencoded.head()
(1048, 19)
[7]:
pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_dest | sex__male | sex__female | home_dest____null__ | home_dest__New York, NY | home_dest__London | home_dest____default__ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | NaN | 175.0 | Dorchester, MA | 1 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | NaN | NaN | Winnipeg, MB | 1 | 0 | 0 | 0 | 0 | 1 |
2 | 1 | Sagesser, Mlle. Emma | female | 24.0 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
3 | 3 | Panula, Master. Urho Abraham | male | 2.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | NaN | NaN | NaN | 1 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | Maioni, Miss. Roberta | female | 16.0 | 0 | 0 | 110152 | 86.5000 | B79 | S | 8 | NaN | NaN | 0 | 1 | 1 | 0 | 0 | 0 |
Called like this, the transformer encodes “sex” and “home_dest” but also keeps the original columns in the final result¶
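In plain pandas, keeping the encoded source columns alongside their dummies amounts to concatenating the original frame with the dummy columns; a minimal sketch of what `drop_used_columns=False` produces (illustration only, not aikit's internals):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female"], "age": [54.0, 24.0]})

# Dummify "sex", then keep the original column next to its dummies,
# similar in spirit to drop_used_columns=False.
dummies = pd.get_dummies(df[["sex"]], prefix_sep="__")
out = pd.concat([df, dummies], axis=1)
print(list(out.columns))
```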
[8]:
import numpy as np
import pandas as pd

from aikit.transformers import TruncatedSVDWrapper
X = pd.DataFrame(np.random.randn(100, 20), columns=[f"COL_{j}" for j in range(20)])
svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=True)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[8]:
SVD__0 | SVD__1 | |
---|---|---|
0 | 2.075079 | -0.858158 |
1 | 0.332307 | -0.970121 |
2 | 2.279417 | 1.340435 |
3 | -0.563442 | 0.551599 |
4 | -1.640313 | -1.569441 |
[9]:
svd = TruncatedSVDWrapper(n_components=2, drop_used_columns=False)
Xencoded = svd.fit_transform(X)
Xencoded.head()
[9]:
COL_0 | COL_1 | COL_2 | COL_3 | COL_4 | COL_5 | COL_6 | COL_7 | COL_8 | COL_9 | ... | COL_12 | COL_13 | COL_14 | COL_15 | COL_16 | COL_17 | COL_18 | COL_19 | SVD__0 | SVD__1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.858982 | -0.655989 | -0.028417 | -0.357398 | 0.569531 | 0.145816 | 0.552368 | 1.983438 | 1.092890 | -0.453562 | ... | 0.285189 | -0.604234 | -1.053623 | -0.291745 | -1.646335 | -0.215531 | 0.008500 | 1.100297 | 2.076530 | -0.854836 |
1 | -0.187936 | 0.041684 | 0.941944 | 1.898925 | 0.179125 | 0.636418 | 2.050173 | 0.229349 | -1.910368 | 0.702720 | ... | -0.533445 | -0.371779 | -0.401205 | 0.231492 | -1.043176 | 1.842388 | 0.329271 | 0.882017 | 0.346758 | -0.951460 |
2 | 1.097298 | -0.136058 | -0.323606 | -1.096158 | -0.009371 | -0.945267 | 1.455854 | -0.108160 | 1.141867 | -1.407562 | ... | 2.310153 | 2.414735 | -0.184708 | -1.486121 | -0.676003 | -0.686621 | -0.836830 | 0.972978 | 2.330389 | 1.407472 |
3 | 0.928934 | 0.269935 | -1.274605 | -0.287077 | 0.279328 | -0.320871 | 0.802277 | -0.713909 | -1.039250 | 1.227245 | ... | 0.020298 | 0.259960 | -0.885320 | 0.014820 | 0.268819 | -0.432435 | 1.254164 | 0.031453 | -0.572056 | 0.539412 |
4 | -0.714467 | 1.637883 | -0.451313 | 0.409956 | 0.565926 | 0.448906 | -0.128214 | -0.845320 | 0.433473 | -0.416148 | ... | 0.758863 | -1.702709 | -0.000005 | -0.293631 | -0.859405 | -0.167067 | 0.400996 | -1.095900 | -1.603850 | -1.532443 |
5 rows × 22 columns
Another example of the usage of ‘drop_used_columns’ and ‘drop_unused_columns’:

* in the first case (drop_used_columns=True): only the SVD columns are retrieved
* in the second case (drop_used_columns=False): the original columns AND the SVD columns are retrieved
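The two behaviours can be reproduced with plain numpy and pandas: project onto the top-2 right singular vectors (which is what a truncated SVD computes) and either return only the projection or concatenate it with the original frame. A sketch, not the TruncatedSVDWrapper implementation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.standard_normal((100, 20)),
                 columns=[f"COL_{j}" for j in range(20)])

# Top-2 SVD projection via numpy
U, s, Vt = np.linalg.svd(X.values, full_matrices=False)
svd_cols = pd.DataFrame(X.values @ Vt[:2].T, columns=["SVD__0", "SVD__1"])

dropped = svd_cols                       # like drop_used_columns=True
kept = pd.concat([X, svd_cols], axis=1)  # like drop_used_columns=False
print(dropped.shape, kept.shape)
```

`dropped` has only the 2 SVD columns; `kept` has the 20 original columns plus the 2 SVD columns, matching the (100, 2) and (100, 22) shapes in the cells above.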
[ ]: