Useful Notes and Links

Reynier Cruz-Torres, PhD

Dealing with imbalanced datasets

Stratifying with train_test_split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y)
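A minimal sketch on toy data (the 90/10 class split and random_state are hypothetical) showing that stratify=y preserves the class ratio in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (hypothetical): 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stratification keeps the 10% positive rate in both splits
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)  # 0.1 0.1
```

Without stratify, a random split of a heavily imbalanced set can leave the test set with very few (or zero) minority samples.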

Undersampling majority class

!pip install imbalanced-learn  # installed as imbalanced-learn, imported as imblearn
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)

Oversampling minority class

Can use SMOTE (Synthetic Minority Over-sampling Technique, available in imblearn), which involves the following steps: pick a minority-class sample, find its k nearest minority-class neighbors, and generate synthetic samples by interpolating between the sample and randomly chosen neighbors until the classes are balanced.

Combination of oversampling and undersampling

With SMOTETomek we can first oversample the minority class with SMOTE and then undersample by removing Tomek links, i.e. pairs of nearest-neighbor samples from opposite classes that sit close to the class boundary.

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

## Oversampling
s = SMOTE()
X_res, y_res = s.fit_resample(X_train, y_train)

## Undersampling
tk = TomekLinks()
X_res2, y_res2 = tk.fit_resample(X_train, y_train)

## Combination
smt = SMOTETomek()
X_res3, y_res3 = smt.fit_resample(X_train, y_train)

Weighted models

from sklearn.linear_model import LogisticRegression

# define class weights: errors on the minority class (1) cost 99x more
w = {0: 1, 1: 99}

# define model
lg2 = LogisticRegression(random_state=13, class_weight=w)
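A runnable sketch (the toy 90/10 data is hypothetical); besides a manual dict, sklearn can derive the weights automatically with class_weight='balanced':

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data (hypothetical): 90 negatives, 10 positives
rng = np.random.RandomState(13)
X = np.vstack([rng.normal(0, 1, size=(90, 2)),
               rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Manual weights: errors on class 1 cost 99x more in the loss
w = {0: 1, 1: 99}
lg2 = LogisticRegression(random_state=13, class_weight=w)
lg2.fit(X, y)

# 'balanced' sets each weight to n_samples / (n_classes * class_count)
lg_bal = LogisticRegression(random_state=13, class_weight="balanced")
lg_bal.fit(X, y)
```

Weighting changes the loss instead of the data, so no samples are discarded or synthesized; it is often the simplest first thing to try.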