Useful Notes and Links

Reynier Cruz-Torres, PhD

Dealing with mbalanced datasets

Loading more data
Changing perfomance metrics
Resampling (undersampling or oversampling)
Changing the algorithm (e.g. using penalized models)

Stratifying with test_train_split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, 
                                                    stratify=y)

Undersampling majority class

!pip install imblearn
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
x_resampled,y_resampled = rus.fit_resample(x,y)

Oversampling minority class

Can use a package like SMOTE, which involves the following steps:

Find difference between a point and its nearest neighbor
Multiply difference by random number between 0 and 1
Add this difference to original point to create a new synthetic example
Continue with your next-nearest neighbor until you reach the desired number of iterations

Combination of oversampling and undersampling

With SMOTETomek we can first oversample minority class and then undersample to remove samples close to the class boundary.

from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

## Oversampling
s= SMOTE()
X_res, y_res= s.fit_resample(X_train, y_train)


## Undersampling
tk= TomekLinks()
X_res2, y_res2= tk.fit_resample(X_train, y_train)

## Combination
smt= SMOTETomek()
X_res1, y_res1= smt.fit_resample(X_train, y_train)

Weighted models

# define class weights
w = {0:1, 1:99}

# define model
lg2 = LogisticRegression(random_state=13, class_weight=w)