Dealing with Imbalanced Datasets

In this tutorial, we tackle the problem of imbalanced datasets: how to spot them easily, and how to handle them with the SMOTE technique.

Let’s say you just graduated from college and you are looking for a job. One of the companies you applied to sends you a case study that includes a classification task. You do the preprocessing, implement various techniques, find the right model, and so on, then fit the model and evaluate it on the test set to see how it handles unseen data. You get 93.52% accuracy. You suddenly become so happy, thinking your model did a great job. Then you go ahead and print the confusion matrix, and the output looks something like this:

The model really did a tremendous job of finding the true positives (upper left), but it was only able to catch 1 true negative (the rest were classified wrongly). But how did this happen? (Yep, your inner voice is right: it’s because you have an imbalanced dataset.)
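You can reproduce this trap with a minimal, self-contained sketch. The dataset and model below are illustrative (synthetic data and a plain logistic regression, not the original case study): accuracy looks great while the confusion matrix tells another story.

```python
# Sketch: high accuracy can hide poor minority-class performance.
# Synthetic data and model choice are assumptions for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# 95% of samples in one class, 5% in the other
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))    # high, but misleading
print(confusion_matrix(y_test, y_pred))  # minority class is barely caught
```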

1. Metrics

So the first mistake you made was checking only the accuracy metric. Accuracy summarizes the performance of a classification model as the number of correct predictions divided by the total number of predictions.

The most common mistake beginners make with imbalanced classification is relying on the accuracy metric alone. Achieving 90 percent, or even 99 percent, classification accuracy may not mean the model is any good.

So in this case, what you should do is check other classification metrics, such as precision, recall, and the F1 score.

Once you print these scores, you will most likely see that the recall and F1 score are very small.
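As a small illustration (with made-up predictions, not the post’s actual model), scikit-learn’s `classification_report` exposes the low recall and F1 that accuracy hides:

```python
# Sketch: precision, recall and F1 tell the story accuracy hides.
# y_true / y_pred are hypothetical labels for an imbalanced problem.
from sklearn.metrics import classification_report

y_true = [0] * 95 + [1] * 5        # 95 majority, 5 minority samples
y_pred = [0] * 95 + [1] + [0] * 4  # model catches only 1 minority sample

# Accuracy is 96%, yet recall for the minority class is only 0.2
print(classification_report(y_true, y_pred, digits=3))
```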

2. SMOTE (Synthetic Minority Oversampling Technique)

The reason I wrote this blog is to show you the SMOTE technique. As you can tell from its name, it is an oversampling technique that works on the minority class.

Plain oversampling is used to increase the amount of data by adding copies (or slightly modified copies) of already existing examples, but it does not increase the variety of the training examples. It looks like this:

SMOTE, on the other hand, not only increases the size of the training data set, it also increases its variety.

SMOTE creates new (artificial) training examples based on the original training examples. For instance, if it sees two examples (of the same class) near each other, it creates a third artificial one, bang in the middle of the original two.
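That idea can be sketched in a few lines. This is a simplified, assumed illustration, not the full algorithm: real SMOTE picks one of a sample’s nearest same-class neighbors and a random position along the segment between them, rather than always the exact middle.

```python
# Toy sketch of SMOTE-style interpolation between two minority samples.
# The points a and b are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, 2.0])  # two minority-class examples
b = np.array([3.0, 6.0])

t = rng.uniform(0, 1)        # random fraction along the segment a -> b
synthetic = a + t * (b - a)  # new artificial example between a and b
print(synthetic)
```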

SMOTE, when done right, is preferable to plain random oversampling. One has to be careful, however, that the newly created examples are ‘legal’. For example, if the only legal values for an input are table and chair, there is no legal in-between value.

That’s how easy SMOTE is to implement, and how important it is.

If you have further questions, do not hesitate to ask in the comments.

Also, for anyone who wants to see my imbalanced dataset repo, the link is: