Understanding the difference between Label Encoding and One Hot Encoding
If you are interested in machine learning, then whether you are watching a video, browsing a repository, or reading an article, there is one term you will definitely come across: encoding!
Encoding is the process of converting data from one form to another. You don't change any information in the data; only its representation changes. In machine learning, encoding means converting categorical features into numerical ones. Let's get started with Label Encoding.
1. Label Encoding
Before we dive into Label Encoding, there is one concept we should cover: the ordinal scale.
An ordinal scale describes a variable whose values come from an ordered set. Let's say the Feedback column is collected using a five-point scale. The numerical code 1 is assigned to Poor, 2 to Fair, 3 to Good, 4 to Very Good, and 5 to Excellent. We can see that 5 is better than 4 and 4 is better than 3, although an ordinal scale only tells us the order, not how much better one level is than another.
We can tell that Excellent is the best, but your algorithm cannot, so you have to tell it that Excellent is better than Poor. You do this by giving Excellent a larger number. In a nutshell, if you have an ordinal column, you should do this conversion. But don't forget to scale the column after this preprocessing step.
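You can do this conversion manually by mapping each category to its code. You can also use scikit-learn's LabelEncoder, but note that it assigns codes in sorted (alphabetical) order of the labels and is intended for target labels. For an ordinal feature, scikit-learn's OrdinalEncoder with an explicit category order keeps the ranking you want.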
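As a minimal sketch of the manual approach, the snippet below maps the five-point Feedback labels described earlier to their codes and then rescales them. The names FEEDBACK_ORDER, encode_feedback and min_max_scale are made up for this illustration, the sample values are invented, and min-max scaling is just one possible choice of scaling.

```python
# Ordinal codes for the Feedback column, as described in the text.
# FEEDBACK_ORDER is a hypothetical name used only for this sketch.
FEEDBACK_ORDER = {"Poor": 1, "Fair": 2, "Good": 3, "Very Good": 4, "Excellent": 5}


def encode_feedback(values):
    """Replace each feedback label with its ordinal code."""
    return [FEEDBACK_ORDER[v] for v in values]


def min_max_scale(codes):
    """Rescale codes to the [0, 1] range (one common way to scale)."""
    lo, hi = min(codes), max(codes)
    if hi == lo:
        # All values are identical, so there is nothing to spread out.
        return [0.0 for _ in codes]
    return [(c - lo) / (hi - lo) for c in codes]


# Invented sample data for illustration.
feedback = ["Good", "Poor", "Excellent", "Fair"]
codes = encode_feedback(feedback)
scaled = min_max_scale(codes)

print(codes)   # [3, 1, 5, 2]
print(scaled)  # [0.5, 0.0, 1.0, 0.25]
```

Because the mapping is written out by hand, the order Poor < Fair < Good < Very Good < Excellent is preserved exactly as intended.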
2. One Hot Encoding
As you saw in the first example, the values of some categorical variables have a natural order, so one is better than another in some sense. But for other categorical variables, such as nationality (British, Canadian, Indian), no value is more important or better than another. So how should we process these types of columns?
To process these types of columns, we use one hot encoding. It creates one new binary column per category, so the nationality column becomes 3 columns, and each row has a 1 in the column for its nationality and 0 in the others. This way the model treats all categories equally.
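To make the idea concrete, here is a small plain-Python sketch of one hot encoding for the nationality column. The function one_hot_encode, its prefix parameter, and the sample rows are assumptions made for this illustration, not part of any library.

```python
def one_hot_encode(values, prefix):
    """Turn a list of category labels into one 0/1 column per category.

    Returns the sorted category list and one dict per input row.
    """
    categories = sorted(set(values))
    rows = [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]
    return categories, rows


# Invented sample data for illustration.
nationalities = ["british", "canadian", "indian", "canadian"]
categories, rows = one_hot_encode(nationalities, prefix="nationality")

print(categories)  # ['british', 'canadian', 'indian']
for row in rows:
    print(row)
```

Each row ends up with exactly one 1, so no nationality gets a larger number than another. In practice you would probably reach for a library helper such as pandas.get_dummies or scikit-learn's OneHotEncoder instead of writing this by hand.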