Updated: Jun 28
In this blog, we will learn when we should scale features. I usually divide Data Science projects into different pipelines: Exploratory Data Analysis, Feature Engineering, Feature Selection, and Model Evaluation. Feature scaling comes under the feature engineering pipeline, and it is a monotonic transformation (meaning it preserves the order of the values within a feature).
To answer our question 'Do I need to scale my features?', we first need to understand the importance of feature scaling in the Data Science pipeline.
As a rule of thumb, we should always use feature scaling with algorithms that consider the distance between points.
A dataset contains different columns, and the columns can have different units: a weight column can be in kg, while a height column can be in cm. Furthermore, some columns can have values ranging from 10000 to 100000, while others range from 0 to 10. When a model is trained and evaluated, some features can take precedence over others based purely on the magnitude of their values.
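To make the effect of differing ranges concrete, here is a minimal sketch using only the Python standard library. The rows, the column meanings (an income-like column and a rating-like column), and the helper names `euclidean` and `min_max_scale` are all hypothetical, chosen only to mirror the two ranges mentioned above.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every column of `rows` to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical rows: (value in the 10000..100000 range, value in the 0..10 range)
rows = [
    [50000.0, 1.0],    # query point
    [51000.0, 9.0],    # point A: similar large-range value, very different small-range value
    [60000.0, 1.0],    # point B: different large-range value, identical small-range value
    [10000.0, 0.0],    # rows that pin down the column minimums
    [100000.0, 10.0],  # rows that pin down the column maximums
]

raw_to_a = euclidean(rows[0], rows[1])
raw_to_b = euclidean(rows[0], rows[2])

scaled = min_max_scale(rows)
scaled_to_a = euclidean(scaled[0], scaled[1])
scaled_to_b = euclidean(scaled[0], scaled[2])

print(f"raw:    to A = {raw_to_a:.2f}, to B = {raw_to_b:.2f}")
print(f"scaled: to A = {scaled_to_a:.2f}, to B = {scaled_to_b:.2f}")
```

On the raw values, point A looks like the nearest neighbour because the large-range column drowns out the small-range one. After min-max scaling, point B becomes the nearest, because both columns now contribute on the same footing.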
Importance of Feature Scaling
It brings all features to the same level, i.e., the values of all features vary within the same range.
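It helps convert a normally (Gaussian) distributed feature to the Standard Normal Distribution by bringing the feature's mean to 0 and its standard deviation to 1. Standardization does not change the shape of the distribution, so this only holds if the feature was roughly normal to begin with. This transformation makes it easy to calculate probabilities using the Z Score Table.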
It improves the performance of algorithms that use Gradient Descent by helping them converge to an optimal value more quickly. When all features share a similar, small range, we can use a bigger learning rate, and a bigger learning rate helps in quick convergence.
It helps to understand the significance of the intercept in regression better. In Y = mX + c, c is the intercept, which can be defined as the value of Y (Dependent Variable) when X (Independent Variable/Predictor) is 0. In a scenario where X is weight (or some other value that cannot be zero), a weight of zero is meaningless, so the intercept is hard to interpret. Centering the predictor variable shifts the intercept so that it becomes the predicted value of Y when X is at its mean, which is a meaningful quantity.
It prevents high-range features from taking precedence over low-range features.
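Two of the points above, standardizing to mean 0 and standard deviation 1, and the effect of centering on the intercept, can be checked with a small sketch. The weight and height numbers are made up for illustration, and `standardize` and `fit_line` are hypothetical helper names; the fit is plain one-predictor least squares, not any particular library's API.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + c with a single predictor."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c

# Hypothetical data: weight (kg) as the predictor X, height (cm) as Y
weights_kg = [50.0, 60.0, 70.0, 80.0, 90.0]
heights_cm = [155.0, 162.0, 171.0, 178.0, 184.0]

# After standardization the feature has mean 0 and standard deviation 1
z_weights = standardize(weights_kg)
print(statistics.fmean(z_weights), statistics.pstdev(z_weights))

# Raw fit: the intercept is the predicted height at a weight of 0 kg
m_raw, c_raw = fit_line(weights_kg, heights_cm)

# Centered fit: same slope, but the intercept is now the predicted
# height at the average weight, which equals the mean height
mean_weight = statistics.fmean(weights_kg)
centered_weights = [w - mean_weight for w in weights_kg]
m_centered, c_centered = fit_line(centered_weights, heights_cm)

print(f"raw:      slope = {m_raw:.3f}, intercept = {c_raw:.3f}")
print(f"centered: slope = {m_centered:.3f}, intercept = {c_centered:.3f}")
```

The slope is unchanged by centering; only the meaning of the intercept changes.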
When do we perform scaling?
During Principal Component Analysis (PCA): PCA looks for the directions of maximum variance, so unscaled features with large ranges dominate the components.
During Gradient Descent to increase the convergence rate.
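When the ML algorithm takes distance into consideration, e.g., KNN and KMeans. Logistic Regression is not distance-based itself, but it benefits from scaling when it is trained with Gradient Descent or regularization.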
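The gradient descent case can be illustrated with a rough sketch in plain Python. Everything here is an assumption made for the demonstration: the four data rows, the learning rates, the tolerance, and the helper names `standardize` and `gradient_descent_steps`. It runs batch gradient descent for a linear regression and reports how many iterations were needed.

```python
import math
import statistics

def standardize(column):
    """Z-score scaling of a single column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def gradient_descent_steps(features, targets, lr, max_iters=5000, tol=1e-6):
    """Batch gradient descent for linear regression with an intercept.

    Returns the iteration at which the gradient norm dropped below `tol`,
    or `max_iters` if that never happened.
    """
    n = len(targets)
    rows = [[1.0, *row] for row in features]  # prepend the intercept column
    w = [0.0] * len(rows[0])
    for step in range(1, max_iters + 1):
        errors = [sum(wi * xi for wi, xi in zip(w, row)) - y
                  for row, y in zip(rows, targets)]
        grad = [sum(e * row[j] for e, row in zip(errors, rows)) / n
                for j in range(len(w))]
        if math.sqrt(sum(g * g for g in grad)) < tol:
            return step
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return max_iters

# Hypothetical columns with very different ranges
large = [10000.0, 40000.0, 70000.0, 100000.0]
small = [0.0, 10.0, 10.0, 0.0]
targets = [4.0, 27.0, 30.0, 13.0]

raw_features = list(zip(large, small))
scaled_features = list(zip(standardize(large), standardize(small)))

# The large column forces a tiny learning rate on the raw data;
# anything much bigger makes the updates blow up.
steps_raw = gradient_descent_steps(raw_features, targets, lr=1e-10)
steps_scaled = gradient_descent_steps(scaled_features, targets, lr=0.5)

print(f"raw features:    {steps_raw} iterations")
print(f"scaled features: {steps_scaled} iterations")
```

With the raw columns the run exhausts its iteration budget without converging, while the standardized columns tolerate a much larger learning rate and converge in a small number of steps.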
When do we not perform scaling?
In Linear Discriminant Analysis (LDA) and Naive Bayes: both are already equipped to handle features on different scales and give weights to different features accordingly.
In Tree-Based Algorithms: they split on feature thresholds rather than distances, so they are not sensitive to the scale or variance of features, and outliers in feature values have little impact on the result and performance.
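The tree-based case can be sketched with a single threshold split, the building block of a decision tree. This is a simplified illustration, not a full tree implementation: the data, the 0/1 labels, and the helper names `misclassified`, `best_split_partition`, and `min_max_scale` are hypothetical. Because scaling keeps the order of the values, the best split sends exactly the same rows to each side before and after scaling.

```python
def misclassified(indices, labels):
    """Errors made when a side predicts its majority class (labels are 0 or 1)."""
    ones = sum(labels[i] for i in indices)
    return min(ones, len(indices) - ones)

def best_split_partition(values, labels):
    """Return the set of row indices sent left by the best threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        if values[left[-1]] == values[right[0]]:
            continue  # no threshold can separate equal values
        errors = misclassified(left, labels) + misclassified(right, labels)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

def min_max_scale(values):
    """Rescale a single column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature with a large range and a binary label
feature = [12000.0, 95000.0, 30000.0, 70000.0, 55000.0, 20000.0]
labels = [0, 1, 0, 1, 1, 0]

left_raw = best_split_partition(feature, labels)
left_scaled = best_split_partition(min_max_scale(feature), labels)

print(sorted(left_raw), sorted(left_scaled))
```

The threshold value itself changes with scaling, but the partition of the rows, and therefore the predictions, stays the same.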
I hope this blog helps you understand the importance of feature scaling, when to use it, and when not to use it. In my next blog, I will discuss different methods of feature scaling. Comments and feedback are most welcome. Please follow me on LinkedIn and GitHub. Thanks for reading. Happy Learning 😊