Data preprocessing is an important step before fitting any model. It typically includes the following steps:
- Handling missing values
- Handling outliers
- Transforming nominal variables to dummy variables
- Converting ordinal data to numbers
- Transformation
In this post, with the help of an example, let us explore transformation:
- Standardization
- Normalization
- Log transformation
- How to transform data in Python
The example data contains four columns, and the last two are derived from the first two. Height (cm) and Height (m) measure the same thing; only the unit differs. The same is the case with Weight (g) and Weight (kg).
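The post's actual values are not reproduced here, so the snippet below is only a sketch of what such a data set could look like. The numbers and the column order are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical values; the original example data is not available here.
df = pd.DataFrame({
    "Height (cm)": [150.0, 160.0, 170.0, 180.0, 190.0],
    "Weight (g)": [50000.0, 58000.0, 66000.0, 80000.0, 95000.0],
})

# The last two columns are the first two expressed in different units.
df["Height (m)"] = df["Height (cm)"] / 100
df["Weight (kg)"] = df["Weight (g)"] / 1000

print(df)
```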
1. Standardization
This is the most commonly used transformation. The mean of a column is subtracted from every observation in that column, and the result is divided by the column's standard deviation.
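To make the arithmetic concrete, here is a minimal sketch of the same calculation done by hand with NumPy on a single column. The height values are hypothetical.

```python
import numpy as np

# Hypothetical heights in cm.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

mean = heights_cm.mean()
std = heights_cm.std()  # population standard deviation (ddof=0)

# Subtract the column mean, then divide by the column standard deviation.
z = (heights_cm - mean) / std
print(z)
```

Note that `np.std` uses the population formula (ddof=0) by default, which is also what sklearn's StandardScaler uses.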
Using the sklearn StandardScaler option, let us standardize the four columns of our example data set.
If we want to center by the mean without dividing by the standard deviation, or divide by the standard deviation without centering, we can set the relevant option (as shown in Out [5]). By default, both with_mean and with_std are set to True.
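As a rough illustration of these two options, the sketch below applies each one to a small array of hypothetical values (height in cm, weight in g).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: height in cm, weight in g.
X = np.array([
    [150.0, 50000.0],
    [160.0, 58000.0],
    [170.0, 66000.0],
    [180.0, 80000.0],
    [190.0, 95000.0],
])

# Center only: subtract the mean, keep the original spread.
centered_only = StandardScaler(with_std=False).fit_transform(X)

# Scale only: divide by the standard deviation, do not center.
scaled_only = StandardScaler(with_mean=False).fit_transform(X)

print(centered_only.mean(axis=0), centered_only.std(axis=0))
print(scaled_only.mean(axis=0), scaled_only.std(axis=0))
```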
If we check the mean and standard deviation of each standardized column, they are 0 and 1 respectively.
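The check can be sketched as follows, using StandardScaler with its default settings on a hypothetical four-column data set built like the one described earlier. The values are assumptions, not the post's data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the example data.
df = pd.DataFrame({
    "Height (cm)": [150.0, 160.0, 170.0, 180.0, 190.0],
    "Weight (g)": [50000.0, 58000.0, 66000.0, 80000.0, 95000.0],
})
df["Height (m)"] = df["Height (cm)"] / 100
df["Weight (kg)"] = df["Weight (g)"] / 1000

# fit_transform returns a NumPy array with one standardized column per input column.
scaled = StandardScaler().fit_transform(df)

means = scaled.mean(axis=0)
stds = scaled.std(axis=0)  # ddof=0, matching StandardScaler
print(np.round(means, 6), np.round(stds, 6))

# After standardization the unit no longer matters:
# Height (cm) and Height (m) end up as the same column.
same_heights = np.allclose(scaled[:, 0], scaled[:, 2])
print(same_heights)
```

If you check the result with pandas instead, keep in mind that `DataFrame.std()` uses ddof=1 by default, so it will report a value slightly above 1 for small samples.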
If you estimate regression coefficients on standardized features, the coefficients are on a common scale and can be compared directly. The larger the absolute value of a coefficient, the greater that feature's influence on the dependent variable.
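The sketch below illustrates the idea on synthetic data. The data-generating process and the feature scales are assumptions chosen so that the raw coefficients give a misleading ranking.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 1.0, n)      # feature on a small scale
x2 = rng.normal(0.0, 1000.0, n)   # feature on a large scale
y = 2.0 * x1 + 0.01 * x2 + rng.normal(0.0, 1.0, n)
X = np.column_stack([x1, x2])

# Coefficients on the raw features depend on each feature's unit.
raw_coef = LinearRegression().fit(X, y).coef_

# Coefficients on standardized features are expressed per standard deviation.
X_std = StandardScaler().fit_transform(X)
std_coef = LinearRegression().fit(X_std, y).coef_

print("raw:", raw_coef)
print("standardized:", std_coef)
```

On the raw features, x1 appears to matter far more because its coefficient is larger. After standardization, x2 has the larger coefficient, which reflects that a typical change in x2 moves y more than a typical change in x1.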
Standardization is necessary in case of:
- RBF kernel of Support Vector Machines
- L1 and L2 regularizers of linear models
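One common way to apply standardization before such a model is to put the scaler and the estimator in a single pipeline. The sketch below does this for an RBF-kernel SVM on synthetic data. The data and the choice of `make_pipeline` are assumptions for illustration, and the same pattern would work for a regularized linear model.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: one informative small-scale feature, one large-scale noise feature.
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(0.0, 1.0, n)
noisy_large = rng.normal(0.0, 1000.0, n)
X = np.column_stack([informative, noisy_large])
y = (informative > 0).astype(int)

# The scaler is fitted on the training data and applied before the SVM sees it.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)

predictions = model.predict(X)
print(model.score(X, y))
```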
If the data contains outliers, it is better to use RobustScaler or QuantileTransformer.
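As a small illustration, the sketch below compares StandardScaler with RobustScaler on a hypothetical column containing one extreme value. With default settings, RobustScaler centers on the median and scales by the interquartile range, so the outlier has little effect on how the other values are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical column with one obvious outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Mean and standard deviation are both pulled towards the outlier.
standard = StandardScaler().fit_transform(x)

# Median and interquartile range are barely affected by it.
robust = RobustScaler().fit_transform(x)

print(standard.ravel())
print(robust.ravel())
```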
2. Normalization
Unlike standardization, normalization is a per-sample (row-wise) transformation, not a per-feature (column-wise) one.
It rescales each sample to unit norm using the ‘l1’, ‘l2’, or ‘max’ norm.
With the l1 norm, the absolute values in each row sum to one (as shown in the pic below). With the l2 norm, the square root of the sum of the squares in each row is one. With the max norm, the largest absolute value in each row is one.
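The three norms can be tried out with sklearn's Normalizer, as in the sketch below. The input rows are hypothetical and chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples; each row is normalized independently.
X = np.array([
    [3.0, 4.0],
    [1.0, 1.0],
    [6.0, 8.0],
])

l1 = Normalizer(norm="l1").fit_transform(X)    # absolute values in a row sum to 1
l2 = Normalizer(norm="l2").fit_transform(X)    # Euclidean length of a row is 1
mx = Normalizer(norm="max").fit_transform(X)   # largest absolute value in a row is 1

print(l1)
print(l2)
print(mx)
```

The first and third rows point in the same direction, so they become identical after normalization. This is the sense in which normalization works per sample rather than per feature.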
3. Log transformation
Log transformation is common with time series data. It also helps to tame outliers when the data is skewed to the right. To apply a log transformation, the data must be strictly positive.
This is how we can log transform the data (natural log).
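A minimal sketch with NumPy, assuming a small DataFrame of hypothetical positive values:

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive data.
df = pd.DataFrame({
    "Height (cm)": [150.0, 160.0, 170.0, 180.0, 190.0],
    "Weight (g)": [50000.0, 58000.0, 66000.0, 80000.0, 95000.0],
})

# The natural log is undefined for zero and negative values, so check first.
if (df <= 0).any().any():
    raise ValueError("log transformation needs strictly positive data")

# np.log applies the natural log element-wise and returns a DataFrame.
logged = np.log(df)
print(logged)
```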
Another transformation is the Box-Cox transformation. Here too, the input data must be strictly positive.
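The post does not say which library it would use for Box-Cox, so the sketch below assumes `scipy.stats.boxcox`. The input values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, strictly positive data.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0])

# With no lambda given, boxcox estimates it by maximum likelihood
# and returns both the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(x)
print(fitted_lambda)

# With lambda fixed at 0, Box-Cox reduces to the natural log.
log_like = stats.boxcox(x, lmbda=0)
print(log_like)
```

sklearn's `PowerTransformer(method="box-cox")` is another option if you want the transformation to fit into a preprocessing pipeline.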
Summary
In this post, we have explored:
- Standardization
- Normalization
- Log transformation
- How to perform these transformations in Python