Data Science Simplified: Data Preprocessing: Transformation

Data preprocessing is an important step before fitting any model. The following steps are performed under data preprocessing:

Handling missing values
Handling outliers
Transforming nominal variables to dummy variables
Converting ordinal data to numbers
Transformation

In this post, with the help of an example, let us explore transformation:

Standardization
Normalization
Log transformation
How to transform data in Python

Example data

The example data contains four columns. The last two columns are derived from the first two: Height (cm) and Height (m) measure the same thing, and only the unit differs. The same is true of Weight (g) and Weight (kg).
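Since the original table is not reproduced here, the following is a minimal sketch of how such a data set could be built with pandas. The values and the column order are made up for illustration and are not the author's data.

```python
import pandas as pd

# Hypothetical measurements; the numbers are invented for illustration only.
df = pd.DataFrame({
    "Height (cm)": [150.0, 160.0, 170.0, 180.0, 190.0],
    "Weight (g)": [50000.0, 58000.0, 65000.0, 80000.0, 92000.0],
})

# The last two columns are derived from the first two by changing the unit.
df["Height (m)"] = df["Height (cm)"] / 100
df["Weight (kg)"] = df["Weight (g)"] / 1000

print(df)
```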

1. Standardization

This is the most commonly used transformation. The mean of a column is subtracted from each observation in that column, and the result is divided by the standard deviation of that column.

Using the sklearn StandardScaler option, let us standardize the four columns of our example data set.
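As a rough sketch of this step, the code below applies StandardScaler to a small hypothetical data frame (the values are assumptions, not the author's data) and compares the result with the formula computed by hand. Note that StandardScaler uses the population standard deviation (ddof=0).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical example data (invented values).
df = pd.DataFrame({
    "Height (cm)": [150.0, 160.0, 170.0, 180.0, 190.0],
    "Weight (g)": [50000.0, 58000.0, 65000.0, 80000.0, 92000.0],
})
df["Height (m)"] = df["Height (cm)"] / 100
df["Weight (kg)"] = df["Weight (g)"] / 1000

# fit_transform learns each column's mean and standard deviation,
# then applies (x - mean) / std column by column.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# The same computation by hand, using the population standard deviation.
manual = (df - df.mean()) / df.std(ddof=0)

print(scaled)
print(np.allclose(scaled, manual))
```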

If we want to center using the mean without scaling by the standard deviation, or scale by the standard deviation without centering, we can set the relevant option to False (as shown in Out [5]). By default, both with_mean and with_std are set to True.
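The two options can be sketched as follows, using a small hypothetical array (the values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two columns on different scales.
X = np.array([
    [150.0, 50.0],
    [160.0, 58.0],
    [170.0, 65.0],
    [180.0, 80.0],
    [190.0, 92.0],
])

# Center only: subtract the mean, keep the original spread.
centered_only = StandardScaler(with_std=False).fit_transform(X)

# Scale only: divide by the standard deviation, do not subtract the mean.
scaled_only = StandardScaler(with_mean=False).fit_transform(X)

print(centered_only.mean(axis=0), centered_only.std(axis=0))
print(scaled_only.mean(axis=0), scaled_only.std(axis=0))
```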

If we check the mean and standard deviation of each standardized column, they are 0 and 1 respectively.
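A quick way to run this check is sketched below, again on hypothetical values. It also illustrates a side effect worth noticing: once standardized, a column measured in cm and the same column measured in m become identical, because the unit cancels out.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical example data (invented values).
df = pd.DataFrame({
    "Height (cm)": [150.0, 160.0, 170.0, 180.0, 190.0],
    "Weight (g)": [50000.0, 58000.0, 65000.0, 80000.0, 92000.0],
})
df["Height (m)"] = df["Height (cm)"] / 100
df["Weight (kg)"] = df["Weight (g)"] / 1000

scaled = StandardScaler().fit_transform(df)

# Per-column mean and standard deviation after standardization.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means.round(6))
print(col_stds.round(6))

# Columns 0 and 2 are height in cm and in m; columns 1 and 3 are weight in g and in kg.
same_height = np.allclose(scaled[:, 0], scaled[:, 2])
same_weight = np.allclose(scaled[:, 1], scaled[:, 3])
print(same_height, same_weight)
```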

If you estimate regression coefficients using standardized features, you can compare the coefficients directly. The larger the absolute value of a coefficient, the greater its predictive power or influence on the dependent variable.
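To illustrate the point about comparing coefficients, here is a small sketch on synthetic data. The data-generating process, the variable names, and the random seed are all assumptions chosen for the illustration. One feature is on a small scale and the other on a large scale, so the raw coefficients are not comparable, while the standardized ones are.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): x_small has unit scale, x_large is about 1000 times wider.
rng = np.random.default_rng(0)
n = 200
x_small = rng.normal(0.0, 1.0, n)
x_large = rng.normal(0.0, 1000.0, n)
X = np.column_stack([x_small, x_large])
y = 2.0 * x_small + 0.01 * x_large + rng.normal(0.0, 0.1, n)

# Raw coefficients depend on the units of each feature.
raw_coef = LinearRegression().fit(X, y).coef_

# Standardized coefficients measure the change in y per one standard deviation.
X_std = StandardScaler().fit_transform(X)
std_coef = LinearRegression().fit(X_std, y).coef_

print("raw:", raw_coef)
print("standardized:", std_coef)
```

On the raw scale the first feature appears to matter more, but after standardization the second feature has the larger coefficient, because a one standard deviation change in it moves y further.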

Standardization is especially important in the case of:

RBF kernel of Support Vector Machines
L1 and L2 regularizers of linear models
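For models like these, one common pattern is to put the scaler and the estimator in a single pipeline, so that the scaling parameters are learned from the training data only. The sketch below uses an RBF-kernel SVC on synthetic data. The data, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption): one informative feature on a small scale,
# one pure-noise feature on a much larger scale.
rng = np.random.default_rng(0)
n = 100
y = np.repeat([0, 1], n // 2)
informative = np.where(y == 0, -1.0, 1.0) + rng.normal(0.0, 0.3, n)
noisy_large = rng.normal(0.0, 1000.0, n)
X = np.column_stack([informative, noisy_large])

# The scaler is fitted inside the pipeline, before the RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)

train_accuracy = model.score(X, y)
print(train_accuracy)
```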

If there are outliers, it is better to use RobustScaler or QuantileTransformer.
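The difference can be sketched on a tiny hypothetical column with one outlier (the values are made up). RobustScaler centers on the median and scales by the interquartile range, and QuantileTransformer maps values to their quantiles, so neither is driven by the extreme value in the way the mean and standard deviation are.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer, RobustScaler, StandardScaler

# Hypothetical column with one outlier (100).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Mean and standard deviation are both pulled towards the outlier.
standard = StandardScaler().fit_transform(x)

# Median (3) and interquartile range (4 - 2 = 2) ignore the outlier.
robust = RobustScaler().fit_transform(x)

# Uniform output by default; n_quantiles must not exceed the sample count.
quantile = QuantileTransformer(n_quantiles=5).fit_transform(x)

print(standard.ravel())
print(robust.ravel())
print(quantile.ravel())
```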

2. Normalization

Unlike standardization, normalization is a per-sample (row-wise) transformation, not a per-feature (column-wise) one.

This transforms each sample to unit norm using the ‘l1’, ‘l2’, or ‘max’ norm.

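In the case of the l1 norm, the sum of the absolute values in each row will be one (as shown in the pic below). In the case of the l2 norm, the square root of the sum of the squares of the values in each row will be one. In the case of the max norm, the largest absolute value in each row will be one.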
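A minimal sketch of the three norms, assuming the sklearn Normalizer class and a small hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: each row is one sample.
X = np.array([
    [3.0, 4.0],
    [1.0, 1.0],
    [6.0, 8.0],
])

# Each row is divided by its own norm, independently of the other rows.
l1 = Normalizer(norm="l1").fit_transform(X)
l2 = Normalizer(norm="l2").fit_transform(X)
mx = Normalizer(norm="max").fit_transform(X)

print(np.abs(l1).sum(axis=1))          # row-wise l1 norms
print(np.sqrt((l2 ** 2).sum(axis=1)))  # row-wise l2 norms
print(np.abs(mx).max(axis=1))          # row-wise max norms
```

Rows that are multiples of each other, such as the first and third rows here, end up identical after normalization, since only the direction of each row is kept.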

3. Log transformation

Log transformation is more common in time series data. It also helps to reduce the influence of outliers when the data are skewed to the right. To apply a log transformation, the data need to be strictly positive.

Log transforming the right skewed data

This is how we can log transform the data (natural log).
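As a small sketch of this step, the code below takes the natural log of a hypothetical right-skewed series (the values are made up) and compares the skewness before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: each one is ten times the previous.
values = pd.Series([1.0, 10.0, 100.0, 1000.0, 10000.0])

# The log is only defined for strictly positive data, so check first.
if (values <= 0).any():
    raise ValueError("log transformation needs strictly positive data")

# np.log is the natural logarithm.
logged = np.log(values)

print(values.skew())
print(logged.skew())
```

After the transformation the values are evenly spaced, so the strong right skew of the raw series disappears.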

Another transformation is the Box-Cox transformation. In this case too, the input data should be strictly positive.
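One way to apply Box-Cox in Python is sklearn's PowerTransformer with method="box-cox". The sketch below is not necessarily what the original post used, and the synthetic log-normal data and the seed are assumptions. By default this transformer also standardizes its output.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed, strictly positive data (assumption).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# Box-Cox estimates one lambda per column by maximum likelihood.
pt = PowerTransformer(method="box-cox")
x_bc = pt.fit_transform(x)
print(pt.lambdas_)

# Data containing zero or negative values is rejected.
try:
    PowerTransformer(method="box-cox").fit(np.array([[0.0], [1.0], [2.0]]))
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

For log-normal data the estimated lambda should come out near zero, which corresponds to a plain log transformation.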

Summary

In this post, we have explored:

Standardization
Normalization
Log transformation
And how to perform these transformations in Python

If you have any questions or suggestions, feel free to share. I will be very happy to interact with you.
