In this post, we will understand how to perform Feature Selection using sklearn.
- Dropping features which have low variance
  - Dropping features with zero variance
  - Dropping features with variance below a threshold
- Univariate feature selection
- Model based feature selection
- Feature Selection using pipeline
Let us start to explore these concepts one by one.
1) Dropping features which have low variance
If a feature has low variance, it may not contribute much to the model. For example, in the following dataset, the features "Offer" and "Online payment" have zero variance, meaning all their values are the same. These two features can be dropped without any negative impact on the model to be built.
- A) Dropping features with zero variance
If a feature has the same value across all observations, we can remove that variable. In the following example, two features can be removed.
[Image: Dataset with two features having zero variance]
By default, the variance threshold is zero in the VarianceThreshold option in sklearn.feature_selection.
[Image: Default variance threshold is zero]
Using the following code, we can retain only the variables with non-zero variance.
[Image: The VarianceThreshold option drops the two features with zero variance]
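A minimal sketch of this step, assuming a small made-up DataFrame whose column names follow the ones mentioned above (the values are illustrative, not the original data):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Illustrative data: "Offer" and "Online payment" never vary
df = pd.DataFrame({
    "Age":            [25, 34, 41, 29, 52],
    "Income":         [40, 65, 80, 50, 90],
    "Offer":          [1, 1, 1, 1, 1],
    "Online payment": [0, 0, 0, 0, 0],
})

# The default threshold (0.0) removes only zero-variance features
selector = VarianceThreshold()            # same as VarianceThreshold(threshold=0.0)
reduced = selector.fit_transform(df)

print(df.columns[selector.get_support()]) # Index(['Age', 'Income'], dtype='object')
print(reduced.shape)                      # (5, 2)
```

Note that fit_transform returns a NumPy array, so the names of the retained columns have to be recovered through get_support().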
- B) Dropping features with variance below a threshold
In the following example, the dataset contains five features, of which two, "Referred" and "Repeat", do not vary much. Since a feature that takes only the values 0 and 1 is a Bernoulli random variable, its variance is given by the formula p(1-p).
[Image: Dataset with two features (Referred and Repeat) having low variance]
Suppose we want to drop any feature in which a single value (either only 0s or only 1s) appears in more than 80% of the samples. Such a feature has a variance below 0.8*(1-0.8) = 0.16.
We can specify VarianceThreshold(threshold=(.8 * (1 - .8))) or, equivalently, VarianceThreshold(threshold=0.16).
[Image: Features with either only 1s or only 0s in more than 80% of the samples are dropped]
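A sketch of the same step, again with made-up data; the feature names follow the post, and the values are chosen so that "Referred" and "Repeat" are each more than 80% constant:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Illustrative data: "Referred" and "Repeat" take one value in 90% of the rows
df = pd.DataFrame({
    "Age":      [25, 34, 41, 29, 52, 46, 31, 38, 27, 44],
    "Income":   [40, 65, 80, 50, 90, 70, 45, 60, 55, 85],
    "Visits":   [3, 7, 2, 9, 5, 1, 8, 4, 6, 2],
    "Referred": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],   # 90% zeros
    "Repeat":   [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],   # 90% ones
})

# Drop Bernoulli features whose variance p(1-p) falls below 0.8 * (1 - 0.8) = 0.16
selector = VarianceThreshold(threshold=0.8 * (1 - 0.8))
reduced = selector.fit_transform(df)

print(df.columns[selector.get_support()]) # Index(['Age', 'Income', 'Visits'], dtype='object')
```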
2) Univariate feature selection
In this type of selection method, a score is computed to capture the importance of each feature. The score can be calculated using different measures such as chi-square, F value, mutual information, etc.
Some of the options available for univariate feature selection in sklearn are SelectKBest, SelectPercentile and GenericUnivariateSelect.
Let us use the example provided by sklearn to understand how univariate feature selection works.
In the following example, the original iris dataset contains four predictors.
[Image: The original dataset contains four predictors]
We want to retain only three predictors based on the chi-square value. The following code selects the top three features.
[Image: The best three predictors are retained based on chi-square value]
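A minimal sketch of that code, using SelectKBest with the chi2 scoring function on the iris data (k=3 matches the three predictors we want to keep):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
print(X.shape)                      # (150, 4)

# Keep the 3 predictors with the highest chi-square scores
X_new = SelectKBest(chi2, k=3).fit_transform(X, y)
print(X_new.shape)                  # (150, 3)
```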
For regression problems, the scoring functions are f_regression and mutual_info_regression. For classification problems, chi2, f_classif and mutual_info_classif are used.
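As a minimal sketch of the regression case, the following uses f_regression with SelectKBest; the diabetes dataset and k=5 are illustrative choices, not part of the original example:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True)
print(X.shape)                      # (442, 10)

# Keep the 5 predictors with the highest F-statistics against the regression target
X_new = SelectKBest(f_regression, k=5).fit_transform(X, y)
print(X_new.shape)                  # (442, 5)
```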
3) Model based feature selection
- Recursive feature elimination (RFE)
- L1-based selection
- Tree-based selection
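The worked example below uses tree-based selection. For the first two approaches, here are minimal sketches along the lines of the sklearn documentation; the estimators and parameter values (LogisticRegression, LinearSVC with C=0.01) are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Recursive feature elimination: repeatedly fit the model and drop the weakest feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
print(X_rfe.shape)                  # (150, 2)

# L1-based selection: an L1-penalised model zeroes out weak coefficients
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=5000).fit(X, y)
X_l1 = SelectFromModel(lsvc, prefit=True).transform(X)
print(X_l1.shape)                   # e.g. (150, 3)
```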
In the following example, let us select the best features in the iris dataset using a model. We will use a random forest model to estimate feature importance.
The original dataset has four predictors.
[Image: Using a random forest model to select features]
We can get the feature importances from the random forest classifier.
[Image: Estimating feature importance]
Using these feature importances, the SelectFromModel option retains only two features.
[Image: Out of four, two features have been retained]
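A minimal sketch of the whole example; SelectFromModel's default threshold for a fitted forest is the mean feature importance, which is why only two features survive in this sketch:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
print(X.shape)                      # (150, 4)

# Fit a random forest and inspect the feature importances
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.feature_importances_)     # one importance value per predictor

# Keep only features whose importance exceeds the default threshold (the mean importance)
model = SelectFromModel(clf, prefit=True)
X_selected = model.transform(X)
print(X_selected.shape)             # (150, 2): the two petal features exceed the mean importance
```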
You can see both the original dataset and the feature selected dataset below.
[Image: The last two features have been retained]
4) Feature Selection using pipeline
Using the pipeline option, we can combine the step that selects the features (step 1) with the step that trains the model on the selected features (step 2).
[Image: Pipeline process: features are selected first, then the model is built using the selected features]
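A minimal sketch of such a pipeline; the choice of SelectKBest with chi2 as the selection step and LogisticRegression as the model is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=3)),            # step 1: keep the best 3 features
    ("model", LogisticRegression(max_iter=1000)),  # step 2: fit the model on them
])

pipe.fit(X, y)
print(pipe.score(X, y))   # accuracy of the model trained on the selected features
```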
Summary
In this post, we have explored:
- Dropping features which have low variance
- Univariate feature selection
- Model based feature selection
- Feature Selection using pipeline
References