Feature Selection: the "why", the "what" and the "how"

January 4, 2022

Data scientists often use Feature Selection techniques to reduce the number of features and keep only the most relevant and useful ones before training an ML model. Doing so can improve data quality and help the model focus on the most relevant information in the data, improving both the efficiency and the effectiveness of training.

Why Is Feature Selection Important?

Data is often collected before anyone knows which Machine Learning tasks it will be used for, so it is difficult to tell in advance which data will be valuable or even relevant. As a result, it is common practice to collect more data than necessary, just in case data scientists need it in the future.

Though this all sounds reasonable, when it comes to training a Machine Learning model, data scientists prefer high-quality data in which every feature is relevant to the ML task and contributes to accurate predictions. If the data contains irrelevant features or even random noise, some ML models may try to make sense of the noise and overfit the training data, and an overfitted model is likely to underperform on new data. The issue is aggravated as the noise becomes more prevalent.

Figure 1. Overfitting. Image by geeksforgeeks.org

What is Feature Selection?

In Machine Learning, Feature Selection is the process of selecting a subset of the most relevant/useful features for use in model construction. Apart from removing irrelevant features, feature selection can be beneficial in several other ways:

Shorten training times. The training time of many models depends on the number of data samples and the number of features. Where training time is critical, feature selection can be used to reduce the number of features and shorten the training time.
Remove redundancy. Some features can carry duplicate information, such as duplicated or highly correlated features. Such redundancy can give undue weight to the duplicated information, which in turn may cause models to overfit. Some feature selection techniques target such situations and reduce redundancy by removing certain features.
Avoid the curse of dimensionality. In order to learn any meaningful patterns in the data, the data itself has to be representative enough for the feature space from which the samples are drawn. For high-dimensional space, the number of samples required for learning meaningful patterns is usually enormous. When there are not enough samples, though all the features could be relevant, one should still use feature selection to reduce the number of features such that the new feature space is better represented by the data samples.
Limit the number of generated features. When engineering features, new features are often generated by encoding or combining features. Some feature generation methods may introduce an enormous number of new features, which may introduce the curse of dimensionality mentioned above. When used collectively with such feature generation methods, feature selection can effectively limit the number of new features to avoid such issues.
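As a rough illustration of the last point above, the sketch below expands 10 synthetic features into all degree-2 polynomial terms and then keeps only the 10 generated features that score highest on a univariate F-test. The use of scikit-learn, the synthetic data, the polynomial degree and the value of k are all assumptions made for the example, not something prescribed by this article.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative only): 200 samples, 10 original features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200)

# Feature generation: originals, squares and pairwise products.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly.shape)  # 10 original features grow to 65 generated ones

# Feature selection: keep the 10 generated features with the highest F-score.
selector = SelectKBest(score_func=f_regression, k=10)
X_selected = selector.fit_transform(X_poly, y)
print(X_selected.shape)
```

Higher degrees or more original features make the generated set grow much faster, which is where a selection step like this becomes useful.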

Though some people believe ML models should be capable of automatically learning which features in the dataset are more valuable, in practice feature selection can still improve model performance substantially. It makes it easier for models to learn patterns from datasets with fewer samples. Furthermore, it can allow simpler, easier-to-explain models to perform as well as more advanced, harder-to-explain ones.

Figure 2. Feature Selection. Image by Chaitanya Sagar

How to Apply Feature Selection

There are many feature selection methods, built for various purposes and on various theories. The list below is non-exhaustive and covers the most common ones.

1. Data quality

Intuitively, low-quality or noisy data can hardly contribute to the accuracy of a model, so low-quality features are often removed before training. Several techniques serve this purpose. One looks at the variance of each feature and removes those whose variance falls below a certain threshold, since a near-constant feature carries little information. Others look at the pairwise correlations among features and, for each highly correlated pair, remove one of the two. In addition, if a feature is not correlated with the target at all, a model is unlikely to benefit from it. Some feature selection techniques therefore use correlation metrics to evaluate the potential relationship between each feature and the target, and keep only the most relevant features.
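The three filters described above can be sketched as follows. This is a minimal example that assumes pandas and scikit-learn are available; the function names, the thresholds and the toy data are hypothetical choices for illustration, and Pearson correlation is only one of many possible correlation metrics.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_low_variance(df, threshold=0.01):
    """Keep only the columns whose variance is above the threshold."""
    selector = VarianceThreshold(threshold=threshold).fit(df)
    return df.loc[:, selector.get_support()]


def drop_highly_correlated(df, threshold=0.95):
    """Drop every column that is highly correlated with an earlier column."""
    corr = df.corr().abs()
    # Keep the strict upper triangle so each pair is inspected only once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def keep_most_target_correlated(df, target, k):
    """Keep the k columns with the largest absolute correlation to the target."""
    scores = df.corrwith(target).abs().sort_values(ascending=False)
    return df[scores.index[:k]]


# Toy data: "b" nearly duplicates "a", "c" is constant, "d" is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),
    "c": np.ones(200),
    "d": rng.normal(size=200),
})
target = pd.Series(a + 0.1 * rng.normal(size=200))

filtered = drop_highly_correlated(drop_low_variance(df))
selected = keep_most_target_correlated(filtered, target, k=1)
print(list(filtered.columns), list(selected.columns))
```

The variance filter is applied first so that constant columns, whose correlation is undefined, never reach the correlation step.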

2. Information theory

Several feature selection mechanisms use mutual information to score features. Instead of looking at the correlation between each feature and the target separately, they decide which feature to include by evaluating which one provides the most information gain when added to the set of already selected features. Such mechanisms generally start with an empty set and iteratively add the feature that provides the most information gain until a certain number of features is reached.
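A minimal sketch of this greedy procedure is shown below. It assumes all features and the target are discrete, and it measures information gain as the increase in mutual information between the jointly encoded selected features and the target, using scikit-learn's mutual_info_score. The helper names and toy data are hypothetical, and practical methods usually rely on cheaper approximations because joint estimates become unreliable as the selected set grows.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def joint_labels(columns):
    """Encode several discrete columns as one discrete variable."""
    stacked = np.column_stack(columns)
    _, labels = np.unique(stacked, axis=0, return_inverse=True)
    return labels.ravel()


def greedy_mi_selection(X, y, n_features):
    """Start from an empty set and repeatedly add the feature with the most gain."""
    selected, remaining = [], list(range(X.shape[1]))
    current_mi = 0.0  # mutual information between the selected set and y
    while remaining and len(selected) < n_features:
        gains = {}
        for j in remaining:
            cols = [X[:, k] for k in selected + [j]]
            gains[j] = mutual_info_score(joint_labels(cols), y) - current_mi
        best = max(gains, key=gains.get)
        selected.append(best)
        remaining.remove(best)
        current_mi += gains[best]
    return selected


# Toy data: column 1 is a noisy copy of column 0; y depends on columns 0 and 2.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 3, size=1000)
x2 = rng.integers(0, 2, size=1000)
noisy_copy = np.where(rng.random(1000) < 0.1, (x0 + 1) % 3, x0)
X = np.column_stack([x0, noisy_copy, x2])
y = 2 * x0 + x2

print(greedy_mi_selection(X, y, n_features=2))
```

On this toy data the noisy copy is individually more informative than column 2, but it adds almost nothing once column 0 has been selected, so the second pick is column 2 rather than the copy. A ranking based on per-feature correlation alone would not make that distinction.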

3. Feature importance

Some ML models can automatically identify which features are more relevant during training. Those models expose this information as feature importances or feature coefficients, and some feature selection techniques use it to select the most valuable features. However, features chosen in this way are biased by the ML model, and different models may assign different importance to the same feature. One way of alleviating such bias is to use multiple models of different types and aggregate their feature importances. If you already know which model will be used for the task, you can use that same model to select the features that will be most useful for it.
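The aggregation idea can be sketched as below, assuming scikit-learn and a binary classification task. The choice of a random forest and a logistic regression, the normalisation of each score vector to sum to one, and the simple averaging are illustrative assumptions; other models and aggregation rules are equally valid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def aggregated_importance(X, y, random_state=0):
    """Average the normalised importances of two different model types."""
    # Scale features so that linear coefficients are comparable across features.
    X_scaled = StandardScaler().fit_transform(X)
    forest = RandomForestClassifier(n_estimators=100, random_state=random_state)
    forest.fit(X_scaled, y)
    linear = LogisticRegression(max_iter=1000).fit(X_scaled, y)
    scores = [forest.feature_importances_, np.abs(linear.coef_).sum(axis=0)]
    # Normalise each score vector to sum to 1 before averaging.
    return np.mean([s / s.sum() for s in scores], axis=0)


def select_top_k(X, y, k):
    """Return the indices of the k features with the highest aggregated score."""
    importance = aggregated_importance(X, y)
    return np.argsort(importance)[::-1][:k]


# Toy data: only the first two of five features determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

print(select_top_k(X, y, k=2))
```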

4. Heuristic search

Feature selection can also be treated as an optimisation problem, in which case search algorithms can be applied to find the optimal subset of features. Such techniques search, systematically or heuristically, through the power set of all features. Each candidate subset is assigned a score according to some evaluation metric, for example the performance of a model trained using only that subset. Feature selection then becomes the task of finding the solution (a subset of features) with the highest or lowest score in the whole search space. Different search algorithms can be applied to this problem, including brute-force search, hill climbing, and metaheuristic algorithms.
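As one concrete instance, the sketch below runs a simple hill climb over feature subsets, scoring each subset by the cross-validated accuracy of a logistic regression trained on it. The scoring model, the single-feature flip as the neighbourhood move, the iteration budget and the toy data are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def subset_score(X, y, mask):
    """Mean cross-validated accuracy of a model trained on the masked features."""
    if not mask.any():
        return 0.0  # an empty subset cannot be evaluated
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()


def hill_climb(X, y, n_iterations=60, seed=0):
    """Flip one random feature at a time; keep the flip only if the score improves."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape[1]) < 0.5  # random starting subset
    best_score = subset_score(X, y, mask)
    for _ in range(n_iterations):
        candidate = mask.copy()
        j = rng.integers(X.shape[1])
        candidate[j] = not candidate[j]
        score = subset_score(X, y, candidate)
        if score > best_score:
            mask, best_score = candidate, score
    return mask, best_score


# Toy data: only the first two of six features determine the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

mask, score = hill_climb(X, y)
print(mask, round(score, 3))
```

Hill climbing can stall in a local optimum, which is one reason metaheuristics such as genetic algorithms or simulated annealing are often used for larger search spaces.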

5. Information compression

All the above methods select a subset of the features without changing them. The cost of such selection is the loss of the information in the removed features. If you want to reduce the number of features without losing much information, and you don’t mind the features themselves being transformed, information compression methods can achieve that. Such techniques generally transform the dataset so that the valuable information is condensed into a smaller number of features. The most common methods in this category include PCA, SVD, and autoencoders.
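A minimal PCA example is sketched below, assuming scikit-learn. The synthetic data is built from two hidden factors spread across ten columns, and the 95% explained-variance target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# Standardise first so that no feature dominates purely because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_compressed = pca.fit_transform(X_scaled)

print(X.shape, X_compressed.shape)
print(pca.explained_variance_ratio_.sum())
```

Note that the resulting columns are linear combinations of the original features, so they no longer map to individual named features, which is the trade-off described above.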

These are just some common feature selection mechanisms to start with; many more are used in niche areas. The techniques can be combined to create new ones or applied sequentially to complement each other. Which ones to use depends on your dataset and what you would like to achieve. Hopefully, this article gives you enough information to start building feature selection algorithms for your own scenarios.

Better Feature Selection with EvoML

Our EvoML platform uses a combination of different feature selection techniques that automatically adapt to a given dataset. It first analyses the dataset and removes any duplicate or highly correlated features. It then selects the best features by combining feature importance with some of the data quality approaches. In addition, information compression approaches may or may not be applied, depending on our meta-heuristic search algorithm. Furthermore, the whole feature selection process is highly customisable, so that our users have the ultimate flexibility when using EvoML.

You may also like our article on feature generation.

About the Author

Fan Wu | TurinTech CSO
Expert in evolutionary algorithms. Reputable paper author, multi-award winner and former Assistant Professor. Previously worked with BNP Paribas and Morgan Stanley.

Matvey Fedoseev | TurinTech Research Team
Enthusiastic about learning new things. Passionate about Finance, Data Science, Machine Learning and solving complex problems.