The term ‘big data’ is thrown around so often that it is rightfully becoming a cliché. However, big data – the quantification, recording, and analysis of human behavior and interaction – is undeniably here to stay.
Any action taking place in the virtual realm leaves evidence in the form of click-through rates, registration forms, transaction records, and on and on and on. For many businesses, big data represents a revolutionary opportunity to improve their market penetration and customer engagement: ideally, the more we know about consumers and clients, the better we can meet their needs.
However, the blessing of big data is also a curse: this ‘fire hose’ of incoming data is unparalleled in history – not just the amount of raw data, but the speed at which new information is constantly being generated.
Companies and groups seeking to capitalize on this torrent of data can find themselves drowning in opportunities, as they are presented with dozens, hundreds, or thousands of sources of information on consumer behavior that might be important in predicting outcomes of interest.
Big Data Analysts Are Drowning in a Sea of Data
While computing power has grown enormously in the past decade (and shows no sign of slowing down any time soon), the pace of big data generation has exploded as well. This means that even with powerful computing engines and sophisticated estimation algorithms, dealing with huge amounts of data can be infeasible for two reasons.
First, simply throwing everything into a machine-learning algorithm is not a viable option when you are working with hundreds, thousands, or even millions of features. This is particularly true when prediction methods have to be agile, able to adjust to changes in incoming data and underlying relationships in near real time.
Second, the logistics involved in gathering and curating massive data sets are substantial. Disparate sources of information have to be cultivated and maintained, and this can require significant and ongoing investment in data gathering and storage systems.
Feature Selection Becomes Increasingly Important as the Number of Data Points Grows
Big data analysis, more than ever, requires the ability to sift through enormous reams of non-useful information to find the key features that have predictive value for the business problem at hand, and to focus on those features to create predictive engines that are not only accurate, but also parsimonious.
A key technique in building parsimonious models is feature selection: picking the set of ‘best’ features that provide the most bang for your computational buck. Winnowing a data set from thousands of potential predictive features down to tens saves computing time, storage capacity, and – most importantly – the time and effort of your data science team.
This allows your team to focus on gathering and curating only the most useful data sources; on engineering and identifying new latent information from a smaller, more manageable subset of raw features; and on creating highly agile models that can be retrained in minutes rather than hours or days.
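To make the winnowing step concrete, here is a minimal sketch of what it can look like in practice, assuming a scikit-learn workflow and a synthetic stand-in for a wide business data set; the choice of mutual-information scoring and keeping exactly ten features are illustrative, not a recommendation.

```python
# Winnow a wide feature set down to a handful of high-value predictors.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a wide data set: 5,000 rows, 1,000 features,
# of which only a few carry real signal.
X, y = make_classification(n_samples=5000, n_features=1000,
                           n_informative=10, random_state=0)

# Score every feature against the target and keep the 10 strongest.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)              # (5000, 1000) -> (5000, 10)
print("kept columns:", selector.get_support(indices=True))
```

Downstream models are then trained on `X_reduced` rather than the full feature matrix, which is where the savings in compute, storage, and analyst time come from.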
What Does the Future Hold for Feature Selection?
Feature selection relies on statistical estimation to identify how important a given input feature is to a predictive engine’s accuracy. Because of this, feature selection techniques generally involve a high up-front cost: repeatedly running predictive models on different subsets of features and comparing outputs can be expensive in terms of time and human effort.
The ability to automate this process is hugely important: while still computationally expensive, a ‘set it and forget it’ system to automatically engineer and select the most useful features is crucial to fast, accurate, and parsimonious data analysis. With this capability, a small data science team can gain enormous leverage on big data that involves many features as well as many observations.
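As a rough sketch of what such an automated loop can look like, the snippet below uses recursive feature elimination with cross-validation: the model is repeatedly refit on shrinking feature subsets and the subset size that scores best is kept, with no manual intervention. This again assumes a scikit-learn stack; the estimator, step size, and scoring choices are illustrative.

```python
# A 'set it and forget it' selection loop: RFECV repeatedly refits a model on
# shrinking feature subsets and keeps the best-scoring subset automatically.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=8, random_state=0)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),  # model used to rank features
    step=10,              # drop 10 features per iteration
    cv=5,                 # 5-fold cross-validation to compare subsets
    scoring="accuracy",
)

# Chain selection and the final model so retraining is a single fit() call.
pipeline = Pipeline([("select", selector),
                     ("model", LogisticRegression(max_iter=1000))])
pipeline.fit(X, y)

print("features kept:", pipeline.named_steps["select"].n_features_)
```

The up-front cost described above is visible here: every iteration refits the model on several cross-validation folds. But once wrapped in a pipeline, the whole procedure reruns unattended whenever new data arrives.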