1) Communication: unclear questions and outcome metrics
A fundamental challenge facing data scientists has nothing to do with ensemble algorithms, optimization methods, or computing power. Communication – prior to any analysis or data engineering – is crucial to solving an ML problem quickly and painlessly.
There are many, many questions ML can solve: this is an incredibly powerful tool for making sense of the world around us. However, these questions have to be framed in a specific, formal way that may be unfamiliar to the people responsible for identifying the problem, such as management or marketing.
Questions as posed in a ‘real-world’ environment, while substantively useful for framing and approaching a business problem, are often too vague to translate directly into ML modeling. Because of this, it is crucial to communicate effectively between different branches within the organization: the ‘small’ question being solved by ML modeling has to match the ‘big’ question that constitutes the business problem itself.
2) Feature engineering: getting more information out of a data set
Feature engineering and feature selection are important parts of any ML task. Even with highly sophisticated estimation algorithms and powerful, cheap computing capabilities, the data scientist plays an important role in creating a model that is both accurate and efficient.
Significant time and energy can (and should!) be spent looking over the data itself to try to identify additional information that may be ‘hiding’ in the features already included. It may be, for example, that the difference between two values (say, the length of time since a customer’s most recent transaction) matters more to predictive accuracy than either value on its own.
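As a small sketch of this idea, here is what deriving such ‘hidden’ features might look like in pandas. The column names and reference date are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical customer data: raw dates by themselves are rarely predictive.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "last_purchase_date": pd.to_datetime(["2024-01-10", "2023-11-02", "2024-02-20"]),
    "signup_date": pd.to_datetime(["2023-06-01", "2023-01-15", "2024-01-05"]),
})

# The *differences* between dates -- recency of the last transaction and
# customer tenure -- often carry far more predictive signal.
reference_date = pd.Timestamp("2024-03-01")
df["days_since_last_purchase"] = (reference_date - df["last_purchase_date"]).dt.days
df["tenure_days"] = (reference_date - df["signup_date"]).dt.days

print(df[["customer_id", "days_since_last_purchase", "tenure_days"]])
```

The engineered columns can then be fed to the model alongside (or instead of) the raw dates.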
This means that feature engineering is a combination of subject matter expertise and general intuition: skilled feature engineers can pull the maximum amount of useful information out of a given set of input data, giving an ML model the most informative data set possible to work with.
3) Logistics: budgeting computational resources
Few things are more frustrating than putting in hours, days, or weeks of work on cleaning and preparing a data set for analysis, only to hit an ‘out of memory’ error when trying to build the finalized model. Budgeting computational resources for ML estimation can be tricky: over-budgeting on powerful computing systems can waste significant money, but under-budgeting can produce severe bottlenecks in model construction and deployment.
However, cloud computing has taken dramatic steps towards making computational pipelines more expandable. Using a system like Amazon’s AWS allows for the deployment of larger virtual machines (or greater numbers of machines, if working in parallel) with relatively low cost and high speed. This type of elastic-computing framework makes it much, much easier to budget appropriately when setting up an ML system, especially when working with very large data sets.
4) Generalizability: conflation of training and testing data sets
Particularly for those who are first getting into data science, this can be an easy step to miss, but it is incredibly important. ML models are built for estimation: their purpose is to take in new data and generate values that can guide future decisions. Because of this, it is absolutely crucial to separate the ‘training’ data used to fit an ML model from the ‘testing’ data used to assess the model’s accuracy.
Failure to do some type of out-of-sample testing can result in a model that looks fantastic in terms of accuracy and fit statistics… and then fails miserably when faced with new, unfamiliar data. Generalizability is key to creating usable long-term ML solutions, and as such, models need to be tested on independent, out-of-sample data before being put into regular use. A solid rule of thumb is to hold back 20-25% of the original data set: this is testing data, and should be kept entirely separate from the 75-80% of data used to build the ML model itself.
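A minimal sketch of this holdout split using scikit-learn, with synthetic data standing in for a real feature matrix, and the 20% test fraction following the rule of thumb above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold back 20% as testing data; it must never touch model fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# In-sample accuracy is optimistic; the test-set score is the honest estimate
# of how the model will handle new, unfamiliar data.
print(f"train accuracy: {accuracy_score(y_train, model.predict(X_train)):.3f}")
print(f"test accuracy:  {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

The gap between the two printed numbers is a quick first read on how badly the model is overfitting.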
5) Focusing on the little things: algorithm choice
The range of algorithms available for ML problem solving is astounding. Random forests, support vector machines, neural networks, Bayesian estimation methods – the list goes on (and on, and on). The question of what algorithm is best for a given ML problem, however, is often less impactful than we might think.
It’s true that some approaches, on some questions, will work better than others. In some cases, this difference can even be quite distinct. However, in my experience it’s been quite rare that one modeling approach will strictly dominate all other options in answering a given ML question.
A useful middle ground in selecting an algorithm, in my opinion, is to build a ‘stable’ of robust modeling approaches that can be built quickly and easily for day-to-day use. Running a battery of models on a given data set allows the data scientist to pick whatever approach has the greatest marginal gain on that particular data set.
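Such a ‘stable’ can be as simple as a dictionary of off-the-shelf scikit-learn models scored with cross-validation. This is only a sketch of the idea; the particular models chosen here are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a real data set.
X, y = make_classification(n_samples=500, n_features=15, random_state=0)

# A small 'stable' of robust, quick-to-fit modeling approaches.
stable = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svm": SVC(),
}

# Run the full battery and rank by mean cross-validated accuracy.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in stable.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score:.3f}")

best = max(scores, key=scores.get)
```

Whichever approach tops the ranking on this particular data set is the one worth tuning further.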
Going far afield for exotic new algorithms, or adopting new programming languages to chase them, is in my opinion rarely necessary or even worth the time.