Even with advances in machine learning (ML) techniques, computing power, and programming accessibility in the last five years, approaching an ML question can be a daunting process. How do you take reams of messy, unstructured, or incomplete data and turn it into something of value?
Every problem is different, but several key steps stay fairly constant across the ML estimation process. Following these steps won’t solve every issue, but it may help head off some of the most common questions and problems.
1. Identify the question.
This seems almost too obvious to state, but it’s one of the most common issues we see in the process of machine learning. Algorithms and techniques are only as useful as the question they are asked to answer, so spending some time formulating exactly what you want to know can save a great deal of trouble down the line.
At its core, a good ML question is specific, both substantively and technically. Broadly stated questions can be a useful starting point for thinking about a given business problem, but when building a predictive engine, both the inputs and the outputs have to be specified very clearly.
- a. General question: “How can we improve customer satisfaction?” This is a good jumping-off point to think about the problem, but it does not translate into a statistical process.
- b. ML question: “What factors are most closely associated with negative customer reviews?” Here the general question has been narrowed to a specific one, with a measurable output (negative reviews) and a set of candidate inputs (aspects of the customer experience).
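One way to check that a question is specific enough is to write it down as an explicit specification before touching any data. The sketch below is only an illustration: the `MLQuestion` class and every field and feature name in it are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MLQuestion:
    """A question stated precisely enough to build a model around (hypothetical helper)."""
    target: str    # the output to explain or predict
    inputs: tuple  # candidate input features
    unit: str      # what one row of data represents

    def is_specific(self) -> bool:
        # A usable question names an output, at least one input, and a unit of analysis.
        return bool(self.target and self.inputs and self.unit)


# "How can we improve customer satisfaction?" names none of these yet.
general = MLQuestion(target="", inputs=(), unit="")

# The narrowed question, with invented feature names for illustration.
specific = MLQuestion(
    target="review_is_negative",
    inputs=("support_calls", "minutes_on_phone", "delivery_delay_days"),
    unit="one customer order",
)

print(general.is_specific(), specific.is_specific())
```

If any field is hard to fill in, that is usually a sign the question is still at the general stage.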
2. Identify the metric you want to estimate or predict.
Again, this is an obvious issue, but one that comes up quite often. ML models, like any other numerical process, deal in concrete values. If you want to measure customer satisfaction, what metrics do you have? Are they explicit (star ratings) or implicit (a drop-off in usage rate)? Can you get a reliable set of previous records (a ‘training set’) that can be used to fit an ML model and predict future values?
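To make the explicit/implicit distinction concrete, here is a minimal sketch that computes one metric of each kind from the same records. The customer records, the two-star threshold, and the first-half versus second-half definition of "drop-off" are all assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical records: a star rating (may be missing) and weekly sessions, oldest first.
customers = {
    "a": {"stars": 5, "weekly_sessions": [10, 11, 9, 10]},
    "b": {"stars": None, "weekly_sessions": [12, 10, 4, 2]},
    "c": {"stars": 2, "weekly_sessions": [8, 8, 7, 3]},
}


def explicit_negative(stars, threshold=2):
    """Explicit metric: a rating at or below the threshold counts as negative."""
    return None if stars is None else stars <= threshold


def usage_drop(sessions):
    """Implicit metric: fractional drop in mean usage from the first half to the second.

    Assumes at least two periods of data.
    """
    half = len(sessions) // 2
    before, after = mean(sessions[:half]), mean(sessions[half:])
    return (before - after) / before if before else 0.0


for name, rec in customers.items():
    print(name, explicit_negative(rec["stars"]), round(usage_drop(rec["weekly_sessions"]), 2))
```

Customer "b" never left a rating, so only the implicit metric says anything about them. That is a common reason to track both.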
3. Identify the general statistical approach that you need to use in answering your ML question.
Say you want to identify relationships between input and output features: you might expect that people who call customer support often, or spend more time on the phone, are more likely to leave critical reviews, but you want to know the degree to which this is the case.
You might want to use a parametric model that focuses on explaining these relationships in simple terms that can be used as rule sets going forward. Alternatively, you might be interested in prediction rather than explanation: you want to reliably identify disgruntled customers and make efforts to keep them interested.
Here, you might want to use a more ‘black-box’ ML model that allows for very complex interdependencies between inputs and outputs in an effort to maximize predictive accuracy at the expense of simplicity.
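As a sketch of the explanatory route, the snippet below fits a one-variable logistic regression by plain gradient descent and reads the coefficient as an odds ratio. The data are made up, and the learning rate and iteration count are arbitrary choices that happen to work for this toy example. In practice you would more likely reach for an established statistics package.

```python
import math

# Hypothetical training set: (number of support calls, left a negative review?).
data = [(0, 0), (1, 0), (1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (4, 1), (5, 1), (6, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, lr=0.1, epochs=5000):
    """Fit P(negative) = sigmoid(b0 + b1 * calls) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in rows:
            err = sigmoid(b0 + b1 * x) - y  # per-row gradient of the log-loss w.r.t. the linear term
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


b0, b1 = fit_logistic(data)
odds_ratio = math.exp(b1)
print(f"each extra call multiplies the odds of a negative review by about {odds_ratio:.2f}")
```

The appeal of a parametric model is that its output is a single, readable number per input. A black-box model might predict better on richer data, but it would not hand you a rule this simple.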
4. Identify what needs to be done to prepare the input data for analysis.
ML techniques are statistical engines, and as such, they need to work with numbers. If you have significant data stored in ‘messier’ formats such as email records, call logs, security footage, and so on, then these data will need to be converted into a tabular, numeric format.
You might also have issues with missing or mis-coded data: these problems need to be identified and solved as well. This can take significant time and effort. In our experience, 80-90% of project time goes to cleaning, munging, and pre-processing data to ready it for analysis.
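The sketch below shows what this kind of pre-processing can look like on a small scale: free-text log lines are parsed into numeric rows, mis-coded values are turned into explicit missing values, and missing durations are filled with the median. The log format is invented for this example, and median imputation is just one simple option among many.

```python
import re
from statistics import median

# Hypothetical raw call-log lines; the format is invented for illustration.
raw_log = [
    "2021-03-01 customer=17 duration=12m resolved=yes",
    "2021-03-02 customer=17 duration=?? resolved=no",
    "2021-03-02 customer=42 duration=3m resolved=YES",
    "2021-03-05 customer=42 duration=45m resolved=n/a",
]

LINE = re.compile(r"customer=(\d+) duration=(\S+) resolved=(\S+)")


def parse(line):
    """Turn one log line into a numeric row, using None for unusable values.

    Assumes every line contains the three key=value fields.
    """
    cust, dur, res = LINE.search(line).groups()
    minutes = int(dur[:-1]) if re.fullmatch(r"\d+m", dur) else None
    resolved = {"yes": 1, "no": 0}.get(res.lower())  # anything else becomes None
    return {"customer": int(cust), "minutes": minutes, "resolved": resolved}


rows = [parse(line) for line in raw_log]

# Median imputation for missing durations, with a flag recording what was filled in.
known = [r["minutes"] for r in rows if r["minutes"] is not None]
fill = median(known)
for r in rows:
    r["minutes_missing"] = int(r["minutes"] is None)
    if r["minutes"] is None:
        r["minutes"] = fill

for r in rows:
    print(r)
```

Keeping a `minutes_missing` flag alongside the filled value lets a later model learn whether missingness itself carries signal.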
5. Identify the algorithm and computing platform you plan on using in answering your ML question.
Interestingly, the choice of algorithm is often less crucial than it might seem. There are dozens or even hundreds of algorithms that may prove useful in answering your ML question, but nine times out of ten, the choice comes down to availability and logistics.
Availability deals with how easy it is to put an ML model into production. What coding platform do your data scientists use? Are there readily available ML packages or libraries that they can use to accomplish this task? If there are, use them! Logistics are often the more pressing concern when building an ML system. Is the data set small enough that everything can be run on a laptop? Is it large enough to necessitate a distributed computing framework like Hadoop? Identifying logistical issues early on can save both time and money.
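A back-of-the-envelope calculation is often enough to answer the laptop-or-cluster question. The sketch below assumes a dense table of 8-byte numeric values and a headroom factor of 3 to allow for the copies that fitting tends to create. Both numbers are rough assumptions, not rules.

```python
def estimated_gib(n_rows, n_cols, bytes_per_value=8):
    """Rough in-memory size of a dense numeric table, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


def fits_on_one_machine(n_rows, n_cols, ram_gib, headroom=3.0):
    """Heuristic: require several times the raw size in RAM (headroom is an assumption)."""
    return estimated_gib(n_rows, n_cols) * headroom <= ram_gib


# One million rows by 50 columns on a 16 GiB laptop.
print(fits_on_one_machine(1_000_000, 50, ram_gib=16))

# Two billion rows by 50 columns on the same laptop.
print(fits_on_one_machine(2_000_000_000, 50, ram_gib=16))
```

When the answer is a clear "no", that is the point at which distributed tooling starts to earn its overhead.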
This is a lot of work summarized in five deceptively simple steps. However, tackling an ML solution in this rough order is, in our opinion, a good conceptual foundation. Clearly identifying the problem and the metric you want to estimate helps eliminate uncertainty or miscommunication between the data science team and other members of the organization.
Setting up an appropriate statistical method, supported by adequate programming and computing infrastructure, will help answer this question clearly and quickly, with a minimum of technical and logistical issues.
Identifying the pre-processing jobs will help set up a realistic timeline to completion, and maximize the leverage your ML approach can get when dealing with various data sources. The last step in this process is the easiest: start the algorithm running, cross your fingers, and hope for the best!