Machine learning has revolutionized our world. It powers our smartphones, determines which advertisements we see, and will soon dominate the automobile industry through self-driving cars. Even Google now wants all of its engineers to know at least some machine learning.
But how does machine learning compare to a model-first technique? And are these concepts necessarily at odds with each other?
In practice, the machine-learning approach typically goes something like this:
- Identify the response variable
This is the variable to be predicted (e.g., ‘banking transaction is compromised, or not’; ‘patient is infected with the Zika virus, or not’)
- Get data
The more, the better.
- Select features
Features are a particular subset of data, or calculations from data, that are hypothesized to be predictive. For instance, if we’re predicting whether a news story is about politics or sports, the nouns used in the article (e.g., ‘Congress’ or ‘football’) would probably make good features. Words like ‘the’ or ‘a,’ on the other hand, would not.
- Choose a machine-learning algorithm
For classification problems (e.g., ‘patient has Zika, or not’), candidate algorithms include logistic regression, naive Bayes, neural networks and support vector machines. If the response variable is continuous (e.g., house prices), algorithms such as a regularized form of linear regression might be appropriate.
- Tune hyper-parameters
Most machine-learning algorithms have hyper-parameters that need tuning. These are often selected using a technique such as cross-validation.
- Test
Finally, it is important to test the performance of the algorithm. Usually, a technique like k-fold cross-validation is used to compute metrics such as error rate, precision and recall. (A toy code sketch illustrating the feature, algorithm, tuning and testing steps appears after this list.)
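To make those steps concrete, here is a minimal scikit-learn sketch of the whole workflow. It is deliberately a toy: the politics-vs-sports corpus, the labels and the parameter grid are all invented for illustration, and a real project would use far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: is the story about politics or sports?
docs = [
    "Congress passed the budget bill after a long debate",
    "The senator proposed new legislation on Tuesday",
    "Voters went to the polls for the presidential election",
    "The governor vetoed the controversial tax bill",
    "Parliament debated the new immigration policy",
    "The president signed the trade agreement",
    "The quarterback threw three touchdowns in the football game",
    "The team won the championship after overtime",
    "The striker scored twice in the soccer match",
    "The pitcher threw a no-hitter last night",
    "The coach benched the starting goalkeeper",
    "Fans cheered as the runner broke the marathon record",
]
labels = ["politics"] * 6 + ["sports"] * 6

# Features: word counts, dropping uninformative words like 'the' and 'a'.
# Algorithm: naive Bayes, a common choice for text classification.
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# Tune a hyper-parameter (the smoothing constant alpha) by cross-validation.
grid = GridSearchCV(pipeline,
                    param_grid={"multinomialnb__alpha": [0.1, 1.0, 10.0]},
                    cv=3)
grid.fit(docs, labels)
print("Best alpha:", grid.best_params_)

# Test: estimate out-of-sample accuracy with 3-fold cross-validation.
scores = cross_val_score(grid.best_estimator_, docs, labels, cv=3)
print("Mean accuracy: %.2f" % scores.mean())
```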
It is not uncommon to find practitioners trying different types of machine-learning model (e.g., a support vector machine instead of logistic regression) before stumbling on a model that seems to work best. I remember attending a natural language processing conference where it seemed as if the theme of every talk was something like: “We tried machine learning algorithm X, but that didn’t work, so then we tried machine learning algorithm Y, but that didn’t work, so we then tried Z, and that seemed to give the best results.”
But, as almost any practitioner will tell you, great features always trump any fancy machine-learning model.
There are dangers in machine learning. One is that correlation doesn’t necessarily imply causality. (For some entertainment, take a look at this to see what I mean.) In fact, some journals have even considered banning p-values altogether.
It’s important to realize, however, that machine learning and model-first approaches don’t have to be at odds.
As an example, at one stage, Google engineers were attempting to solve the problem of recommending the right coffee shops by learning from data. From their initial data set, they found that on the whole user satisfaction and distance traveled were negatively correlated. Makes sense, right? Who wants to walk for hours to get their caffeine fix? But when they tried using different machine-learning algorithms, they found that the trend lines produced didn’t match this obvious answer. At this point, they decided to try a model-first approach. They started with the hypothesis that the function they wanted to learn from the data set should be monotonically decreasing (i.e., that less distance walked was always better than more, assuming an equal quality of coffee drinking experience). This very basic model-first approach led to a brand new and vastly superior set of results; Google still learned from the data set, but the calculations were made within the context of the new model.
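One simple way to encode that kind of monotonicity assumption in code (this is only an illustrative sketch, not Google’s actual method, and the distances and satisfaction scores below are invented) is isotonic regression, which fits the best curve subject to a ‘never increases’ constraint:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical data: distance walked (minutes) vs. a user-satisfaction score.
distance = np.array([1, 2, 3, 5, 8, 10, 15, 20, 30], dtype=float)
satisfaction = np.array([4.8, 4.9, 4.5, 4.6, 4.0, 4.1, 3.5, 3.2, 2.5])

# The model-first assumption: satisfaction never increases with distance.
model = IsotonicRegression(increasing=False, out_of_bounds="clip")
model.fit(distance, satisfaction)

# Predictions for unseen distances respect the monotonicity constraint.
print(model.predict([4.0, 12.0, 25.0]))
```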
At Haystax, we’ve witnessed similar benefits of a model-first approach firsthand. For instance, we worked on the problem of detecting compromised transactions for a major international bank. It was hypothesized that specific signatures, such as the presence of an unusual HTML tag name, would be indicative of malware. Of course, there are benign reasons why an unusual HTML tag name might appear; it could, for example, be harmlessly injected by a customer’s browser. We exploited such knowledge using our Fusion modeling framework. We then used the bank’s data to estimate parameters such as the conditional probability that a particular HTML tag name would be observed for a given page URL. So although two different HTML tag names might end up with the same anomaly score, our top-level hypothesis reacts differently depending on mitigating concepts.
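Fusion itself is proprietary, so the following is only a generic sketch of the kind of parameter estimate described above: a Laplace-smoothed estimate of the probability of seeing a given HTML tag name on a given page URL, with rarer combinations receiving higher anomaly scores. The observations, URLs, tag names and smoothing constants are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical observations of (page URL, HTML tag name) pairs.
observations = [
    ("/login", "input"), ("/login", "input"), ("/login", "script"),
    ("/login", "form"), ("/accounts", "table"), ("/accounts", "div"),
]

tag_counts = defaultdict(Counter)
for url, tag in observations:
    tag_counts[url][tag] += 1

def conditional_probability(tag, url, alpha=1.0, vocab_size=1000):
    """Laplace-smoothed estimate of P(tag | url)."""
    counts = tag_counts[url]
    return (counts[tag] + alpha) / (sum(counts.values()) + alpha * vocab_size)

def anomaly_score(tag, url):
    """Higher score = more surprising tag for this URL."""
    return -log(conditional_probability(tag, url))

print(anomaly_score("input", "/login"))    # common tag: lower score
print(anomaly_score("marquee", "/login"))  # unusual tag: higher score
```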
Christopher Bishop also recommends a type of ‘model-first’ approach in a paper on model-based machine learning.
The idea is simply that in any problem you model, you should start with concepts, which can then be linked to other concepts. If the links are given direction, they can express the assumption that one concept causes another to occur. By constructing this graph, you are using a model-first approach.
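As a minimal sketch of what constructing such a concept graph might look like in code (the concepts are borrowed from the banking example above; this is neither Bishop’s tooling nor our Fusion framework):

```python
import networkx as nx

# Each node is a concept; each directed edge says "this concept influences that one."
graph = nx.DiGraph()
graph.add_edges_from([
    ("malware present",           "unusual HTML tag injected"),
    ("benign browser plugin",     "unusual HTML tag injected"),
    ("unusual HTML tag injected", "unusual tag observed on page"),
])

# A basic model-first sanity check: the assumed influences should form a DAG,
# the structure on which a Bayesian network (or similar) can then be built.
assert nx.is_directed_acyclic_graph(graph)
print(list(nx.topological_sort(graph)))
```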
Now it may turn out that the topology of the graph is identical to the topology of a hidden Markov model. If so, then by all means use a standard hidden-Markov-model algorithm to estimate the parameters. If the model yields poor results, this can actually be exciting: when we revise the model, we can review assumptions that were too weak or too strong, or discover whether other concepts are now relevant. And if the results improve after revision, we’ve learned more about the underlying process that generates the observed data.
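If the topology did match a hidden Markov model, the parameter estimation could look roughly like this sketch using the hmmlearn library, with random placeholder data standing in for real observations:

```python
import numpy as np
from hmmlearn import hmm

# Placeholder observation sequence; in practice this would be real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))

# Fit a two-state Gaussian HMM: the EM algorithm estimates the transition
# matrix, state means and variances from the observed sequence.
model = hmm.GaussianHMM(n_components=2, n_iter=50, random_state=0)
model.fit(X)

print(model.transmat_)       # learned transition probabilities
print(model.means_.ravel())  # learned state means
```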
Note: This is the third article of a five-part series. The first article, Three Security Analytics Approaches That (Mostly) Don’t Work, can be found here. The second article, Making Bayesian Networks Accessible, can be found here.
Matt Johnson is a Data Scientist at Haystax Technology.