# Principled AI with Probabilistic Machine Learning

By Daniel Emaasit, May 29, 2018

At Haystax Technology, we are proponents and early adopters of principled approaches to machine learning (ML) and artificial intelligence (AI) for cybersecurity.

We use the term ‘principled AI’ to describe what we call our Bayesian AI approach, which is built on the coherent mathematical principles of probability theory, information theory and Bayesian decision theory. These principles help us keep our AI transparent, explainable and interpretable. Most importantly, they enable our systems to quantify uncertainty, unlike the black-box approach of deep neural networks. Our users and followers often hear us evangelize this principled approach through publications and conferences, boot camps and local meetups.

Last month, I gave a presentation titled “Introduction to Probabilistic Machine Learning using PyMC3” at two local meetup groups (Bayesian Data Science D.C. and Data Science & Cybersecurity) in McLean, Virginia. The following is a summary of the concepts we discussed during the meetup.

### General Overview

Many data-driven solutions in cybersecurity rely heavily on machine learning to detect and predict cyber crimes. This may include monitoring streams of network data and flagging unusual events that deviate from the norm, for example, an employee downloading large volumes of intellectual property on a weekend. Immediately, we are faced with our first challenge: we are dealing with quantities (unusual volume, unusual period) whose values are uncertain. To be more concrete, we start off very uncertain whether this download event is unusually large, and then become more and more certain as we uncover more clues, such as the day of the week, the employee’s performance reviews and whether they have visited WikiLeaks.

In fact, the need to deal with uncertainty arises throughout our increasingly data-driven world. Whether it’s Uber autonomous vehicles needing to predict pedestrians on roadways or Amazon’s logistics apparatus having to optimize its supply chain, these applications are compelled to handle and manipulate uncertainty. Consequently, we need a principled framework for quantifying uncertainty that will allow us to create applications and build solutions in ways that can represent and process uncertain values.

Fortunately, there is a simple framework for manipulating uncertain quantities, and it uses probability to quantify the degree of uncertainty. As Prof. Zoubin Ghahramani, Uber’s Chief Scientist and Professor of AI at the University of Cambridge, put it:

> The mathematical language for representing uncertainty is probability theory. So in the same way as calculus is the language for thinking about rates of change, probability theory is the mathematical language for representing uncertainty.

This has resulted in a principled approach to machine learning based on probability theory called probabilistic machine learning. It is an exciting area of research that is currently receiving a lot of attention in conferences (NIPS, UAI, AISTATS), journals (JMLR, Nature), open-source software tools (TensorFlow Probability, Pyro) and practical applications at notable companies such as Uber AI, Facebook AI Research, Google AI, and Microsoft Research.

### Probabilistic Machine Learning

In general, probabilistic machine learning (PML) can be defined as an interdisciplinary field focusing on both the mathematical foundations and practical applications of systems that learn models from data. It brings together ideas from statistics, computer science, engineering and cognitive science as illustrated in the figure below.

Image Credit: http://mlg.eng.cam.ac.uk/zoubin/

In this framework, a model is defined as a description of data one could observe from a system. In other words, a model is a set of assumptions that describe the process by which the observed data was generated. This model can be developed graphically in the form of a probabilistic graphical model as illustrated in the figure below.

The circular nodes above represent random variables for the uncertain quantities (e.g., unusual volume or unusual period) and the square nodes represent the uncertainty over the corresponding quantities (e.g., the probability of unusual volume). The downward arrow shows the direction of the process that generated the data. The upward arrow shows the direction of inference, that is, given observed data we can learn the parameters of the probability distributions that generated the observed data. As we observe more and more data, our uncertainty over the random variables (e.g., unusual volume) decreases. This is the modern view of machine learning according to Prof. Chris Bishop of Microsoft Research.
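The two directions described above can be sketched with a deliberately small model. This is only an illustration, not the model from the meetup: it assumes each download is flagged as unusual with an unknown probability `theta`, places a uniform Beta(1, 1) prior on `theta`, and uses the standard conjugate Beta update for inference. The function names and the value `true_theta = 0.1` are hypothetical.

```python
import random
from math import sqrt


def simulate_downloads(theta, n, rng):
    """Generative direction: draw n flags, each 1 (unusual) with probability theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]


def beta_posterior(alpha, beta, data):
    """Inference direction: conjugate update of a Beta(alpha, beta) prior on theta."""
    unusual = sum(data)
    return alpha + unusual, beta + len(data) - unusual


def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, sqrt(variance)


rng = random.Random(0)
true_theta = 0.1  # hypothetical "true" rate of unusual downloads
data = simulate_downloads(true_theta, 1000, rng)

# Watch the posterior over theta as more observations arrive.
posterior_stds = []
for n in (0, 10, 100, 1000):
    a, b = beta_posterior(1.0, 1.0, data[:n])
    mean, std = beta_mean_std(a, b)
    posterior_stds.append(std)
    print(f"n={n:4d}  posterior mean={mean:.3f}  posterior std={std:.3f}")
```

The posterior standard deviation is one simple summary of the remaining uncertainty about `theta`. In this sketch it typically shrinks as `n` grows, which is the behaviour the paragraph above describes.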

Learning follows from two simple rules of probability, namely:

• The sum rule: $p(\mathbf{\theta}) = \sum_{y} p(\mathbf{\theta}, y)$
• The product rule: $p(\mathbf{\theta}, y) = p(\mathbf{\theta}) p(y \mid \mathbf{\theta})$

These two rules can be combined into Bayes’ theorem, which tells us how to update our beliefs about our original hypothesis (or parameters) given observed data:

$$
p(\mathbf{\theta} \mid \textbf{y}) = \frac{p(\textbf{y} \mid \mathbf{\theta}) \, p(\mathbf{\theta})}{p(\textbf{y})},
$$

where:

$p(\mathbf{\theta}\mid \textbf{y})$ = the posterior distribution of the hypothesis (or parameters), given the observed data
$p(\textbf{y} \mid \mathbf{\theta})$ = the data likelihood, given the hypothesis (or parameters)
$p(\mathbf{\theta})$ = the prior over all possible hypotheses (or parameters)
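$p(\textbf{y})$ = the marginal likelihood (or evidence) of the data, a normalizing constant that does not depend on the hypothesis (or parameters)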
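As a worked illustration of the formula, the sketch below applies Bayes’ theorem to two competing hypotheses about a download event, updating the belief one clue at a time. All of the numbers (the prior and the per-clue likelihoods) are made up for illustration, the hypothesis names are hypothetical, and the clues are assumed to be conditionally independent given the hypothesis.

```python
def bayes_update(prior, likelihood):
    """Return the posterior p(theta | y) over a discrete set of hypotheses.

    prior:      dict mapping hypothesis -> p(theta)
    likelihood: dict mapping hypothesis -> p(y | theta) for the observed clue y
    """
    # Product rule: joint p(theta, y) = p(theta) * p(y | theta)
    joint = {h: prior[h] * likelihood[h] for h in prior}
    # Sum rule applied over hypotheses: evidence p(y) = sum_theta p(theta, y)
    evidence = sum(joint.values())
    # Bayes' theorem: posterior = joint / evidence
    return {h: joint[h] / evidence for h in joint}


# Hypothetical prior: exfiltration is rare before any clue is seen.
prior = {"benign": 0.99, "exfiltration": 0.01}

# Hypothetical likelihoods p(clue | hypothesis) for each observed clue.
clues = [
    ("unusually large download", {"benign": 0.05, "exfiltration": 0.80}),
    ("activity on a weekend", {"benign": 0.10, "exfiltration": 0.60}),
]

belief = dict(prior)
history = [belief["exfiltration"]]
for name, likelihood in clues:
    # Yesterday's posterior becomes today's prior.
    belief = bayes_update(belief, likelihood)
    history.append(belief["exfiltration"])
    print(f"after '{name}': p(exfiltration) = {belief['exfiltration']:.3f}")
```

Each clue moves the belief, and the posterior after one clue serves as the prior for the next. With these made-up numbers the probability of exfiltration rises with each clue while staying well short of certainty.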

This PML approach is often preferable to deep learning in applications that require transparency and oversight. Although deep learning has produced impressive performance on many benchmark tasks in specific applications, such as computer vision and conversational AI (e.g., in the recent Google Duplex), it has several limitations in broader use cases such as cybersecurity and banking. Deep learning systems are generally:

• Very data hungry (i.e., they often require millions of examples for training)
• Very compute-intensive to train and deploy (i.e., they require cloud GPU & TPU resources)
• Poor at representing uncertainty
• Easily fooled by adversarial examples
• Finicky to optimize: choice of architecture, learning procedure, etc., require expert knowledge and experimentation
• Uninterpretable black-boxes, lacking in transparency and difficult to trust

In contrast, PML systems are transparent and explainable, and typically require less data and computing power.

Currently, it is easier than ever to get started building PML systems, thanks to a plethora of open-source software tools known as probabilistic programming languages. These include Google’s TensorFlow Probability, Uber’s Pyro, Microsoft’s Infer.NET, PyMC3, Stan and many others.

The following presentation contains a few of the topics that we discussed during the recent meetup. Materials from the meetup, including slides and source code, are provided below.

Daniel Emaasit is a Data Scientist at Haystax Technology. For a more detailed treatment of this subject, please see Daniel’s blog.

#### Source code

For interested readers, two options are provided below to access the source code used for the demo:

1. The entire project (code, notebooks, data, and results) can be found here on GitHub.

2. Click this icon to open the notebooks in a web browser and explore the entire project without downloading and installing any software.

#### References

1. Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553), 452.

2. Bishop, C. M. (2013). Model-based machine learning. Phil. Trans. R. Soc. A, 371(1984), 20120222.

3. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.

4. Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge University Press.

5. Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55.