Gaussian Processes with Spectral Mixture Kernels to Implicitly Capture Hidden Structure from Data

By Daniel Emaasit, March 20, 2018

For purposes of scientific discovery, the field of insider-threat detection often lacks sufficient amounts of time-series training data. Moreover, the limited data that are available are quite noisy. For instance, Greitzer and Ferryman (2013) state that ‘ground truth’ data on actual insider behavior is typically either not available or is limited. In some cases, one might acquire real data, but for privacy reasons there is no attribution of any individuals relating to abuses or offenses (i.e., there is no ground truth). The data may contain insider threats, but these are not identified or knowable to the researcher (Greitzer and Ferryman, 2013; Gheyas and Abdallah, 2016).

The Problem

Having limited and quite noisy data for insider-threat detection presents a major challenge when estimating time-series models that are robust to overfitting and have well-calibrated uncertainty estimates. Most of the current literature in time-series modeling for insider-threat detection has two major limitations. First, the methods involve visually inspecting the time series for noticeable structure and patterns such as periodicity, smoothness and growing or decreasing trends, and then hard-coding these patterns into the statistical models during formulation. This approach suits large datasets, where more data typically provides more information from which to learn expressive structure. Given limited amounts of data, such expressive structure may not be easily noticeable. For instance, the figure below shows monthly attachment sizes in emails sent by an insider from his employee account to his home account. Patterns such as periodicity, smoothness and growing or decreasing trends are not easily noticeable in this series.

 

Second, most of the current literature focuses on parametric models that impose strong restrictive assumptions by pre-specifying the functional form and number of parameters. Pre-specifying a functional form for a time-series model could lead either to overly complex model specifications or to simplistic models. It is difficult to know a priori the most appropriate function to use for modeling sophisticated insider-threat behavior that involves complex hidden patterns and many other influencing factors.

Haystax Technology conducted a research study to address these problems.

Data Science Questions

Given these limitations in the current state of the art, our study formulated three data science questions. Given limited and quite noisy time-series data for insider-threat detection, is it possible to perform:

  1. Pattern discovery without hard-coding trends into statistical models during formulation?

  2. Model estimation that precludes pre-specifying a functional form?

  3. Model estimation that is robust to overfitting and has well-calibrated uncertainty estimates?

Hypothesis

To answer these three questions and address the limitations described above, our study formulated the following hypothesis:

By leveraging current state-of-the-art innovations in nonparametric Bayesian methods, such as Gaussian processes with spectral mixture kernels, it is possible to perform pattern discovery without pre-specifying functional forms or hard-coding trends into statistical models.

Methodology

To test our hypothesis, a nonparametric Bayesian approach was proposed to implicitly capture hidden structure from time series having limited data. The proposed model, a Gaussian process with a spectral mixture kernel, precludes the need to pre-specify a functional form and hard-code trends, is robust to overfitting and has well-calibrated uncertainty estimates.

(Mathematical details of the proposed model formulation are described in a corresponding paper that can be found on arXiv through the following link: Emaasit, D. and Johnson, M. (2018). Capturing Structure Implicitly from Time-Series having Limited Data. arXiv preprint arXiv:1803.05867.)

A brief description of the fundamental concepts of the proposed methodology is as follows: Consider for each data point, $latex i$, that $latex y_i$ represents the attachment size in emails sent by the insider to his home account and $latex x_i$ is a temporal covariate, such as month. The task is to estimate a latent function, $latex f$, which maps input data $latex x_i$ to output data $latex y_i$ for $latex i = 1, 2, \ldots, N$, where $latex N$ is the total number of data points. Each input $latex x_i$ has a single dimension, $latex D = 1$, and $latex \textbf{X}$ is an $latex N \times D$ matrix with rows $latex x_i$.

The observations are assumed to satisfy:
\begin{equation}\label{eqn:additivenoise}
y_i = f(x_i) + \varepsilon, \quad \text{where} \quad \varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)
\end{equation}
The noise term, $latex \varepsilon$, is assumed to be normally distributed with a zero mean and variance, $latex \sigma_{\varepsilon}^2$. Latent function $latex f$ represents hidden underlying trends that produced the observed time-series data.
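
Stacking the $latex N$ observations into a vector $latex \textbf{y}$ makes the implied likelihood explicit. As a brief worked step, assuming the noise terms are independent across data points (the scalar model above suggests this but does not state it) and writing $latex \textbf{I}_N$ for the $latex N \times N$ identity matrix (a symbol introduced here for illustration):
\begin{equation}
p(\textbf{y} \mid \textbf{f}, \textbf{X}) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid f(x_i), \sigma_{\varepsilon}^2) = \mathcal{N}(\textbf{y} \mid \textbf{f}, \sigma_{\varepsilon}^2 \textbf{I}_N)
\end{equation}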

Our study proposed a prior distribution, $latex p(\textbf{f})$, over an infinite number of possible functions of interest given that it is difficult to know a priori the most appropriate functional form to use for $latex f$. A natural prior over an infinite space of functions is a Gaussian-process (GP) prior (Williams and Rasmussen, 2006). A GP is fully parameterized by a mean function, $latex \textbf{m}$, and covariance function, $latex \textbf{K}_{N,N}$, denoted as:
\begin{equation}\label{eqn:gpsim}
\textbf{f} \sim \mathcal{GP}(\textbf{m}, \textbf{K}_{N,N}).
\end{equation}
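
This post does not spell out the spectral mixture kernel itself, so the following is only a minimal sketch of how the covariance matrix $latex \textbf{K}_{N,N}$ could be built and used to draw functions from the prior. It assumes the one-dimensional spectral mixture form described by Wilson (2014), a zero mean function, and hand-picked hyperparameters; the function name, variable names and numeric values are all hypothetical and are not taken from the study.

```python
import numpy as np


def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """1-D spectral mixture kernel evaluated between two sets of inputs.

    k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)
    """
    tau = x1[:, None] - x2[None, :]  # pairwise differences, shape (n1, n2)
    k = np.zeros_like(tau, dtype=float)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return k


# Hypothetical inputs: 15 monthly time stamps and a two-component mixture.
x = np.arange(15, dtype=float)
weights = [1.0, 0.5]           # component weights (signal variances)
means = [0.0, 1.0 / 12.0]      # spectral means (frequencies, in cycles per month)
variances = [0.001, 0.0005]    # spectral variances (control how fast correlations decay)

K = spectral_mixture_kernel(x, x, weights, means, variances)
m = np.zeros(len(x))           # zero mean function (an assumption of this sketch)

# Draw three functions from the prior f ~ N(m, K).
# A small jitter on the diagonal keeps the Cholesky factorization numerically stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))
prior_samples = m[:, None] + L @ rng.standard_normal((len(x), 3))
print(prior_samples.shape)  # (15, 3)
```

Each column of `prior_samples` is one candidate function evaluated at the 15 inputs. The component with mean 0 behaves like a smooth trend, while the component with mean 1/12 favors a yearly cycle, which illustrates how a mixture can express several kinds of structure without any of them being hard-coded.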

The posterior distribution over the unknown function evaluations, $latex \textbf{f}$, at all data points, $latex x_i$, was estimated using Bayes' theorem, as follows:
\begin{equation}\label{eqn:bayesinfty}
\begin{aligned}
p(\textbf{f} \mid \textbf{y},\textbf{X}) &= \frac{p(\textbf{y} \mid \textbf{f}, \textbf{X}) \, p(\textbf{f})}{p(\textbf{y} \mid \textbf{X})} = \frac{p(\textbf{y} \mid \textbf{f}, \textbf{X}) \, \mathcal{N}(\textbf{f} \mid \textbf{m}, \textbf{K}_{N,N})}{p(\textbf{y} \mid \textbf{X})},
\end{aligned}
\end{equation}
where:

$latex p(\textbf{f}\mid \textbf{y},\textbf{X})$ = the posterior distribution of functions that best explain the email-attachment size, given the covariates
$latex p(\textbf{y} \mid \textbf{f}, \textbf{X})$ = the likelihood of email-attachment size, given the functions and covariates
$latex p(\textbf{f})$ = the prior over all possible functions of email-attachment size
$latex p(\textbf{y} \mid \textbf{X})$ = the marginal likelihood of the data (a normalizing constant that does not depend on $latex \textbf{f}$)

This posterior is itself a GP, that is, a distribution over the possible functions that best explain the time-series pattern.
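
With the Gaussian noise model above, this posterior has a well-known closed form (Williams and Rasmussen, 2006). The sketch below computes the posterior mean and covariance at held-out inputs using a Cholesky factorization. It is an illustration under stated assumptions rather than the study's implementation: the data are synthetic, the prior mean is zero, and the kernel hyperparameters and noise variance are hand-picked, whereas in practice they would normally be estimated from the data. All function and variable names are hypothetical.

```python
import numpy as np


def sm_kernel(x1, x2, w, mu, v):
    """1-D spectral mixture kernel; w, mu, v are length-Q arrays."""
    tau = (x1[:, None] - x2[None, :])[..., None]  # shape (n1, n2, 1)
    comps = w * np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return comps.sum(axis=-1)                     # shape (n1, n2)


def gp_posterior(x_tr, y_tr, x_te, kernel, noise_var):
    """Posterior mean and covariance of f at x_te under a zero-mean GP prior."""
    K = kernel(x_tr, x_tr) + noise_var * np.eye(len(x_tr))  # K_{N,N} + sigma^2 I
    K_s = kernel(x_tr, x_te)                                # train/test covariances
    K_ss = kernel(x_te, x_te)                               # test/test covariances
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))  # (K + sigma^2 I)^{-1} y
    V = np.linalg.solve(L, K_s)
    return K_s.T @ alpha, K_ss - V.T @ V


# Synthetic stand-in data: 15 monthly values, the first 10 used for training.
rng = np.random.default_rng(1)
x = np.arange(15, dtype=float)
y = np.sin(2 * np.pi * x / 12) + 0.1 * rng.standard_normal(15)
x_tr, y_tr, x_te = x[:10], y[:10], x[10:]

# Illustrative, hand-picked hyperparameters (not fitted values from the study).
w, mu, v = np.array([1.0]), np.array([1 / 12]), np.array([0.001])
noise_var = 0.01

mean, cov = gp_posterior(
    x_tr, y_tr, x_te, lambda a, b: sm_kernel(a, b, w, mu, v), noise_var
)

# 95% predictive interval for new observations: add the noise variance back in.
std = np.sqrt(np.clip(np.diag(cov), 0.0, None) + noise_var)
lower, upper = mean - 1.96 * std, mean + 1.96 * std
print(np.column_stack([lower, mean, upper]).round(2))
```

The printed rows give the lower bound, posterior mean and upper bound for each held-out month. Solving against the Cholesky factor instead of inverting the matrix directly is a common choice for numerical stability.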

Experiments

Raw data and sample formation

The insider-threat data used for empirical analysis in this study were provided by the Computer Emergency Response Team (CERT) division of the Software Engineering Institute at Carnegie Mellon University. The particular case used is a known insider who sent information as email attachments from his work email to his home email. The PyData software stack, including packages such as pandas, numpy, matplotlib and seaborn, was used for data manipulation and visualization. The figure below shows that email attachment sizes increased drastically in March and April 2011.

Empirical analysis

In the figure below, the first 10 data points (shown in black) were used for training and the rest (in blue) for testing. The figure also shows that the GP model with a spectral mixture kernel is able to capture the structure implicitly in both the training and testing regions. The 95% predicted credible interval contains the ‘normal’ size of email attachments for the duration of the measurements. The GP model was also able to detect both of the anomalous data points, shown in red, that fall outside of the 95% predicted credible interval.
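
The detection rule described here, flagging observations that fall outside the 95% interval, can be sketched in a few lines. The example assumes a Gaussian predictive distribution summarized by a mean and standard deviation per time step; the function name and all numbers are hypothetical and are not the study's data.

```python
import numpy as np


def flag_anomalies(y_obs, pred_mean, pred_std, z=1.96):
    """Return a boolean mask of observations outside mean +/- z * std."""
    lower = pred_mean - z * pred_std
    upper = pred_mean + z * pred_std
    return (y_obs < lower) | (y_obs > upper)


# Made-up attachment sizes and predictive summaries for six months.
y_obs = np.array([1.0, 1.2, 0.9, 6.5, 7.1, 1.1])
pred_mean = np.array([1.0, 1.1, 1.0, 1.1, 1.0, 1.0])
pred_std = np.full(6, 0.5)

flags = flag_anomalies(y_obs, pred_mean, pred_std)
print(np.flatnonzero(flags).tolist())  # [3, 4]
```

The multiplier 1.96 corresponds to a two-sided 95% interval under a Gaussian assumption. A narrower interval flags more points, so the well-calibrated uncertainty discussed earlier matters directly for how many anomalies are reported.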

 

For comparison, an ARIMA model was estimated using the statsmodels Python package. The figure below shows that the ARIMA model is poor at capturing the structure within the region of testing data. This finding suggests that ARIMA models perform poorly on small datasets without noticeable structure. The 95% confidence interval for the ARIMA model is much wider than the GP model's credible interval, indicating a high degree of uncertainty about the ARIMA predictions. The ARIMA model is able to detect only one anomalous data point in April 2011, missing the earlier anomaly in March 2011.

 

It’s important to note that the machine-learning approach described above is able to predict anomalous data points such as unusually large email attachment sizes, but it does not tell us why this behavior happened. Knowledge about causality is often locked in the brains of domain experts (adjudicators, threat experts, psychologists, HR professionals, etc.) who understand the behavior of humans and the leading indicators of insider threat activity. Data scientists need to capture this domain knowledge and combine it with machine-learned indicators in order to understand this behavior.

At Haystax, we go about this critical step by capturing expertise in a probabilistic (i.e., Bayesian) model and feeding it many machine-learned indicators to understand and predict the risk score of an insider. In the next blog post, we will demonstrate this approach by using machine learning to extract more anomalous events related to this user and feed them into a probabilistic model of their risk score.

Daniel Emaasit is a Data Scientist at Haystax Technology. For a more detailed treatment of this study, please see Daniel’s blog.

Source code

For interested readers, two options are provided below to access the source code used for empirical analyses:

  1. The entire project (code, notebooks, data and results) can be found here on GitHub.

  2. The notebooks can be opened in a web browser through Binder to explore the entire project without downloading and installing any software.

References

  1. Emaasit, D. and Johnson, M. (2018). Capturing Structure Implicitly from Time-Series having Limited Data. arXiv preprint arXiv:1803.05867.

  2. Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning. The MIT Press.

  3. Knudde, N., van der Herten, J., Dhaene, T., and Couckuyt, I. (2017). GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint arXiv:1711.03845.

  4. Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. University of Cambridge.

  5. Greitzer, F. L. and Ferryman, T. A. (2013). Methods and metrics for evaluating analytic insider threat tools. Security and Privacy Workshops (SPW), 2013 IEEE, pages 90–97. IEEE.

  6. Gheyas, I. A. and Abdallah, A. E. (2016). Detection and prediction of insider threats to cybersecurity: a systematic literature review and meta-analysis. Big Data Analytics, 1(1):6.