For purposes of scientific discovery, the field of insider-threat detection often lacks sufficient amounts of time-series training data. Moreover, the limited data that are available are quite noisy. For instance, Greitzer and Ferryman (2013) state that ‘ground truth’ data on actual insider behavior are typically either unavailable or limited. In some cases, one might acquire real data, but for privacy reasons no individuals are attributed to abuses or offenses (i.e., there is no ground truth). The data may contain insider threats, but these are not identified or knowable to the researcher (Greitzer and Ferryman, 2013; Gheyas and Abdallah, 2016).
The Problem
Having limited and quite noisy data for insider-threat detection presents a major challenge when estimating time-series models that are robust to overfitting and have well-calibrated uncertainty estimates. Most of the current literature on time-series modeling for insider-threat detection has two major limitations. First, the methods involve visually inspecting the time series for noticeable structure and patterns, such as periodicity, smoothness and growing/decreasing trends, and then hard-coding these patterns into the statistical models during formulation. This approach is suitable for large datasets, where more data typically provides more information from which to learn expressive structure. Given limited amounts of data, such expressive structure may not be easily noticeable. For instance, the figure below shows monthly attachment sizes in emails sent by an insider from his employee account to his home account. Patterns such as periodicity, smoothness and growing/decreasing trends are not easily noticeable.
Second, most of the current literature focuses on parametric models that impose strong restrictive assumptions by pre-specifying the functional form and number of parameters. Pre-specifying a functional form for a time-series model could lead either to overly complex model specifications or to overly simplistic models. It is difficult to know a priori the most appropriate function to use for modeling sophisticated insider-threat behavior that involves complex hidden patterns and many other influencing factors.
Haystax Technology conducted a research study to address these problems.
Data Science Questions
Given the above limitations in the current state of the art, our study formulated the following three data science questions. Given limited and quite noisy time-series data for insider-threat detection, is it possible to perform:
1. Pattern discovery without hard-coding trends into statistical models during formulation?
2. Model estimation that precludes pre-specifying a functional form?
3. Model estimation that is robust to overfitting and has well-calibrated uncertainty estimates?
Hypothesis
To answer these three questions and address the limitations described above, our study formulated the following hypothesis:
By leveraging current state-of-the-art innovations in nonparametric Bayesian methods, such as Gaussian processes with spectral mixture kernels, it is possible to perform pattern discovery without pre-specifying functional forms and hard-coding trends into statistical models.
Methodology
To test our hypothesis, a nonparametric Bayesian approach was proposed to implicitly capture hidden structure from time series with limited data. The proposed model, a Gaussian process with a spectral mixture kernel, removes the need to pre-specify a functional form and hard-code trends, is robust to overfitting and has well-calibrated uncertainty estimates.
(Mathematical details of the proposed model formulation are described in a corresponding paper that can be found on arXiv through the following link: Emaasit, D. and Johnson, M. (2018). Capturing Structure Implicitly from Time-Series having Limited Data. arXiv preprint arXiv:1803.05867.)
A brief description of the fundamental concepts of the proposed methodology is as follows. Consider for each data point, $latex i$, that $latex y_i$ represents the attachment size in emails sent by the insider to his home account and $latex x_i$ is a temporal covariate, such as month. The task is to estimate a latent function, $latex f$, which maps input data $latex x_i$ to output data $latex y_i$ for $latex i = 1, 2, \ldots, N$, where $latex N$ is the total number of data points. Each input $latex x_i$ has a single dimension, $latex D = 1$, and $latex \textbf{X}$ is an $latex N \times D$ matrix with rows $latex x_i$.
The observations are assumed to satisfy:
\begin{equation}\label{eqn:additivenoise}
y_i = f(x_i) + \varepsilon, \quad \text{where} \quad \varepsilon \sim \mathcal{N}(0, \sigma_{\varepsilon}^2)
\end{equation}
The noise term, $latex \varepsilon$, is assumed to be normally distributed with zero mean and variance $latex \sigma_{\varepsilon}^2$. The latent function $latex f$ represents the hidden underlying trends that produced the observed time-series data.
Because it is difficult to know a priori the most appropriate functional form to use for $latex f$, our study placed a prior distribution, $latex p(\textbf{f})$, over an infinite space of possible functions of interest. A natural prior over an infinite space of functions is a Gaussian-process (GP) prior (Williams and Rasmussen, 2006). A GP is fully specified by a mean function, $latex \textbf{m}$, and a covariance function, $latex \textbf{K}_{N,N}$, denoted as:
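As a minimal sketch of the additive-noise assumption, the snippet below simulates observations from a made-up latent function. The sinusoidal `latent_f`, the value of `sigma_eps` and the 24-month index are illustrative assumptions, not quantities from the study; in practice $latex f$ is unknown and is what the model has to learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly index x_i for i = 1..N, with a single input dimension (D = 1).
N = 24
x = np.arange(1, N + 1, dtype=float)
X = x.reshape(N, 1)  # the N x D input matrix


def latent_f(x):
    """Illustrative latent trend (assumed); the real f is unknown."""
    return 2.0 + np.sin(2.0 * np.pi * x / 12.0)


sigma_eps = 0.3  # assumed noise standard deviation
eps = rng.normal(loc=0.0, scale=sigma_eps, size=N)  # eps ~ N(0, sigma_eps^2)
y = latent_f(x) + eps  # y_i = f(x_i) + eps
```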
\begin{equation}\label{eqn:gpsim}
\textbf{f} \sim \mathcal{GP}(\textbf{m}, \textbf{K}_{N,N}).
\end{equation}
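To make the prior concrete, here is a rough sketch of a one-dimensional spectral mixture kernel in the form given by Wilson (2014), $latex k(\tau) = \sum_q w_q \exp(-2\pi^2 \tau^2 v_q) \cos(2\pi \tau \mu_q)$, used to draw a few functions from a zero-mean GP prior. The function name and the hyperparameter values (`weights`, `means`, `variances`) are assumptions chosen for illustration; they are not the values estimated in the study.

```python
import numpy as np


def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """1-D spectral mixture kernel: a weighted sum of Gaussian-damped cosines."""
    tau = x1[:, None] - x2[None, :]  # pairwise differences x - x'
    K = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        K += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return K


x = np.arange(1.0, 25.0)  # hypothetical monthly inputs, 1..24

# Assumed hyperparameters: two spectral components.
weights = [1.0, 0.5]
means = [1.0 / 12.0, 1.0 / 3.0]  # component frequencies (cycles per month)
variances = [0.001, 0.01]        # spectral variances

K = spectral_mixture_kernel(x, x, weights, means, variances)
m = np.zeros(len(x))  # zero mean function

# Draw three functions from the prior; a small jitter keeps K numerically stable.
rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(m, K + 1e-6 * np.eye(len(x)), size=3)
```

Each row of `f_samples` is one plausible function under the prior, evaluated at the 24 inputs. Changing the component frequencies or variances changes which periodic and smooth patterns the prior favors.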
The posterior distribution over the unknown function evaluations, $latex \textbf{f}$, at all data points, $latex x_i$, was estimated using Bayes' theorem, as follows:
\begin{equation}\label{eqn:bayesinfty}
\begin{aligned}
p(\textbf{f} \mid \textbf{y},\textbf{X}) &= \frac{p(\textbf{y} \mid \textbf{f}, \textbf{X}) \, p(\textbf{f})}{p(\textbf{y} \mid \textbf{X})} = \frac{p(\textbf{y} \mid \textbf{f}, \textbf{X}) \, \mathcal{N}(\textbf{f} \mid \textbf{m}, \textbf{K}_{N,N})}{p(\textbf{y} \mid \textbf{X})},
\end{aligned}
\end{equation}
where:
$latex p(\textbf{f} \mid \textbf{y},\textbf{X})$ = the posterior distribution of functions that best explain the email-attachment size, given the covariates
$latex p(\textbf{y} \mid \textbf{f}, \textbf{X})$ = the likelihood of email-attachment size, given the functions and covariates
$latex p(\textbf{f})$ = the prior over all possible functions of email-attachment size
$latex p(\textbf{y} \mid \textbf{X})$ = the marginal likelihood of the data (a normalizing constant)
This posterior is a GP composed of a distribution of possible functions that best explain the time-series pattern.
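With a Gaussian likelihood, this posterior has a closed form, and the standard GP regression equations (Williams and Rasmussen, 2006) give the predictive mean and covariance at new inputs. The sketch below assumes a zero mean function and, to stay short, swaps in a squared-exponential kernel with fixed hyperparameters as a stand-in for the spectral mixture kernel. The function names, the toy data and the hyperparameter values are all assumptions for illustration.

```python
import numpy as np


def sq_exp_kernel(x1, x2, lengthscale=3.0, variance=1.0):
    """Squared-exponential kernel (a stand-in, not the study's kernel)."""
    tau = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (tau / lengthscale) ** 2)


def gp_posterior(x_train, y_train, x_test, kernel, noise_var):
    """Posterior mean and covariance of f at x_test for a zero-mean GP."""
    K = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = kernel(x_train, x_test)
    K_ss = kernel(x_test, x_test)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v  # K_ss - K_s^T K^{-1} K_s
    return mean, cov


# Toy data (assumed): 10 training months, 4 held-out months.
rng = np.random.default_rng(1)
x_train = np.arange(1.0, 11.0)
y_train = np.sin(x_train / 2.0) + rng.normal(0.0, 0.1, size=10)
x_test = np.arange(11.0, 15.0)

mean, cov = gp_posterior(x_train, y_train, x_test, sq_exp_kernel, noise_var=0.01)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # pointwise posterior std of f
```

In a full analysis the kernel hyperparameters and noise variance would be learned from the data, for example by maximizing the marginal likelihood, instead of being fixed by hand as they are here.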
Experiments
Raw data and sample formation
The insider-threat data used for empirical analysis in this study were provided by the Computer Emergency Response Team (CERT) division of the Software Engineering Institute at Carnegie Mellon University. The particular case used is a known insider who sent information as email attachments from his work email to his home email. The pydata software stack, including packages such as pandas, numpy, matplotlib, seaborn and others, was used for data manipulation and visualization. The figure below shows that email attachment sizes increased drastically in March and April 2011.
Empirical analysis
In the figure below, the first 10 data points (shown in black) were used for training and the rest (in blue) for testing. The figure also shows that the GP model with a spectral mixture kernel is able to capture the structure implicitly in both the training and testing regions of the data. The 95% predicted credible interval contains the ‘normal’ size of email attachments for the duration of the measurements. The GP model was also able to detect both of the anomalous data points, shown in red, that fall outside of the 95% predicted credible interval.
For comparison, an ARIMA model was estimated using the statsmodels Python package. The figure below shows that the ARIMA model is poor at capturing the structure within the region of testing data. This finding suggests that ARIMA models perform poorly on small datasets without noticeable structure. The 95% confidence interval for ARIMA is much wider than that of the GP model, showing a high degree of uncertainty about the ARIMA predictions. The ARIMA model is able to detect only one anomalous data point, in April 2011, missing the earlier anomaly in March 2011.
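The detection rule described here, flagging observations that fall outside the 95% predicted interval, can be sketched in a few lines. The helper name `flag_anomalies` and all of the numbers below are made up for illustration; they are not the CERT data or the study's predictions. The sketch also assumes that `pred_std` is the predictive standard deviation for observations, that is, with the noise variance included.

```python
import numpy as np


def flag_anomalies(y_obs, pred_mean, pred_std, z=1.96):
    """Return a boolean mask of points outside mean +/- z * std (about 95% for z = 1.96)."""
    lower = pred_mean - z * pred_std
    upper = pred_mean + z * pred_std
    return (y_obs < lower) | (y_obs > upper)


# Made-up monthly attachment sizes and model predictions (arbitrary units).
y_obs = np.array([1.0, 1.2, 0.9, 5.0, 6.5, 1.1])
pred_mean = np.array([1.0, 1.1, 1.0, 1.1, 1.0, 1.0])
pred_std = np.full(6, 0.3)

mask = flag_anomalies(y_obs, pred_mean, pred_std)
anomalous_months = np.flatnonzero(mask)  # indices of the flagged observations
```

In this toy example the two large values at indices 3 and 4 are flagged, and the rest fall inside the interval.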
It’s important to note that the machine-learning approach described above is able to detect anomalous data points, such as unusually large email attachment sizes, but it does not tell us why this behavior happened. Knowledge about causality is often locked in the brains of domain experts (adjudicators, threat experts, psychologists, HR professionals, etc.) who understand the behavior of humans and the leading indicators of insider-threat activity. Data scientists need to capture this domain knowledge and combine it with machine-learned indicators in order to understand this behavior.
At Haystax, we go about this critical step by capturing expertise in a probabilistic (i.e., Bayesian) model and feeding it many machine-learned indicators to understand and predict the risk score of an insider. In the next blog post, we will demonstrate this approach by using machine learning to extract more anomalous events related to this user and feed them into a probabilistic model of their risk score.
Daniel Emaasit is a Data Scientist at Haystax Technology. For a more detailed treatment of this study, please see Daniel’s blog.
Source code
For interested readers, two options are provided below to access the source code used for empirical analyses:
1. The entire project (code, notebooks, data and results) can be found here on GitHub.
2. Click this icon to open the notebooks in a web browser and explore the entire project without downloading and installing any software.
References

Emaasit, D. and Johnson, M. (2018). Capturing Structure Implicitly from Noisy Time-Series having Limited Data. arXiv preprint arXiv:1803.05867.

Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning. The MIT Press, 2(3):4.

Knudde, N., van der Herten, J., Dhaene, T., & Couckuyt, I. (2017). GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint arXiv:1711.03845.

Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. University of Cambridge.

Greitzer, F. L. and Ferryman, T. A. (2013). Methods and metrics for evaluating analytic insider threat tools. Security and Privacy Workshops (SPW), 2013 IEEE, pages 90–97. IEEE.

Gheyas, I. A. and Abdallah, A. E. (2016). Detection and prediction of insider threats to cybersecurity: a systematic literature review and meta-analysis. Big Data Analytics, 1(1):6.