Bayesian Networks are among the more sophisticated capabilities applied to security analytics, relying as they do on decision science and statistics rather than software engineering expertise. But in many implementations involving Bayesian Networks the effort seems to focus on backrooms full of PhDs doing the hard math, rather than on tools that make Bayesian Networks accessible to subject-matter experts (SMEs) who are most likely not mathematicians. The resulting ‘black box’ solutions can be hard to comprehend, and often lack a suitable platform and user interface for easy access and operational use by the people who could most benefit from them.
Bayesian Networks, or BNs for short, provide a capability to efficiently represent probabilistic knowledge about a complex problem domain, and then to use that knowledge to reason intelligently in that domain under conditions of uncertain, incomplete or even contradictory data.
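To make reasoning under uncertain, incomplete or contradictory data concrete, here is a minimal sketch of inference in a tiny BN with one hypothesis and two indicators. The variable names and every probability below are invented for illustration; they are assumptions, not values from any real model. The sketch also assumes the indicators are conditionally independent given the hypothesis.

```python
# Hypothetical network: threat -> badge_anomaly, threat -> policy_violation.
# All numbers are illustrative assumptions.
PRIOR = 0.01  # P(threat) before any evidence

# indicator -> (P(indicator true | threat), P(indicator true | no threat))
CPT = {
    "badge_anomaly": (0.60, 0.05),
    "policy_violation": (0.40, 0.02),
}

def posterior(evidence):
    """Return P(threat | evidence), where evidence maps indicator -> bool.

    Indicators that were not observed are simply left out of `evidence`;
    in this structure that is equivalent to marginalizing them out.
    """
    weight_threat, weight_benign = PRIOR, 1.0 - PRIOR
    for name, observed in evidence.items():
        p_threat, p_benign = CPT[name]
        weight_threat *= p_threat if observed else 1.0 - p_threat
        weight_benign *= p_benign if observed else 1.0 - p_benign
    return weight_threat / (weight_threat + weight_benign)

no_data = posterior({})                                # incomplete: nothing observed
one_for = posterior({"badge_anomaly": True})           # one piece of evidence for
mixed = posterior({"badge_anomaly": True,
                   "policy_violation": False})         # contradictory evidence
```

With no evidence the result is just the prior. Evidence for the hypothesis raises the belief, and a conflicting observation pulls it back down without discarding the first observation.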
A major challenge in employing a BN is in developing the probabilistic knowledge that defines the BN for that problem. One way to develop that knowledge is to learn the BN’s structure or parameters from data. This is feasible only when there is a suitable labeled dataset available. When there is no data available from which to learn, the knowledge in a BN must be derived from domain knowledge. Typically, modeling experts do not have the required domain knowledge and thus must extract that information from an organization’s internal experts, or from policy documents, knowledge bases and/or other sources. This knowledge elicitation is a time-consuming and expensive process, tying up domain experts and delaying implementation.
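When a labeled dataset does exist, learning a BN's parameters can be as simple as counting. The sketch below estimates one conditional probability table from labeled records with add-alpha smoothing. The record fields (`threat`, `badge_anomaly`) and the function name `learn_cpt` are hypothetical, and this is a generic textbook technique, not a description of any particular product.

```python
from collections import Counter

def learn_cpt(records, indicator, label="threat", alpha=1.0):
    """Estimate P(indicator=True | label) for label in {True, False}.

    `records` is a list of dicts with boolean fields. `alpha` is an
    add-alpha (Laplace) smoothing constant so that rare combinations
    do not produce probabilities of exactly 0 or 1.
    """
    counts = Counter((r[label], r[indicator]) for r in records)
    cpt = {}
    for label_value in (True, False):
        n_true = counts[(label_value, True)]
        n_total = n_true + counts[(label_value, False)]
        cpt[label_value] = (n_true + alpha) / (n_total + 2 * alpha)
    return cpt

# Tiny illustrative dataset: 3 labeled threats, 5 labeled non-threats.
records = (
    [{"threat": True, "badge_anomaly": True}] * 2
    + [{"threat": True, "badge_anomaly": False}] * 1
    + [{"threat": False, "badge_anomaly": True}] * 1
    + [{"threat": False, "badge_anomaly": False}] * 4
)
cpt = learn_cpt(records, "badge_anomaly")
```

Without labeled records like these there is nothing to count, which is why the remaining option is to elicit the numbers from domain knowledge.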
At Haystax, we have developed a knowledge representation approach that dramatically simplifies the knowledge elicitation process for a common class of BNs, and allows SMEs to express domain knowledge in a natural way that does not require extensive interaction with Bayesian modeling experts.
Our approach has three steps: identify the important concepts in the problem domain, specify the qualitative relationships between those concepts, and use software to automatically assemble that qualitative knowledge into a quantitative BN. The important concepts become binary random variables in the BN. The most common relationship between concepts is indication, that is, ‘A is an indicator for (or against) B’, or alternatively, ‘A is evidence for (or against) B’. A concept can be an indicator for more than one indicated concept. Other relationships include summary (‘A is a summary of B1, B2, … Bn’), mitigation (‘C mitigates the influence of A on B’) and relevance (‘D makes A relevant to B’). For each of these relationships, our approach uses a qualitative specification of the strength of the indication (or evidence) and of its polarity: positive (evidence for) or negative (evidence against).
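One way to picture the step from qualitative statements to numbers is sketched below. This is not the Haystax assembly software. The strength labels, the likelihood-ratio table `STRENGTH_TO_LR`, the `BASE_RATE` constant and all concept names are assumptions made up for this example, and only the indication relationship is covered.

```python
from dataclasses import dataclass

# Assumed mapping from a qualitative strength to a likelihood ratio.
STRENGTH_TO_LR = {"weak": 2.0, "moderate": 4.0, "strong": 8.0}
# Assumed probability of seeing the indicator on the "less likely" side.
BASE_RATE = 0.1

@dataclass(frozen=True)
class Indication:
    indicator: str   # concept A
    indicated: str   # concept B
    strength: str    # "weak", "moderate" or "strong"
    polarity: str    # "positive" (evidence for) or "negative" (evidence against)

def to_cpt(ind):
    """Return (P(A true | B true), P(A true | B false)) for one statement."""
    high = BASE_RATE * STRENGTH_TO_LR[ind.strength]
    if ind.polarity == "positive":
        return (high, BASE_RATE)
    return (BASE_RATE, high)  # negative polarity: A is likelier when B is false

def assemble(indications):
    """Build {indicated concept: {indicator: CPT row}} from the statements."""
    network = {}
    for ind in indications:
        network.setdefault(ind.indicated, {})[ind.indicator] = to_cpt(ind)
    return network

statements = [
    Indication("badge_anomaly", "insider_threat", "strong", "positive"),
    Indication("long_tenure", "insider_threat", "weak", "negative"),
]
network = assemble(statements)
```

The point of the sketch is the division of labor: the SME supplies only the statements in `statements`, and the numeric tables are derived mechanically.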
The final step in exploiting a BN is a software architecture that supports reasoning with analytical models: extracting evidence from data streams and data sources, applying that evidence appropriately to the BN, performing inference in the BN and then presenting relevant inference results to a user.
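The four stages named above (extract evidence, apply it, infer, present) can be sketched as a small streaming loop. Everything here is hypothetical: the event types, the likelihood ratios, the alert threshold and the function names are assumptions for illustration, and inference is reduced to a log-odds update that treats indicators as conditionally independent given the hypothesis.

```python
import math

PRIOR = 0.01  # assumed prior probability that a user is a threat
# Assumed likelihood ratios P(event | threat) / P(event | no threat).
LIKELIHOOD_RATIO = {
    "after_hours_login": 6.0,
    "bulk_download": 12.0,
    "completed_training": 0.5,  # ratio below 1 means evidence against
}
ALERT_THRESHOLD = 0.30

def extract_evidence(event):
    """Map a raw event to an indicator name, or None if it is irrelevant."""
    kind = event.get("type")
    return kind if kind in LIKELIHOOD_RATIO else None

def run(events):
    """Consume events in order and yield (user, posterior) alerts."""
    applied = {}  # user -> set of indicators already applied
    for event in events:
        indicator = extract_evidence(event)           # 1. extract evidence
        if indicator is None:
            continue
        user = event["user"]
        applied.setdefault(user, set()).add(indicator)  # 2. apply evidence
        log_odds = math.log(PRIOR / (1.0 - PRIOR)) + sum(
            math.log(LIKELIHOOD_RATIO[i]) for i in applied[user]
        )
        posterior = 1.0 / (1.0 + math.exp(-log_odds))   # 3. infer
        if posterior >= ALERT_THRESHOLD:                # 4. present
            yield user, posterior

events = [
    {"user": "alice", "type": "after_hours_login"},
    {"user": "bob", "type": "after_hours_login"},
    {"user": "bob", "type": "page_view"},
    {"user": "alice", "type": "bulk_download"},
]
alerts = list(run(events))
```

In this run only the second relevant event for `alice` crosses the threshold; a single after-hours login is not enough on its own.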
The Haystax Analytics Platform™ was built to run Bayesian Networks at scale, ingesting, analyzing and processing large amounts of structured and unstructured data – both streaming and in batches – from a wide variety of data sources. Intuitive interfaces and intelligent alerting ensure the results reach the people who need the information, when they need it.
This approach to defining the domain knowledge for a BN has been used to shortcut the potentially tedious knowledge elicitation process and rapidly build successful BN models for several challenging problem domains. While it is not suitable for every domain, we have already used it in a wide range of applications, including identifying insider threats, banking fraud and threatening behavior in social networks.
Note: This is the second article of a five-part series. The first article, Three Security Analytics Approaches That (Mostly) Don’t Work, can be found here. Future articles will assess the other analytic approaches mentioned in the first post.
Ed Wright, Ph.D., is a Senior Scientist at Haystax Technology.