At Haystax, we’re as passionate about data science as we are about software development. Data science underpins the security analytics software we build. It provides the theoretical and mathematical foundations for our probabilistic model-based approach, as well as the machine-learning algorithms and other artificial intelligence techniques that we fuse with our models, to help organizations pinpoint and respond to their most serious threats.

This collection of research papers reveals some of the data science behind what we call our Carbon model of whole-person behavior, a Bayesian inference network at the heart of the Haystax security analytics platform. We have deployed the platform in support of operational missions such as insider threat analysis, continuous monitoring of cleared personnel and cyber fraud detection. The papers were written and published over a period of three years, starting in 2014. Four of the five were peer reviewed and presented at leading conferences around the U.S., and one garnered an award when it was presented.

We hope you find the material in this collection informative and useful, even inspiring, in your own work. The data scientists and software engineers who contributed to these papers have certainly inspired us as we have transformed their knowledge and expertise into ever-evolving operational solutions that address complex real-world problems for the individuals responsible for protecting the security of our nation and the safety of its people.

Read the following papers from Haystax data scientists:

Target Beliefs for SME-Oriented, Bayesian Network-Based Modeling

Authors: Robert Schrag, Edward Wright, Robert Kerr, and Robert Johnson

Abstract: Our framework supporting non-technical subject matter experts’ authoring of useful Bayesian networks has presented requirements for fixed probability soft or virtual evidence findings that we refer to as target beliefs. We describe exogenously motivated target belief requirements for model nodes lacking explicit priors and mechanistically motivated requirements induced by logical constraints over nodes that in the framework are strictly binary. Compared to the best published results, our target belief satisfaction methods are competitive in result quality and processing time on much larger problems.


Automating the Construction of Indicator-Hypothesis Bayesian Networks from Qualitative Specifications

Authors: Edward Wright, Robert Schrag, Robert Kerr, and Bryan Ware

Abstract: We encode qualitative knowledge for a class of probabilistic reasoning problems as a network of related hypotheses and indicators. One specific application domain concerns reasoning about risk or threat in order to raise an alert or warning. The network’s nodes are propositions, either deterministic summary propositions or indicator-hypothesis propositions. Edges connecting these nodes are influences bearing positive or negative polarity and greater or lesser strength. Given such a qualitative specification, we automatically construct a Bayesian network including quantitative conditional probability tables. We initially developed this methodology and software tools to capture qualitative probabilistic knowledge implicit in the official policy documents associated with a specific domain and have since generalized it to allow subject matter experts or domain analysts to readily address other similar domains. We describe our qualitative representation and the steps we take to automatically construct a corresponding quantitative Bayesian network from it.
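
As a rough illustration of the idea in this abstract, the sketch below turns a qualitative specification (influences with a polarity and a strength) into a quantitative conditional probability table for one binary node. It is not the method from the paper: the additive log-odds combination, the `STRENGTH_WEIGHT` values, and the names `build_cpt` and `logistic` are all assumptions made for this example.

```python
from itertools import product
from math import exp, log

# Hypothetical mapping from qualitative strength labels to log-odds shifts.
# These numbers are illustrative assumptions, not values from the paper.
STRENGTH_WEIGHT = {"weak": 0.5, "moderate": 1.0, "strong": 2.0}


def logistic(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + exp(-x))


def build_cpt(prior, influences):
    """Build P(child=True | parents) for every combination of parent states.

    prior      -- P(child=True) when no parent is true (0 < prior < 1)
    influences -- list of (parent_name, polarity, strength), where polarity
                  is +1 or -1 and strength is a key of STRENGTH_WEIGHT
    Returns a dict keyed by a tuple of parent truth values, in the same
    order as `influences`.
    """
    base = log(prior / (1.0 - prior))
    cpt = {}
    for states in product([False, True], repeat=len(influences)):
        # Each true parent shifts the child's log-odds up or down.
        shift = sum(
            polarity * STRENGTH_WEIGHT[strength]
            for is_true, (_, polarity, strength) in zip(states, influences)
            if is_true
        )
        cpt[states] = logistic(base + shift)
    return cpt


if __name__ == "__main__":
    spec = [("IndicatorA", +1, "strong"), ("IndicatorB", -1, "weak")]
    for states, p in build_cpt(0.1, spec).items():
        print(states, round(p, 3))
```

A log-odds combination is used here only because it keeps every table entry a valid probability for any mix of positive and negative influences; other combination rules are equally plausible.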

Processing Events in Probabilistic Risk Assessment

Authors: Robert Schrag, Edward Wright, Robert Kerr, and Bryan Ware

Abstract: Assessing entity (e.g., person) risk from entity-related events requires appropriate techniques to address the relevance of events (individually and/or in aggregate) relative to a prevailing temporal frame of reference—for continuous risk monitoring, a running time point representing “the present.” We describe two classes of temporal relevance techniques we have used towards insider threat detection in probabilistic risk models based on Bayesian networks. One class of techniques is appropriate when a generic person Bayesian network is extended with a new random variable for each relevant event—practical when events of concern are infrequent and we expect their number per person to be small (as in public records monitoring). Another class is needed when (as in computer network event monitoring) we expect too many relevant events to create a new random variable for each event. We present a use case employing both classes of techniques and discuss their relative strengths and weaknesses. Finally, we describe the semantic technology framework supporting this work.
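
One simple way to picture temporal relevance is to let an event's weight decay with its age relative to "the present," and to fold many events into a single summary value instead of one variable per event. The sketch below does this with an exponential half-life and a noisy-OR style aggregate. Both choices, along with the names `relevance`, `aggregate_relevance`, and the `per_event_strength` parameter, are assumptions for illustration and are not taken from the paper.

```python
from datetime import date


def relevance(event_date, now, half_life_days):
    """Weight in [0, 1] that halves every `half_life_days` of event age."""
    age_days = (now - event_date).days
    if age_days < 0:
        return 0.0  # events dated after the reference time are ignored
    return 0.5 ** (age_days / half_life_days)


def aggregate_relevance(event_dates, now, half_life_days, per_event_strength=0.2):
    """Combine many events into one value using a noisy-OR style product.

    per_event_strength is a hypothetical cap on how much a single,
    fully relevant event can contribute.
    """
    remaining = 1.0
    for event_date in event_dates:
        remaining *= 1.0 - per_event_strength * relevance(
            event_date, now, half_life_days
        )
    return 1.0 - remaining


if __name__ == "__main__":
    today = date(2016, 1, 1)
    events = [date(2015, 1, 1), date(2015, 12, 1), date(2015, 12, 25)]
    print(aggregate_relevance(events, today, half_life_days=365))
```

In this sketch the aggregate grows with each additional recent event but never exceeds 1, which is one way to summarize high-volume event streams in a single node.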

Probabilistic Argument Maps for Intelligence Analysis: Capabilities Underway

Authors: Robert Schrag, Edward Wright, Robert Kerr, Robert Johnson, Bryan Ware, Joan McIntyre, Melonie Richey, Kathryn Laskey, and Robert Hoffman

Abstract: We describe enhancements underway to our probabilistic argument mapping framework called FUSION. Exploratory modeling in the domain of intelligence analysis has highlighted requirements for additional knowledge representation and reasoning capabilities, particularly regarding argument map nodes that are specified as propositional logic functions of other nodes. We also describe more flexible specifications for link strengths and node prior probabilities. We expect these enhancements to find general applicability across problem domains.

Probabilistic Argument Maps for Intelligence Analysis: Completed Capabilities

Authors: Robert Schrag, Joan McIntyre, Melonie Richey, Kathryn Laskey, Edward Wright, Robert Kerr, Robert Johnson, Bryan Ware, and Robert Hoffman

Abstract: Intelligence analysts are tasked to produce well-reasoned, transparent arguments with justified likelihood assessments for plausible outcomes regarding past, present, or future situations. Traditional argument maps help to structure reasoning but afford no computational support for probabilistic judgments. We automatically generate Bayesian networks from argument map specifications to compute probabilities for every argument map node. Resulting analytical products are operational, in that (e.g.) analysts or their decision making customers can interactively explore different combinations of analytical assumptions.
