So far in this series, we’ve reviewed some of the more widely deployed solutions for security analytics, and explained how and why they fail most of the time.
In this post we’re going to flip the script and talk about what does work for our customers, and what will work for you too. We’ve been dropping hints throughout the series, and if you’ve spent much time with us you may well have recognized them.
Before we get to that, though, let’s take a moment to review some of the reasons why these approaches typically don’t work. Then we’ll talk about how our approach deals with each of these issues in order to provide a platform for delivering workable and robust security analytics solutions.
Keeping the baby, not the bath water
Bayesian networks, machine learning and rules-based systems are applied successfully in many software systems across many domains, so clearly the techniques themselves are sound. Machine learning, for example, has been used by an internet search company to recognize objects in an image and automatically create captions for them.
But they don’t work for security analytics, and that’s primarily because each technique’s weaknesses have yet to be resolved appropriately for that kind of application.
Security analytics is a complex and technical field, requiring specialized knowledge of network systems, log files and analytic techniques. Bayesian networks, machine learning and rules-based systems – as well as the tools needed to implement them – are equally complex. Sadly, security analytics personnel possessing all of these skills are both rare and expensive, and without them your team will naturally be limited by what it knows, or by the challenge of translating its needs to the experts available.
Machine learning and rules-based systems don’t map directly to security analytics problems. Even simple implementations are complex enough to prevent practitioners from understanding solutions based on them. In the case of machine learning, the statistics and math behind each of multiple techniques and the rationale for selecting one technique over another are lost or forgotten once a choice is made. With rules-based systems, the sheer volume of rules creates a cognitive burden that prevents comprehensive understanding. Ultimately, this results in systems that are difficult to grasp and difficult to improve incrementally over time.
As practiced today, all three of these approaches are bound by significant constraints. Machine learning is dependent on data and thus is unable to offer solutions in cases where data is scarce or non-existent. Rules-based systems either produce far too many false positives or negatives, or require so many rules as to be inapplicable beyond one specific configuration. Bayesian network-based solutions, on the other hand, force practitioners to operate at the level of individual assertions and prevent larger and more meaningful networks from being constructed.
But despite their limitations, each of these techniques offers unique strengths that would need to be present in an ideal security analytics solution:
- Bayesian networks: Domain conceptual alignment and ability to reason on incomplete data
- Machine learning: Sheer power and ability to cope with massive quantities of data
- Rules-based systems: Intuitive simplicity and ease of getting started quickly
What’s needed is a solution that combines these strengths while also compensating for the individual weaknesses of each approach. Solutions that provide software libraries or toolboxes for developers fail because they are aimed at the wrong audience. For most domain users they are no help at all, and for non-developers they actually make the problem worse by adding a new layer of complexity. The ideal solution would be aimed at domain users and provide simple interfaces for applying these powerful analytic tools to security analytics problems.
At Haystax we’ve taken that user-focused approach, developing a software system called Haystax that exploits the combined strengths of all these approaches while also eliminating their drawbacks.
In Haystax, domain experts can interact with simple tools using their domain concepts and knowledge to create Bayesian network-based security analytics solutions that integrate data directly or transform it via machine learning. Using an easy-to-learn language, both direct rules and probabilistic inference can operate side by side and be integrated into a single model to create easily traceable, transparent and generalizable models for domains of interest. Machine learning and data science techniques can then be applied in simple and separable cases, ensuring that implementation decisions are easy to capture.
Step behind the curtain and let’s have a look at a security analytics solution that does work.
Model first …
In science, we learn to start with a hypothesis – an educated guess based on an understanding of a given problem and an expectation about how it works. We then prove or disprove the hypothesis by experimentation, data and analysis. Hundreds of years of experience have taught us that experiments without hypotheses mostly produce the wrong data, and attempting to build understanding from too much of the wrong data will generally yield poor results. To avoid that pitfall, we start with what we know.
We’ve created a language that lets domain experts simply state what they know to be true, using their own terms and concepts. The results are easily understandable descriptions of a domain at any level of abstraction. These descriptions can be written, shared, stored, version-controlled and mapped to source material or interviews. When compiled by our system, we get a fully formed, functional Bayesian inference network, or BayesNet – in simple terms, a model. (The image accompanying this post shows part of a BayesNet.)
Our modeling approach is successful because we enable analysts to work in their own domain, using their own terminologies and the qualitative statements they make every day. In addition, we provide a number of relationship ‘patterns’ that capture typical relationship types among concepts. For example, one concept may indicate another weakly, strongly or even absolutely (water falling from the sky, for instance, is a strong indicator of rain). Other pattern relationships include inverse, mitigation and summaries.
Rules, such as those found in a rules-based system, can be expressed as absolute indicators on model nodes. In this manner, we merge the techniques of Bayesian inference networks and rules-based systems into a single coherent framework for expressing meaning from data. Our approach is more effective than rules-based systems because we can intermingle rules-based inference with more nuanced probabilistic inference in the same model – perhaps even for the same concept.
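To make the pattern idea concrete, here is a minimal Python sketch. It is not the Haystax modeling language or inference engine. It assumes that each weak or strong indicator can be treated as an independent likelihood ratio applied to the odds of a concept, and that an absolute indicator behaves like a rule that settles the question outright. The names `Indicator`, `STRENGTH_LR` and `posterior`, and the numeric ratios, are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical likelihood ratios for each pattern strength (illustrative values only).
STRENGTH_LR = {"weak": 2.0, "strong": 10.0}


@dataclass(frozen=True)
class Indicator:
    name: str
    strength: str          # "weak", "strong" or "absolute"
    inverse: bool = False  # True if the indicator argues against the concept


def posterior(prior: float, observed: list) -> float:
    """Update belief in a concept from observed indicators.

    Assumes indicators are conditionally independent, which a real
    Bayesian network would not need to assume.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ind in observed:
        if ind.strength == "absolute":
            # Rule-like behavior: the indicator decides the outcome by itself.
            return 0.0 if ind.inverse else 1.0
        lr = STRENGTH_LR[ind.strength]
        odds *= (1.0 / lr) if ind.inverse else lr
    return odds / (1.0 + odds)


if __name__ == "__main__":
    rain_prior = 0.1
    print(posterior(rain_prior, [Indicator("water falling from the sky", "strong")]))
    print(posterior(rain_prior, [Indicator("official rain report", "absolute")]))
```

In this sketch the probabilistic and rule-like indicators share one update function, which is the sense in which the two styles of inference can sit side by side for the same concept.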
… Then go get the data
Creating a model first provides a structure for identifying the data needed to drive the model. This is a valuable step because it enables us to consider the data that’s needed and to ignore the data that’s unlikely to be useful. Only after creating a model do we consider the specific data we might have on hand, and how we can apply it to our security analytics problem.
Data comes in many formats, at different velocities and in widely varying volumes. In any problem, how that data gets applied to the model is a key part of developing a successful solution. Some data, such as personal information about an individual, changes rarely (if ever) and can be used directly in the model. Where a person lives, for example, when considered with salary records, can be an indicator of excessive wealth.
Other data, such as network user activity, changes frequently but is easily accessible. Such data can be highly indicative of malicious activity, but equally it may only be useful when considered against historical activity of the same type. Accessing a network or downloading files, for example, is usually normal activity, but when it happens outside business hours or at a time that’s atypical for a given person it can be an indicator of something more malicious.
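The distinction between “outside business hours” and “atypical for a given person” can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform’s implementation: the business-hours window, the `min_events` and `max_share` thresholds and the function name `assess_event_hour` are all hypothetical.

```python
from collections import Counter

BUSINESS_HOURS = range(8, 18)  # assumed window: 08:00 to 17:59 local time


def assess_event_hour(history_hours, hour, min_events=20, max_share=0.02):
    """Return two separate flags for an event occurring at `hour` (0-23).

    history_hours: hours of day of the same user's past events of this type.
    """
    outside_business_hours = hour not in BUSINESS_HOURS

    # Only judge "atypical" when there is enough history to compare against.
    atypical_for_user = False
    if len(history_hours) >= min_events:
        share = Counter(history_hours)[hour] / len(history_hours)
        atypical_for_user = share <= max_share

    return {
        "outside_business_hours": outside_business_hours,
        "atypical_for_user": atypical_for_user,
    }


if __name__ == "__main__":
    # A user who regularly works late: 22:00 is after hours but not unusual.
    history = [9] * 10 + [10] * 10 + [22] * 5
    print(assess_event_hour(history, 22))
    print(assess_event_hour(history, 3))
```

Keeping the two flags separate lets a model weigh them differently: a late-evening download from someone who regularly works late is after hours but not unusual for that person.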
In our Haystax Analytics Platform™, we provide software tools for describing an ontology of the things that can exist and how data sources map to create instances of them. This ontology forms the foundation of the solution, enabling data sources to map into the security analytics problem domain and enabling the analytic model to make inferences on ontology instances. The Haystax platform orchestrates all of these actions, enabling the system to operate transparently, consistently and reliably.
Data-source mappings can be direct (e.g., age, birth date or an employment event) or complex in cases where data science or other techniques are required to map the data. Such mappings can be expressed as simple configuration files or source code, as desired, and thus can be arbitrarily complex. Data-source mappings result in instantiations of ontology objects that the system creates and persists over time.
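As a rough illustration of the difference between direct and complex mappings, the Python sketch below treats a mapping as a dictionary from target attribute to either a source field name (direct) or a function of the record (complex). The class `OntologyObject`, the function `apply_mapping` and the record fields are hypothetical; the actual Haystax configuration format is not shown in this post.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Callable, Dict, Union


@dataclass
class OntologyObject:
    kind: str
    attributes: Dict[str, Any] = field(default_factory=dict)


# target attribute -> source field name (direct) or function of the record (complex)
Mapping = Dict[str, Union[str, Callable[[dict], Any]]]


def age_from_birth_date(record: dict) -> int:
    """Complex mapping: derive age in whole years as of the record's date."""
    born = date.fromisoformat(record["birth_date"])
    as_of = date.fromisoformat(record["as_of"])
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    return as_of.year - born.year - (0 if had_birthday else 1)


PERSON_MAPPING: Mapping = {
    "name": "full_name",         # direct: copy the source field
    "age": age_from_birth_date,  # complex: computed from the record
}


def apply_mapping(kind: str, mapping: Mapping, record: dict) -> OntologyObject:
    """Instantiate an ontology object from one source record."""
    attrs = {}
    for target, source in mapping.items():
        attrs[target] = source(record) if callable(source) else record[source]
    return OntologyObject(kind, attrs)


if __name__ == "__main__":
    rec = {"full_name": "A. Analyst", "birth_date": "1980-06-15", "as_of": "2020-06-14"}
    print(apply_mapping("person", PERSON_MAPPING, rec))
```

Because a complex mapping is just a function here, it could wrap anything from a date calculation to a trained classifier without changing how objects are instantiated.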
When needed, machine-learning and data-science techniques can be used in data-source mappings to capture specific indicators or to extract indicators based on learned features in data. In the Haystax platform, machine-learning techniques applied to data tend to repeat existing patterns, simplifying their implementation because there are examples to look at. For instance, some data is interesting only because it differs from past data; the platform provides capabilities to implement this case. Other examples of anomaly detection on the platform form the basis of mapping new data in the same way.
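One common way to express “interesting only because it differs from past data” is a z-score against the same entity’s history. The Python sketch below uses that technique as a stand-in; the post does not say which method the platform uses, and the function names and the threshold of 3.0 are assumptions.

```python
from statistics import mean, stdev


def deviation_from_history(history, value):
    """Return how many sample standard deviations `value` lies from the historical mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly constant history: any change at all is maximally unusual.
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


def is_anomalous(history, value, threshold=3.0):
    """Flag a value that differs strongly from past data of the same type."""
    return deviation_from_history(history, value) > threshold


if __name__ == "__main__":
    daily_download_mb = [100, 110, 90, 105, 95]
    print(is_anomalous(daily_download_mb, 102))
    print(is_anomalous(daily_download_mb, 400))
```

A flag like this would then feed the model as one indicator among many, rather than serving as an alert by itself.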
Applying new data to the system via the creation of ontology objects triggers evaluation of one or more Haystax inference models. During evaluation, ontology objects map to specific indications in the inference model(s) and are applied considering all aspects of their definition. In this way, two similar events that occur at different points in time can result in dramatically different interpretations by the model.
Typical Haystax security analytics ontologies are developed around a core object such as a person, building or computer system. Around that object, associated objects (e.g., events for a person or computer system) can be indicated by the various data sources and created as part of the mappings defined for the data source. These associated objects provide a fully contextualized description of the core object, and everything we know about it at any given point in time.
Haystax combines all of these tools in a platform that includes parallelization ‘for free’ in the implementation. (We say ‘for free’ because parallelized algorithms typically require additional work to transform them from their single-threaded implementations.) No specific considerations or concessions need to be made during model implementation to accommodate scaling, so the best mapping of data to model via the ontology can always be used. With Haystax, even novice implementers are able to create solutions that operate at internet scale.
A better outcome
Our system offers a genuinely different approach to security analytics that enables Haystax users to gain the benefits of complex security analytics techniques without the drawbacks of typical implementations. Haystax combines the benefits of Bayesian inference networks, machine learning and rules-based systems in a platform that provides the best features of each, with technical innovations that eliminate the drawbacks of more conventional implementations.
Rob Kerr is Chief Technology Officer at Haystax Technology.
Note: This is the fifth and final installment of our series on security analytics. Previous articles were:
- Part 1: Three Security Analytics Approaches That (Mostly) Don’t Work
- Part 2: Making Bayesian Networks Accessible
- Part 3: Machine Learning vs. Model-First Approaches to Analytics
- Part 4: Three Weaknesses of Rules-Based Systems