Overcoming Objections to Bayesian Networks - Part 1

By Ed Wright, October 27, 2016

In an earlier blog post we discussed the advantages of using Bayesian Networks as a representation for reasoning in complex problem domains. Despite these advantages, there are many who argue that Bayesian Networks are not sufficient for representing some types of real-world problems. Over the next several blog posts we will discuss some of these objections, and demonstrate that the objections can be overcome by clear thinking and appropriate models.

One objection to the use of Bayesian Networks, or of probability in general, is that the real world is too vague and imprecise to be represented by the strict rules of probability. Proponents of fuzzy logic, for example, argue that many real-world concepts do not conform to the requirements of probability, and therefore require an alternative representation like, well, fuzzy logic. Formally, fuzzy logic is a “form of many-valued logic in which the truth variables may be any real number between 0 and 1, considered to be ‘fuzzy’”. In contrast, probability theory requires that the truth value of a proposition must be 0 (false) or 1 (true), with a constraint that the probabilities of all possible states of a variable must sum to 1. The probability of a proposition is then either the proportion of a population for which the proposition is true (frequentist probability) or our current belief that the proposition is true (Bayesian, subjective probability).

A simple example will illustrate the issue. Suppose we have data about the height of an individual. The data consists of a number of text descriptions of the person, each of which refers to the person’s height with a label of ‘tall’, ‘medium height’ or ‘short’. The fuzzy logic argument is that there is no crisp definition of these labels. What is ‘tall’ to one observer may be ‘medium height’ to another. The fuzzy solution is to admit that someone can be simultaneously ‘tall’ and of ‘medium height’, or simultaneously ‘medium height’ and ‘short’, or potentially even a member of all three sets at once. This is done with membership functions that map specific heights to degrees of membership in the three labels. Proponents of fuzzy logic argue that because an individual can be in more than one state at a time (e.g., simultaneously ‘tall’ and ‘medium height’), this violates the laws of probability – which require that only one proposition can be true – and that therefore a probabilistic model cannot represent this problem.
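To make the fuzzy position concrete, here is a minimal sketch of overlapping membership functions for the three labels. The piecewise-linear shapes and the breakpoints (in inches) are assumptions chosen for illustration; they do not come from any particular fuzzy logic model.

```python
def ramp_up(x, lo, hi):
    """Rise linearly from 0 at `lo` to 1 at `hi`, clamped outside that range."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def memberships(height_in):
    """Degree of membership in each label for a height given in inches.

    All breakpoints below are hypothetical.
    """
    return {
        "tall": ramp_up(height_in, 69, 73),
        "medium height": min(ramp_up(height_in, 62, 66),
                             1.0 - ramp_up(height_in, 70, 74)),
        "short": 1.0 - ramp_up(height_in, 63, 68),
    }


# A person who is 5'10" (70 inches) belongs to more than one set at once.
print(memberships(70))
```

Note that the membership degrees for a single height are not required to sum to 1, which is one way in which they differ from a probability distribution.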

But in fact we can represent this problem in a Bayesian Network, using probabilities. We start with a variable that represents the true, but unknown, height of the individual. This is a continuous variable, which we can represent in a Bayesian Network as a discretized continuous variable. Then we represent the text labels that come from the descriptions as observations on the true height. (Note: The illustrations below are done using Netica, a commercial Bayesian Network development package from Norsys).

We start with a random variable that represents the individual’s true height.

[Figure blog-bn-01: the True Height node, showing the prior distribution over its discretized states]

For this simple example there are eight possible discrete states that cover the possible values from 0 to greater than 7 feet. If more precision is required, a finer discretization can be used. The belief bars in the node show the prior distribution, in percentages, that has been assigned to these states. (The specific numbers are neither authoritative nor important; what’s important here is the modeling approach.)
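As a rough sketch of what such a discretized prior looks like outside Netica, the snippet below stores one probability per state and checks that the values form a valid distribution. Only the state names SevenFeetPlus and FiveEight-SixFeet appear in this post; the other six names and all of the numbers are assumptions made up for illustration.

```python
import math

# Eight discretized height states, tallest first.
# Only "SevenFeetPlus" and "FiveEight-SixFeet" are names used in the post;
# the remaining names and every probability here are hypothetical.
PRIOR = {
    "SevenFeetPlus":      0.001,
    "SixSix-SevenFeet":   0.009,
    "SixFeet-SixSix":     0.140,
    "FiveEight-SixFeet":  0.350,
    "FiveFour-FiveEight": 0.350,
    "FiveFeet-FiveFour":  0.120,
    "FourFeet-FiveFeet":  0.025,
    "Zero-FourFeet":      0.005,
}


def check_distribution(dist, tol=1e-9):
    """Raise ValueError unless `dist` is non-negative and sums to 1."""
    if any(p < 0 for p in dist.values()):
        raise ValueError("probabilities must be non-negative")
    total = sum(dist.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"probabilities sum to {total}, not 1")


check_distribution(PRIOR)
```

A finer discretization would simply mean more keys in the dictionary; the same check still applies.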

Then we need a node that represents the possible text labels: ‘tall’, ‘medium height’ or ‘short’. This node represents an observation of the true height. To reason in a causal direction, it is the true height that is the cause of the observation, so in the Bayesian Network there is an arc from the True Height node to the Height Label Observation node.

[Figure blog-bn-02: the True Height node with an arc to the Height Label Observation node]

Because the Height Label Observation node has a parent, we need to define the conditional probability distribution (CPD) for the Height Label Observation node. For distributions with discrete (or discretized) variables, the CPD is a conditional probability table (CPT). Here is what it looks like in Netica (numbers in percentages).

[Figure blog-bn-03: the conditional probability table for the Height Label Observation node]

Each row of this table is the conditional probability distribution of Height Label Observation, given one state of True Height. In the first row, given that the true height is SevenFeetPlus, the probability distribution across the possible labels (in the order ‘tall’, ‘medium height’, ‘short’) is [100, 0, 0]. So no uncertainty there. Another example is the row for FiveEight-SixFeet: given that the True Height is between 5’8” and 6’, the probability distribution for an observer assigning a height label is [22, 67, 11].
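The structure of such a table can be sketched in a few lines of Python. The two rows quoted in the text (SevenFeetPlus and FiveEight-SixFeet) are taken from the post and converted from percentages; every other row, and every other state name, is an assumption invented to fill out the example.

```python
import math

LABELS = ("tall", "medium height", "short")

# One row per True Height state: P(label | true height), in LABELS order.
CPT = {
    "SevenFeetPlus":      (1.00, 0.00, 0.00),  # row quoted in the post
    "SixSix-SevenFeet":   (0.95, 0.05, 0.00),  # hypothetical
    "SixFeet-SixSix":     (0.70, 0.28, 0.02),  # hypothetical
    "FiveEight-SixFeet":  (0.22, 0.67, 0.11),  # row quoted in the post
    "FiveFour-FiveEight": (0.05, 0.60, 0.35),  # hypothetical
    "FiveFeet-FiveFour":  (0.01, 0.24, 0.75),  # hypothetical
    "FourFeet-FiveFeet":  (0.00, 0.05, 0.95),  # hypothetical
    "Zero-FourFeet":      (0.00, 0.00, 1.00),  # hypothetical
}


def validate_cpt(cpt, n_labels=len(LABELS), tol=1e-9):
    """Check that every row is a probability distribution over the labels."""
    for state, row in cpt.items():
        if len(row) != n_labels:
            raise ValueError(f"{state}: expected {n_labels} entries")
        if any(p < 0 for p in row):
            raise ValueError(f"{state}: negative probability")
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            raise ValueError(f"{state}: row sums to {sum(row)}, not 1")


validate_cpt(CPT)
```

The key point is that each row sums to 1 on its own: for any given true height, exactly one label is reported, and the row expresses our uncertainty about which one.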

Let’s consider what this CPT means. For any true height, there may be uncertainty in what label a random observer may apply. If we knew something about a specific observer, or about the conditions under which the observation was made, we could tailor a CPT to an individual observer. For example, a short observer is probably more likely to assign a label of ‘tall’, and there would likely be more uncertainty if the observer was a half mile away. This kind of contextual information could be captured as additional context variables in a more complete model.

Using this CPT in the Height Label Observation node, we get this display in Netica:

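[Figure blog-bn-04: the network with the computed distribution displayed in the Height Label Observation node]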

The probability distribution across the states of the Height Label Observation node is calculated (by Netica) from the prior distribution of the True Height node and the CPT of the Height Label Observation node. Already we can apply a common-sense filter: this distribution says that our prior belief, before we read the first description, is that the height label is most likely to be ‘medium height’, with a smaller but roughly equal probability of either ‘tall’ or ‘short’. That seems reasonable.
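The calculation behind that display is a weighted sum: the probability of each label is the sum, over all true-height states, of the prior probability of the state times the CPT entry for that label. A minimal sketch follows, using assumed numbers; only the CPT rows [1.00, 0.00, 0.00] and [0.22, 0.67, 0.11] come from the post, so the result will not match the Netica figures.

```python
LABELS = ("tall", "medium height", "short")

# Hypothetical prior over eight height states, tallest first.
PRIOR = [0.001, 0.009, 0.140, 0.350, 0.350, 0.120, 0.025, 0.005]

# P(label | state), one row per state, in LABELS order.
# Rows 0 and 3 are quoted in the post; the rest are hypothetical.
CPT = [
    [1.00, 0.00, 0.00],
    [0.95, 0.05, 0.00],
    [0.70, 0.28, 0.02],
    [0.22, 0.67, 0.11],
    [0.05, 0.60, 0.35],
    [0.01, 0.24, 0.75],
    [0.00, 0.05, 0.95],
    [0.00, 0.00, 1.00],
]


def label_marginal(prior, cpt):
    """P(label) = sum over states of P(state) * P(label | state)."""
    n_labels = len(cpt[0])
    return [sum(p * row[j] for p, row in zip(prior, cpt))
            for j in range(n_labels)]


marginal = dict(zip(LABELS, label_marginal(PRIOR, CPT)))
for label, p in marginal.items():
    print(f"{label}: {p:.3f}")
```

With these illustrative numbers, ‘medium height’ again comes out as the most likely label before any description has been read.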

To use these definitions to fuse multiple descriptions, we can make copies of the Height Label Observation node.

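[Figure blog-bn-05: the network with multiple copies of the Height Label Observation node, each a child of True Height]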

We then can apply evidence to the different observations. Suppose we have three descriptions where the three height descriptions are: ‘tall’, ‘medium height’ and ‘tall’. Here is what we get:

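[Figure blog-bn-06: posterior distribution of True Height given the observations ‘tall’, ‘medium height’ and ‘tall’]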

If the observations were ‘short’, ‘short’, and ‘tall’, we would get this:

[Figure blog-bn-07: posterior distribution of True Height given the observations ‘short’, ‘short’ and ‘tall’]

In these examples we end up with a posterior probability distribution across the states of the True Height node. Whatever the (unknown) true height actually is, it falls in one and only one of the available states. We don’t know for sure which state is the correct one, but we now have an updated posterior belief, given the data and the model, across the possible states of True Height.
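For a network this small, the fusion step can be reproduced by hand with Bayes’ rule. Because every observation node has True Height as its only parent, the observations are conditionally independent given the true height, so the posterior is proportional to the prior times the product of the CPT entries for the observed labels. The sketch below uses assumed numbers (only the CPT rows [1.00, 0.00, 0.00] and [0.22, 0.67, 0.11] come from the post, and all state names except SevenFeetPlus and FiveEight-SixFeet are hypothetical), so its output will differ from the Netica screenshots.

```python
STATES = ["SevenFeetPlus", "SixSix-SevenFeet", "SixFeet-SixSix",
          "FiveEight-SixFeet", "FiveFour-FiveEight", "FiveFeet-FiveFour",
          "FourFeet-FiveFeet", "Zero-FourFeet"]
LABELS = ("tall", "medium height", "short")

PRIOR = [0.001, 0.009, 0.140, 0.350, 0.350, 0.120, 0.025, 0.005]  # hypothetical

CPT = [  # P(label | state); rows 0 and 3 are quoted in the post
    [1.00, 0.00, 0.00],
    [0.95, 0.05, 0.00],
    [0.70, 0.28, 0.02],
    [0.22, 0.67, 0.11],
    [0.05, 0.60, 0.35],
    [0.01, 0.24, 0.75],
    [0.00, 0.05, 0.95],
    [0.00, 0.00, 1.00],
]


def posterior(prior, cpt, observations):
    """P(state | observations) for conditionally independent observations."""
    unnormalized = []
    for p, row in zip(prior, cpt):
        likelihood = 1.0
        for obs in observations:
            likelihood *= row[LABELS.index(obs)]
        unnormalized.append(p * likelihood)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observations have zero probability under the model")
    return [u / total for u in unnormalized]


for obs in (["tall", "medium height", "tall"], ["short", "short", "tall"]):
    post = posterior(PRIOR, CPT, obs)
    print(obs)
    for state, p in zip(STATES, post):
        print(f"  {state}: {p:.3f}")
```

Conflicting labels such as ‘short’, ‘short’ and ‘tall’ do not break the model; they simply shift probability toward the states where that combination of reports is least surprising.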

[Note: The example Bayesian Network discussed in this post, FuzzyObservation.neta, is available for download here. The example runs in Netica, a commercial Bayesian Network software application developed by Norsys Software Corp. A free demo version of Netica, available at the Norsys website, is more than sufficient to run the example Bayesian Network.]

The above example is simple, but it does illustrate that for the same kind of simple problem that is often used to justify the need for fuzzy logic, it is very straightforward to build a Bayesian Network model. Is this a better representation of this problem than a fuzzy logic model? It has the advantage of being an explicit, causal representation of the problem, and it can exploit both the laws of probability and all of the available statistical techniques.

Critics might ask: “where did the numbers come from?”, referring to the values in the prior and in the conditional probability distributions. In this simple example, the prior on true height could be obtained from population statistics. The conditional probability distributions could be obtained by straightforward statistical experiments or elicitation from experts. In both cases we are trying to estimate a well-known concept – a statistical probability distribution.
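As one illustration of the statistical route, a CPT row can be estimated as the relative frequency of each label among observers who described people of a known height. The counts below are invented purely for this sketch (they are not data from any experiment), and the optional `pseudocount` parameter is an assumed smoothing choice, not something the post prescribes.

```python
LABELS = ("tall", "medium height", "short")

# Hypothetical tallies: for each true-height bin, how many observers
# used each label. These counts are made up for illustration only.
COUNTS = {
    "SevenFeetPlus":     {"tall": 40},
    "FiveEight-SixFeet": {"tall": 22, "medium height": 67, "short": 11},
}


def estimate_cpt(counts, labels, pseudocount=0.0):
    """Estimate P(label | state) as (optionally smoothed) relative frequencies."""
    cpt = {}
    for state, label_counts in counts.items():
        smoothed = [label_counts.get(label, 0) + pseudocount for label in labels]
        total = sum(smoothed)
        if total == 0:
            raise ValueError(f"no observations for state {state!r}")
        cpt[state] = [c / total for c in smoothed]
    return cpt


estimated = estimate_cpt(COUNTS, LABELS)
for state, row in estimated.items():
    print(state, [round(p, 2) for p in row])
```

A small positive `pseudocount` keeps labels that were never observed for a state from being assigned a probability of exactly zero, which matters when the sample is small.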

The counter-argument about fuzzy methods is: “where do the membership functions come from?” At least with probability distributions we can use the methods of statistics, or expert judgment (i.e., subjective beliefs of experts), to estimate distributions. Additionally, the Bayesian Network model is easy to extend, if needed, with appropriate context variables – for example, for the height of the observer, or for the conditions under which the observations were made.

All of this is not to say that fuzzy logic methods do not work. There have been lots of successful applications of fuzzy logic, and fuzzy methods are appropriate to many problems. But especially when a problem domain is complex – such as insider threat or financial fraud detection – the ability to include concepts that are often considered to be beyond the application of probability, in a consistent probabilistic model, can be very powerful. That is, the above observed height fusion model may be a small part of a much more complex Bayesian Network.

In future posts, we will examine other perceived obstacles to the use of Bayesian Networks, and show how they can be overcome.

Ed Wright, Ph.D., is a Senior Scientist at Haystax Technology.

 

In this article: Analytics, Bayesian networks, Modeling