This is the second in a series of blog posts that will look at how common objections to the use of Bayesian networks can be overcome by clear thinking and appropriate models.
The first post showed that even concepts that seem vague or imprecise can be represented in a probabilistic model. This post addresses another common objection, that the knowledge engineering required to specify a Bayesian network is often a prohibitively expensive task.
Specifying a complex Bayesian network does require specifying a large number of parameters, specifically the entries on all of the required conditional probability tables (CPTs). Where do these parameters come from?
In some applications it is possible to learn the parameters from data. This can work, but it is only possible when the data sets required for learning are available. Another possibility is that the parameters are defined through knowledge elicitation from domain experts. This can also work, but it may require an expensive effort. Knowledge engineering may require identifying and obtaining access to one or more domain experts, as well as statistics experts who understand the requirements of the Bayesian network. Multiple knowledge engineering sessions may be required to elicit and then refine the values. It is also possible to learn parameter values by combining expert knowledge with available data.
At the end of the day the model and the parameters do have to be defined, and some potential users are scared away from using Bayesian networks because this step is perceived to be a prohibitively expensive bottleneck. However, in many cases it is possible to dramatically reduce the knowledge engineering effort to develop a model and define the parameters for a Bayesian network. I will illustrate an approach for this by introducing a toy problem, and defining a small Bayesian network to solve it. The approach has three components: an appropriate model, a recognition that neither perfection nor precision is required and an iterative process that builds, tests and refines the model.
The first component is to build an appropriate model. When the problem involves reasoning about things operating in some domain, it often pays to think first about the objects, or agents, in the domain and build a model that represents them, their attributes and the relationships between them. The attributes are typically represented as random variables; the relationships may be random variables or may be represented by the graphical links in the Bayesian network. We do this initially without any regard to what observations we may have or expect to have. Then, once there is a model of the objects/agents in the domain, we extend the model to include the observations that are available to us or that might become available.
The second part of the approach is a willingness to accept – even to embrace – simplicity, and a lack of precision. It is not necessary, especially with the first version of a model, to include every possible random variable in the model, or to require precision in specification of the model parameters. It is important to capture significant relationships, but it’s much easier to get a simple model working and then extend it than it is to create a complex model from scratch.
The next step in the approach is spiral development. We build a small simple model of an important part of the problem, test it by interacting with it to make sure the model responds in believable ways, then make refinements or extensions until a useful model is achieved.
With that introduction to the process, here is the toy problem – which uses the classic ‘blind-men-and-an-elephant‘ example:
Four blind men are walking on the savanna in Africa. They encounter an elephant. The first blind man has bumped into one of the elephant’s legs. He explores it with his hands and says: “I have found a tree.” The second blind man encounters one of the elephant’s ears: “No, it is a large palm leaf.” The third encounters the elephant’s trunk: “It is a python!” And the fourth blind man reaches out and finds the elephant’s tail: “No, you are all wrong – it is just a rope, hanging from a tree.” So, how can we combine these observations and reason that this is an elephant?
To build a model of this problem, first identify the objects, or agents, in the problem domain. We do want to keep things simple at the start, so we can identify that there is some object that the blind men have encountered, and that there are the blind men themselves. Let’s start with the object we wish to reason about: the object that the blind men have encountered. Its key attribute is its type. So we can start with a random variable that represents the type of the object. The object type is a random variable with multiple states. From the problem description, the possible states include: ‘tree,’ ‘palm leaf,’ ‘rope,’ ‘python’ and, of course, ‘elephant.’ In Netica, a commercial Bayesian-network development package from Norsys Software Corp., it looks like this:
Now we consider the blind men. The important attribute for them is their observation of the object. The blind men are the same, so we only need to specify the observation once. The observation is a random variable with four states: ‘tree,’ ‘palm leaf,’ ‘rope’ and ‘python.’ Because it is an observation of the object’s type, we model it in the Bayesian network as a child to the ‘Object Type’ node:
The network above still has the default probability distributions assigned by Netica. To complete the model, we need to define the parameters of the local probability distributions. That is, we need a prior distribution across the states of ‘Object Type,’ and a conditional distribution for the ‘Blind Observation’ given the object type. These numbers are not specified in the problem description, so where do they come from?
It would certainly be possible to devote considerable time and energy to defining the numbers by reviewing literature, conducting surveys, designing and implementing randomized experiments with blind men and African savannas or interviewing experts. In some problems that kind of effort may be appropriate. But for this model, and especially for the early versions of many models, it is not necessary to agonize over the process of defining the numbers needed for the required probability distributions. A lot of anecdotal evidence from constructing many Bayesian network models suggests that reasonable numbers will give reasonable results.
Let’s start with the prior distribution for the object type. What follows is a stream of consciousness thought process that will consider the problem and end up with a prior probability distribution for Object Type:
The model is developed from the ‘world’ defined by the problem description. In that world, we can reasonably assume that at least all of those states do exist, so there will be no prior probabilities of zero. We can envision an African landscape, with scattered trees, where some of them are palm trees. There is at least one elephant and elephants are usually together, in groups. And there must be at least an occasional rope hanging from a branch, plus the occasional python. Mentally examining this imagined landscape, we see lots of trees, a number of palm trees with large leaves and a parade of elephants. We probably can’t see any ropes or pythons, but we know that they are there.
That suggests there are more trees than palm leaves, more of either of them than elephants and the occasional rope or python. We do not need to specify actual probabilities; just articulating likelihoods for the different types is sufficient. What is important is the ratios between the likelihoods we assign to the different states. Let’s say 40 trees, 20 palm leaves, five elephants, and two apiece for ropes and pythons. (Note that a wide range of different numbers will work for this problem.) In the order that we defined the states, that yields the likelihood vector [40, 20, 2, 2, 5]. We can enter these numbers into the distribution table in Netica, and then use Netica’s Table | Normalize function (which scales them so that they sum to 100%) to turn those likelihoods into a prior probability distribution. (The probability distributions in Netica are typically shown as percentages.)
We next need to define the conditional probability distribution for a blind observation given the object type. That is, we must fill out this table:
For each row in the table, we must answer the question: What will a blind man observe if he encounters that object type? It would be possible to conceive of extensive experiments to collect data that would answer this question, or intense knowledge engineering sessions to try to elicit probabilities from knowledgeable experts. But often, especially in the early version of a model, it is possible to employ common-sense reasoning to come up with reasonable values for the needed numbers. As we did above, it is only necessary to specify likelihoods for each row. We can later use Netica to convert the likelihoods into probabilities.
Again, what follows is stream of consciousness for the kind of thinking that can generate the parameters required:
First consider a blind man who encounters a tree. He is likely to recognize through touch that it is a tree, so that outcome should have a large likelihood. Yet all sensors are ‘noisy’ and subject to error – even blind men – so we don’t want to use zero for any of the outcomes. Is there anything that might be confused for a tree? Ok, perhaps a python, if it were hanging from a branch, and was holding still… perhaps that could be confused for a tree, but it wouldn’t happen very often. Now pick some likelihood numbers consistent with that reasoning, say [80, 1, 1, 2].
Next, consider a blind man who encounters a palm leaf. He is likely to recognize that it is a palm leaf. And for this one, there is no other state that might be expected to be confused for a palm leaf. Again, we do recognize that all sensors are subject to error, so we do not wish to use any zeros. We must pick some numbers, so… [1, 80, 1, 1]
Now consider a blind man who encounters a rope, hanging from a branch. In this case it is conceivable that a rope could be confused with a small narrow tree trunk. And plausible that a rope could be confused with a python. Still, most of the time we expect that a rope will be recognized as a rope. And again we do not wish to use any zeros. So pick some numbers… [2, 1, 80, 10].
A blind man who encounters a python may be confused in similar ways as with a rope. A python could be confused with a tree, or even more likely with a rope, but most of the time it will be recognized as a python. We need to pick some numbers, so we might select [2, 1, 10, 80]
Now we get to the last row of the conditional probability table, where we model the blind man encountering an elephant. How to predict what a blind man will report? One possibility is just to count up the opportunities for the different misclassifications that are described in the problem definition. An elephant has four legs, two ears, one tail, and one trunk. We can use those counts as likelihoods [4, 2, 1, 1].
At this point the table has been filled in with likelihoods:
It is not necessary to use these exact numbers. A wide range of numbers will work for this problem. We use Netica’s Table | Normalize function to convert these likelihoods to probabilities (which sum to 100% across each row):
At this point the Bayesian network looks like this:
We can do the first round of ‘testing’ on this model by successively setting each state in the Object Type, and then each state in the Blind Observation to make sure that these two random variables interact with each other in ways that are expected and consistent with the problem domain. If necessary, make changes to the prior or to the conditional probabilities (or likelihoods) until the model ‘feels’ reasonable.
Now we can make three additional copies of the Blind Observation node, to represent the four blind men in the original story.
When we apply the evidence reported from the story, we can see that the Bayesian network has indeed identified the object as an elephant!
[Note: The example Bayesian network discussed in this post, BlindMenAndElephant.neta, is available for download here. The example runs in Netica, a demo version of which is available for free at the Norsys website that is more than sufficient to run the example Bayesian network.]
This Bayesian network was developed for a small yet interesting toy problem, but it has relevance to more complex problems. First, the model was developed logically, starting with a model of the important agents in the domain and their attributes – in this case the object and its type – followed by modeling the observations that are available in the domain – in this case the observations of the blind men.
Most importantly, it has demonstrated that at least in some cases it is possible to define parameters of a non-trivial model without an extensive or expensive knowledge engineering process. Reasonable numbers, defined using logical thinking, common sense and an understanding of the domain, are often sufficient to achieve reasonable results.
This problem, and this Bayesian network, can also be used to illustrate a common misstep that is sometimes made in Bayesian modeling. Suppose that in our original modeling we had decided to model the observation as the parent, and the Object Type as the child. This may even seem reasonable, because that is the way that we think. If we reason from data to inference, it can ‘make sense’ to build the model that way. And if we do, we get this:
The Bayesian network above does not have probabilities assigned; the numbers are just default values from Netica. At first blush, this network may even seem reasonable. But consider what happens when we try to define the probability distributions. Even defining a prior across the states of the blind observations feels awkward. But when we try to define the conditional distribution of the Object Type given four blind observations, we discover that we have to fill in a table with (4 x 4 x 4 x 4 =>) 256 rows.
For each row, we have to answer questions like: “If one observations is ‘tree,’ the second observation is ‘rope,’ the third observation is again ‘tree,’ and the fourth observation is ‘palmLeaf,’ then what is the likelihood that the object is a ‘tree’… a ‘palmLeaf,’… a ‘rope,’ etc.? This does not sound like fun! There are many more parameters, and even understanding them well enough to try to specify them is hard. The lesson here is that if defining the parameters of the model is too painful then that is evidence your model is wrong. It is almost always better to model observations as children of the random variable that are being observed.
There are some other lessons that can be extracted from this toy problem. First, an astute reader may have asked early on: “Where did the elephant in the model come from? That is, why is ‘elephant’ one of the states of the object?” That’s a valid question, since in a realistic problem we may not know that elephants exist until we encounter one. It’s still possible to use a Bayesian network to reason in such a domain, and it is done by explicitly including the state ‘other’ in the model. For example, in this very problem suppose we had the same four blind men and the same observations, but suppose that the possibility of ‘elephant’ had not already been encoded in the model.
Instead, a model can be constructed with five object states: the four that are known – ‘tree,’ ‘palmLeaf,’ ‘rope,’ and ‘python’ – and then a fifth state of ‘other’. The prior distribution of ‘other’ will likely be small, but it should not be miniscule. Then the last row of the conditional probability table for the blind observations will be the probability distribution across the possible observations, given that the object is ‘other’. Without any additional information, we can assign equal probabilities for each observations state. When we apply evidence of the four blind men to this model, we see that the probability of ‘other’ is very high.
If the automated system using this Bayesian network was coded to raise an alert when the probability of ‘other’ exceeded some threshold, a human analyst would at some point have a ‘Eureka!’ moment: “Oh! It’s an elephant!” Then the model could be extended to include the object state of ‘elephant.’ At that point, for completeness, the model should have six states for ‘Object,’ including both ‘elephant’ and ‘other’ – to account for future encounters with other unexpected objects – say, hippos, rhinoceroses or giraffes.
Finally, note that this model is a very simple fusion system, which infers the presence of some (perhaps rare or unexpected) state of the world by fusing observations from multiple sensors. The sensors here are not even ‘aware’ of some important states of the world (i.e., the elephant). This fusion system could be extended to account for sensors with different accuracies (e.g., some blind men are more reliable than others) or for different types of sensors. This model has a prior distribution across the states of the object, but that model could be extended with additional environment variables that are parents to the Object Type node, which would provide different distributions for different locations in Africa, or different times of year, and so on.
Any real-world problems of course will be considerably more complex than this example, with lots of variables and therefore a complex Bayesian network with lots of local-probability distributions that require parameters.
But we still have a reasonable prospect of defining a useful Bayesian network if we:
- Start small, beginning with simple models of the objects or agents that we wish to reason about, and then add the observations that we may have about those objects;
- Use engineering judgment to define reasonable parameters, without worrying about precision in early versions; and
- Test and evaluate the model by interacting with it – or with data if available – and refine as necessary.
Once the simple model gives reasonable results, we can then iterate to add new concepts and relationships until the model is complete enough to be useful.
Ed Wright, Ph.D., is a Senior Scientist at Haystax Technology.