
Structure learning of graph-based models on the basis of distributed knowledge

Table of Contents

1 Introduction
1.1 Classification
1.2 Objectives and structure

2 Bayesian networks
2.1 Basics
2.2 Definition
2.3 Example of a Bayesian network
2.4 Inference in Bayesian networks

3 Construction of Bayesian networks
3.1 The Bayesian approach
3.2 The frequentist approach

4 Machine learning of Bayesian networks
4.1 Learning situations
4.2 Structural learning of Bayesian networks
4.2.1 Quality measures
4.2.2 Search strategies
4.2.2.1 Simulated annealing
4.2.2.2 Greedy Hill Climbing
4.2.2.3 LAGD Hill Climbing
4.2.3 Experimental results
4.2.3.1 ALARM data set
4.2.3.2 MEDUSA data set

5 Integration of distributed knowledge
5.1 Competing fusion
5.1.1 Competing fusion via sampling
5.1.2 Competing fusion via LinOP aggregation
5.2 Complementary fusion
5.3 Cooperative fusion

6 Generation of rule bases using decision networks
6.1 Definition of a decision network
6.2 Example of a decision network
6.3 Definition of the fuzzy rule base
6.4 A framework for the compilation of decision networks into fuzzy rule bases
6.5 Implementation of an algorithm for the automatic generation of a rule base using a decision network
6.5.1 Experimental results for the decision network "grain problem"
6.5.2 Experimental results for the decision network "Stock exchange and economic situation"
6.6 Generalization to fuzzy rule bases

7 Summary and Outlook

Bibliography

Acknowledgments

I would like to thank Prof. Dr. Bernd Reusch and Dipl.-Inform. Stefan Berlik for supporting this work and for many valuable suggestions.

My thanks also go to everyone involved in the WEKA project for providing the development environment for machine learning algorithms. In particular, I would like to thank Dr. Remco Bouckaert, who kindly supported me with the integration of LAGD Hill Climbing into WEKA and who shares responsibility for the fact that this structure learning algorithm for Bayesian networks is now an integral part of the current WEKA release.

I would particularly like to thank Dipl.-Inform. Alexander Holland, who stood by me throughout with stimulating discussions, new perspectives, and valuable corrections.

List of figures

2.3.1 Specification of the variables of a Bayesian network

2.3.2 Specification of the network structure of a Bayesian network

3.1.1 Construction process of a Bayesian network

4.1 Learning Bayesian networks using databases

4.2.1.1 Test and training errors depending on the model complexity

4.2.1.2 The bias-variance dilemma

4.2.2.1 The three edge operations in the defined neighborhood concept

4.2.2.3.1 Main dialog of LAGD Hill Climbing for setting all important parameters

4.2.2.3.2 The main method of LAGD Hill Climbing

4.2.2.3.3 The central method getOptimalOperations

4.2.3.1.1 The ALARM network

4.2.3.1.2 Typical learning curve of LAGD Hill Climbing using the example of the ALARM data set

4.2.3.1.3 Debug outputs from LAGD Hill Climbing, here using the example of the ALARM data set

4.2.3.1.4 Results of structure learning with different algorithms on the ALARM data set

4.2.3.1.5 Dependence of the quality of the calculated network structure on the look-ahead depth using the example of the ALARM data set

4.2.3.1.6 Dependence of the quality of the calculated network structure on the number of good operations using the example of the ALARM data set

4.2.3.2.2 Dependence of the quality of the calculated network structure on the number of good operations using the example of the MEDUSA data set

4.2.3.2.3 Results of the classification of a Bayesian network learned on the basis of LAGD Hill Climbing using the example of a test series of 74 cases from the MEDUSA case database

5.1 Fusion of distributed knowledge in the form of Bayesian networks

5.1.1.1 Results of the sampling aggregation in the case of the Asia network

5.1.1.2 Results of the LinOP aggregation in the case of the ALARM network

6.2.1 Mr. Parker's grain problem as a simple Bayesian network

6.2.2 The complete qualitative representation of Mr. Parker's grain problem as a decision network

6.5.1 Implemented methods of the DNCompiler.java class

6.5.2 The main method compile for compiling the decision network myNet into an efficiently evaluable rule base

6.5.3 The getOptimalRuleBase method for determining an optimal rule base

6.5.4 The getOptimalRuleToAdd method for determining a rule that contributes most to increasing the quality of the rule base

6.5.5 The getComplexityOfRuleBase method for determining the complexity of a rule base

6.5.6 The getQualityOfRuleBase method for determining the quality of a rule base

6.5.1.1 The complete qualitative representation of Mr. Parker's grain problem as a decision network

6.5.1.2 Output of the DecisionNetworkCompiler for the decision network "grain problem"

6.5.2.1 The complete qualitative representation of the decision network "Stock exchange and economic situation"

6.5.2.2 Output of the DecisionNetworkCompiler for the decision network "Stock exchange and economic situation"

6.6.1 The complete qualitative representation of Mr. Parker's grain problem as a decision network

6.6.2 Design of the fuzzy controller

6.6.3 Defined linguistic terms using the example of the linguistic variable "kinked"

6.6.4 Output of the DecisionNetworkCompiler for the decision network "grain problem"

List of tables

2.3.1 P (family outside house)

2.3.2 P (stomach problems)

2.3.3 P (outside light on | family outside house)

2.3.4 P (dog in garden | family outside house, stomach problems)

2.3.5 P (dog barking | dog in garden)

4.2.2.1 The number of directed acyclic graphs g (n) as a function of the number of nodes n

4.2.3.1.1 Results of structure learning with different algorithms on the ALARM data set

4.2.3.2.1 Results of structure learning with different algorithms on the MEDUSA data set

6.2.1 P (dryness)

6.2.2 P (illness)

6.2.3 P (kinked | dryness, illness)

6.2.4 P (illness_1 | illness, treatment)

6.2.5 P (dryness_1 | dryness)

6.2.6 P (kinked_1 | dryness_1, illness_1)

6.2.7 utl (harvest)

6.2.8 utl (costs)

Chapter 1 Introduction

One of the most interesting areas of artificial intelligence is the subject of "rational action". According to the PAGE concept [RNo95], an agent lives in an environment, maps perceptual sequences (percepts) to actions, and thereby strives to reach a specific goal. Obviously, different agents are characterized by the function that maps percepts to actions. But what is meant by "rational action"?

In short, "acting rationally" means doing the "right thing". But what is the "right thing" in a given situation? Does the agent have to be omniscient and absolutely successful after performing the "correct" actions? Certainly not. Rather, "rational action" means carrying out, on the basis of the given information, those actions that are expected to contribute the maximum to achieving the goal.

For decades, the game of chess has been one of the benchmarks in artificial intelligence. The PAGE concept can also be applied here: the environment consists of the board, the time and the opponent; past and current positions on the board, which are created by moving pieces (the actions), are part of the perception. The goal is of course to win, perhaps in as few moves as possible or as elegantly as possible.

In chess, however, the entire environment is observable, so the agent has access to all information about the environment. Moreover, the environment is deterministic, that is, the next state of the environment can be reliably predicted from the current state and the current action.

In the real world, however, it is precisely these properties that usually cannot be guaranteed: the agent is dealing with partial observability of the environment, imprecise sensor data and uncertainty in the results of his actions. Under these conditions and with uncertain knowledge (instead of a concrete state there is, for example, only a probability distribution over the possible states), he has to select action sequences that contribute the maximum to achieving his goal (typically, in this context, one relies on scoring functions that assign scores to individual actions); obviously a much more difficult situation than playing chess.

This situation is also reflected in current benchmarks. After the legendary IBM computer Deep Blue defeated the reigning world chess champion Garry Kasparov for the first time in 1997, new areas were sought in which computers could be measured against human performance. A current benchmark in artificial intelligence is robot soccer.

Robot soccer is a perfect example of a real-world scenario where reasoning under uncertainty matters. In most of the leagues participating in the RoboCup (the annual world championship in robot soccer), the individual robot soccer players (agents) can only perceive excerpts of the environment and thus do not have a complete world view. Furthermore, the sensor data calculated on the basis of the camera images are often not only very inaccurate, but in extreme cases can even be simply wrong. The result of intended actions, for example a rotation by 90°, cannot be predicted deterministically; rather, more or less large deviations (of 3° and more, for example) have to be accepted. In addition to the described partial observability of the environment, the imprecise sensor data and the uncertainty in the result of the actions, there is another aggravating factor in robot soccer: it is a so-called multi-agent system [Mer99]. This means that not only must the action sequences of an individual agent be selected in such a way that they contribute the maximum to achieving the goal, but the entire behavior of the team has to be included in the function to be optimized.

If one goes back to the time before the first calculators, solving a math problem like [figure not included in this reading sample] would certainly have been seen as a form of intelligence, especially if one was able to extract the square root very quickly. Once everyone had a computer that could solve this task in a very short time, nobody called it intelligence anymore. It is similar with the game of chess: before the reigning human world champion in chess was defeated by a machine, beating an opponent of Kasparov's class was considered evidence of artificial intelligence. Since 1997 at the latest, this has no longer been viewed as intelligence, but rather devalued by nasty tongues as "simply executing a predetermined sequence of commands". But is human thinking so different from this?

Currently, competing against the human world champion in soccer is viewed as a form of artificial intelligence. To date, no robot team has beaten, or even played against, a human team. This long-term goal is to be achieved by 2050 (see, for example, [BMa03]). Before the threshold of what is considered intelligent (and what is not) is raised to this level, many problems remain to be solved in the software engineering area, especially in dealing with uncertain knowledge.

1.1 Classification

A few decades ago, it was important in science in general, and in artificial intelligence in particular, to avoid the factor of uncertainty as much as possible. However, this attitude has changed fundamentally in recent years. Dealing with uncertainty is now regarded as an important factor in modeling, which in many situations makes it possible to create suitable approximate models whose complexity is kept within limits.

Artificial intelligence offers a wide variety of approaches to dealing with uncertainty. Key words are, for example: non-monotonic logic, rules with certainty factors, fuzzy logic, or graph-based representations such as Bayesian networks and decision networks, which are the central objects of study of the present work.

Bayesian networks in particular have been promoted as the graphical framework that combines various aspects of artificial intelligence, overcomes the inadequacies of the closely related neural networks (e.g. with regard to their interpretability), and will one day be the key to successful intelligent applications and products. Bill Gates himself summed it up in a 1996 interview with the LA Times: "Microsoft's competitive advantage is its expertise in Bayesian networks." [Hel96]

In short, a Bayesian network is a graphical model for probabilistic relationships between a set of variables of a certain domain and is used for reasoning under uncertain knowledge (similar to human reasoning) and for dealing with new information. The probabilities used indicate the “strength” of the (not necessarily) causal, asymmetrical relationships between the individual variables.

A whole class of examples of the successful application of simple Bayesian networks are so-called naive Bayes nets used as e-mail filters. In view of the flood of spam, these are built into more and more programs and are usually even capable of learning, so that they can adapt individually to the user and to new spam mails.

In addition to such inferences, Bayesian networks are also often used for the exact opposite: the determination of the most likely explanation for the occurrence of certain effects. "Pathfinder" [Hec91] and "Vista" [Hor95] are worth mentioning as complex examples of this application.

Pathfinder is a program for the medical diagnosis of lymph node diseases. It is based on a Bayesian network that links 60 diseases with 100 symptoms via more than 14,000 probabilities and now surpasses the world's best (human) experts in diagnosis. Newer versions of Pathfinder can even suggest, based on the current evidence, the tests that promise the highest expected benefit for a reliable diagnosis.

The Vista system has been used for years at the NASA Mission Control Center in Houston, Texas, for the time-critical evaluation of live telemetry data; it provides the probabilities of faults in the propulsion systems of a space shuttle. Vista also goes beyond a pure Bayesian network and relies on decision-theoretic extensions that make it possible to propose the actions of highest expected benefit and to dynamically show the most relevant information on a display.

1.2 Objectives and structure

In the present work, the necessary probabilistic foundations with regard to Bayesian networks are explained first, in order to then introduce them formally.

On this basis, construction methods for Bayesian networks in general, and the use of machine learning methods in particular, are discussed. In this context, the aspect of structure learning in Bayesian networks is studied in detail. After structuring the approaches found in the literature, currently researched structure learning algorithms are discussed and compared, and the development and subsequent implementation of an own algorithm is presented. A runtime analysis and empirical tests on a synthetically generated database and on an acquired database from the medical application area round off this central chapter.

In a further chapter, possibilities for incorporating expert knowledge are discussed; the fusion of distributed knowledge is particularly interesting in this context. This concerns the integration of the (possibly contradictory) knowledge of several (human) experts encoded in Bayesian networks on the one hand, and of (sub)networks generated by machine learning methods (based on empirically obtained data in case databases) on the other. This is often the case, for example, when several doctors with distributed knowledge (sometimes located at different sites) form a team of specialists and have to make decisions.

Subsequently, possibilities are discussed for generating rule-based systems, such as IF-THEN rule bases, on the basis of a (learned) decision network. The background to this is that Bayesian networks often cannot be used in resource-limited environments for reasons of efficiency. A so-called compilation of a decision network based on a Bayesian network into a rule base enables the use of decision models based on Bayesian networks in time-critical domains and thus contributes to an expansion of the application area of Bayesian networks. For reasons that will be discussed later, fuzzy rule bases are particularly useful here. After a brief introduction of decision networks on the one hand and fuzzy rule bases on the other, a chapter follows that deals with the compilation of decision networks into fuzzy rule bases. In this context, a framework for compilation is derived and a pseudo-algorithm for solving this problem is presented. A concrete implementation of an algorithm based on this framework is presented, together with first results, in the last two subsections.

Chapter 7 briefly summarizes the main results of the present work and gives an outlook on some aspects that could be studied in more detail in future work.

Chapter 2 Bayesian Networks

In artificial intelligence, there is a wide variety of approaches to the representation of uncertain knowledge and to inference under it. One of these approaches, which has gained increasing importance in recent years and is currently an area of active research, is Bayesian networks.

Bayesian networks offer the possibility of compactly representing probabilistic relationships with the help of a graphical model and performing inference on this basis. Since reasoning in Bayesian networks is very similar to human reasoning and the handling of new information, they are particularly suitable for solving complex problems. The simple graphical structure also means that, on the one hand, the information encoded in a Bayesian network can easily be understood by people and, on the other hand, decision scenarios from the real world can be conveyed to a computer system without great difficulty.

Due to the close connection with stochastics on the one hand and graph theory on the other, some basic terms are first defined in compact form below, in order to then introduce Bayesian networks on this basis and explain them using an example.

2.1 Basics

A graph G = (V, E) is a pair consisting of a finite set of nodes V and a set of edges E ⊆ V × V. If E consists of ordered pairs (u, v), one speaks of a directed graph; in this context, u is also called a parent of v. The set of all parents of a node v is denoted by parents(v). A (directed) path in G is a sequence of edges of the form (v_1, v_2), (v_2, v_3), …, (v_{k-1}, v_k); it connects the node v_1 with v_k. In the case v_1 = v_k, it is a cycle. An acyclic directed graph does not contain any such cycles. w is called a successor of v if and only if a path from v to w exists.

A random variable X can generally assume several states x_1, …, x_m. A continuous set of states is also possible, but the present work is limited to the consideration of so-called discrete random variables, that is, random variables that can only assume a finite number of states. P (X = x) describes the probability that the random variable X assumes the state x. A complete joint probability distribution P (X_1, …, X_n) specifies the probability values for all possible assignments; here, P (X = x) is often written in the simplified form P (x). Bayesian networks offer the possibility of compactly specifying a complete joint probability distribution over a set of random variables X_1, …, X_n. The conditional probability P (A | B) is defined as P (A, B) / P (B), provided that P (B) > 0. The random variable A is (stochastically) independent of B exactly when P (A | B) = P (A). A and B are conditionally independent given C if P (A | B, C) = P (A | C).

This connection is of central importance in Bayesian networks. For example, with appropriate domain knowledge, the probability of having a hole in one's tooth, given the observations that, firstly, one has a toothache and, secondly, Germany will become soccer world champion, can be simplified as follows:

P (hole | toothache, Germany world champion) = P (hole | toothache)

This exploitation of the conditional independence between the various variables of a domain forms the basis for the enormous savings in storage space that Bayesian networks achieve compared to complete joint probability distributions.

2.2 Definition

As already indicated in the previous section, Bayesian networks are used for the compact representation of complete joint distributions. If one assumes that every random variable is binary (i.e. has only 2 possible states) and that the domain consists of only 32 variables, the corresponding table representing the joint probability distribution already has more than 4 billion entries! In general, the memory requirement is exponential, of order O(k^n), where k corresponds to the greatest cardinality of the value ranges and n to the number of random variables.

Bayesian networks can, in many situations, reduce the memory requirement to a manageable level by exploiting existing (conditional) independences between the random variables of the domain under consideration.
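For illustration, the arithmetic behind these claims can be checked directly; the assumed bound of at most 3 parents per node is a hypothetical example, not taken from the text:

```python
# Full joint distribution over n binary variables: 2**n entries.
n = 32
full_joint_entries = 2 ** n
print(full_joint_entries)  # 4294967296, i.e. more than 4 billion

# A Bayesian network instead stores one CPT per variable, with one row
# per assignment of the parents. For a hypothetical network over the
# same 32 binary variables in which no node has more than 3 parents:
max_parents = 3
cpt_entries_upper_bound = n * 2 ** max_parents
print(cpt_entries_upper_bound)  # at most 256 entries
```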

Definition 2.2.1 (Bayesian network): A Bayesian network consists of the following components:

- a set of random variables, represented by nodes, and a set of directed edges between them, which together form an acyclic directed graph
- for each random variable X_i, an associated table of conditional probabilities (conditional probability table, CPT) P (X_i | parents(X_i)), where parents(X_i) denotes the set of parents of X_i

If one adheres to the conventions in the following algorithm when constructing a Bayesian network, any entry of the complete joint distribution can easily be reconstructed from the associated Bayesian network.

Algorithm 2.2.1 (formal construction of a Bayesian network):

1. Select the random variables X_1, …, X_n relevant to the domain under consideration.
2. FOR i = 1 TO n: add the node X_i to the network and choose as parents(X_i) a minimal subset of {X_1, …, X_{i-1}} such that

P (X_i | parents(X_i)) = P (X_i | X_{i-1}, …, X_1).

To calculate any entry of the complete joint distribution on the basis of a Bayesian network, it is first necessary to introduce the so-called product rule. It states that P (A, B) = P (A | B) · P (B) (this is just an alternative formulation of the definition of conditional probability). By repeatedly applying the product rule, the chain rule can be derived:

P (x_1, …, x_n) = P (x_n | x_{n-1}, …, x_1) · P (x_{n-1} | x_{n-2}, …, x_1) · … · P (x_2 | x_1) · P (x_1)

The individual factors in the product of the last line can be simplified as follows due to the construction of a Bayesian network:

P (x_i | x_{i-1}, …, x_1) = P (x_i | parents(X_i)).

Viewed semantically, this equation says nothing other than that any entry of the joint probability distribution can be represented and calculated as a product of appropriately instantiated local conditional probability distributions.
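This factorization can be made concrete with a small sketch. The three-node chain A → B → C and its probability values below are hypothetical, chosen only for illustration:

```python
# Hypothetical chain network A -> B -> C with binary variables.
# Each CPT stores P(node = True | parent states); the False case is
# obtained as the complement, so here full rows are stored explicitly.
p_a = {True: 0.2, False: 0.8}                   # P(A)
p_b_given_a = {True: {True: 0.7, False: 0.3},   # P(B | A = True)
               False: {True: 0.1, False: 0.9}}  # P(B | A = False)
p_c_given_b = {True: {True: 0.5, False: 0.5},   # P(C | B = True)
               False: {True: 0.4, False: 0.6}}  # P(C | B = False)

def joint(a, b, c):
    """P(A=a, B=b, C=c) as a product of local conditional probabilities."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

print(joint(True, True, False))  # = 0.2 * 0.7 * 0.5 = 0.07

# Sanity check: the reconstructed joint must sum to 1 over all assignments.
total = sum(joint(a, b, c)
            for a in (True, False)
            for b in (True, False)
            for c in (True, False))
```

Instead of 2^3 = 8 explicit joint entries, only the three local tables are stored, and every joint entry is recovered on demand by the chain rule.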

Since the conditional probability tables in most cases require significantly less storage space than the explicit storage of all entries of the complete joint probability distribution, such a Bayesian network makes it possible to perform inference where the naive approach fails for lack of storage space.

In summary, Bayesian networks are an elegant way to represent uncertain knowledge compactly; in particular, storage space that grows exponentially with the number of variables is usually not required. This also offers considerable advantages in the construction of larger, and therefore more realistic, Bayesian networks for practical applications.

2.3 Example of a Bayesian network

After the formal introduction of Bayesian networks, a simple example by Charniak is given at this point for better illustration; it vividly clarifies the essential terms and notations.

Suppose, for example, that you come home in the evening and want to know whether the family is home before entering the house. We know that the family often leaves the outside light on when they have left the house, but sometimes also when a guest is expected. There is also a dog in the family. It can often be heard barking in the garden, but it is usually only there when it has stomach problems or the family is away. [Cha91]

So much for the situation, now for the construction of the Bayesian network: first, the variables relevant for this domain, including their value ranges, are defined and, if possible, arranged according to causality. In the example, these are the variables "family outside house", "stomach problems", "outside light on", "dog in garden" and "dog barking". All variables are binary and can take the values "true" or "false". Figure 2.3.1 shows the variables graphically (note the arrangement of the nodes in levels, ordered according to causality from top to bottom).

Figure not included in this excerpt

Figure 2.3.1: Specification of the relevant variables

The next step is to define the network structure. Typically, this is based on the causal interpretation of the edges, but it must be noted that the resulting graph must not contain any cycles. This results in the following network topology in the above context (see Figure 2.3.2). Directed edges are inserted exactly between variables (represented by cyan-colored ellipses) that show a causal relationship. The orientation of the edge is chosen so that it points from the cause to the effect.

Figure not included in this excerpt

Figure 2.3.2: Specification of the network structure

Finally, there are the so-called a priori probabilities of the root nodes (these express beliefs before, or in the absence of, further information) and the conditional probabilities of the conditional probability tables. Tables 2.3.1 to 2.3.5 show a possible assignment.

Figure not included in this excerpt

Table 2.3.1: P (family outside house)

Figure not included in this excerpt

Table 2.3.2: P (stomach problems)

Figure not included in this excerpt

Table 2.3.3: P (outside light on | family outside house)

Figure not included in this excerpt

Table 2.3.4: P (dog in garden | family outside house, stomach problems)

Figure not included in this excerpt

Table 2.3.5: P (dog barking | dog in garden)

In this phase it becomes apparent that the causal connections are not absolute; rather, the directed edges merely indicate possible (causal) connections. For example, it is entirely possible that the outside light is on even though the family is at home.

Of course, it is also conceivable to set up a network so complex (in terms of the number of variables and edge density) that it contains only absolute causal connections, but this contradicts the character of Bayesian networks, which are used precisely when one does not understand all relationships in the domain under consideration and is therefore dependent on probabilities.

Once the Bayesian network has been set up, it can be used, for example, to pose arbitrary queries; the output is the probability that the query variable assumes a certain value (depending on the given observations).

Here, too, a short example for better understanding: you observe that the outside light is on (outside light on = true), but you do not hear a dog barking (dog barking = false), and on the basis of this evidence you want to know how likely it is that the family is away from home (family outside house = true). The concrete probability value P (family outside house = true | outside light on = true, dog barking = false) (in this case about 0.500551727589584) can be calculated using inference mechanisms, which leads to the next chapter.
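This query can be reproduced by summing out the unobserved variables. The following sketch uses the CPT values of Charniak's original example [Cha91] (the tables themselves are not reproduced in this excerpt); with these values, enumeration yields the probability quoted above:

```python
from itertools import product

# CPTs of the family-out network; numbers are those of Charniak's
# example [Cha91]. Each table stores P(variable = True | parent states).
P_fo = {True: 0.15, False: 0.85}                  # P(family outside house)
P_bp = {True: 0.01, False: 0.99}                  # P(stomach problems)
P_lo = {True: 0.60, False: 0.05}                  # P(outside light on | fo)
P_dg = {(True, True): 0.99, (True, False): 0.90,  # P(dog in garden | fo, bp)
        (False, True): 0.97, (False, False): 0.30}
P_db = {True: 0.70, False: 0.01}                  # P(dog barking | dg)

def joint(fo, bp, lo, dg, db):
    """One entry of the joint distribution, factored along the network."""
    p = P_fo[fo] * P_bp[bp]
    p *= P_lo[fo] if lo else 1 - P_lo[fo]
    p *= P_dg[(fo, bp)] if dg else 1 - P_dg[(fo, bp)]
    p *= P_db[dg] if db else 1 - P_db[dg]
    return p

def query_fo(lo, db):
    """P(fo = true | lo, db), summing out the hidden variables bp and dg."""
    num = {True: 0.0, False: 0.0}
    for fo, bp, dg in product((True, False), repeat=3):
        num[fo] += joint(fo, bp, lo, dg, db)
    return num[True] / (num[True] + num[False])   # normalization

print(query_fo(lo=True, db=False))  # ~0.500551727589584
```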

2.4 Inference in Bayesian networks

When carrying out inference given a complete joint probability distribution of the random variables X_1, …, X_n, one generally has the following initial situation:

A set of query variables Q and a disjoint set of evidence variables E are given. The set of remaining variables H is called hidden.

Typically one is interested in the a posteriori joint distribution P (Q | E = e) of the query variables Q given the observed assignments e of the evidence variables E. According to the definition of conditional probability:

P (Q | E = e) = P (Q, E = e) / P (E = e).

By adding up the corresponding joint entries over all possible assignments h of the hidden variables, one obtains:

P (Q | E = e) = α · Σ_h P (Q, E = e, H = h).

(α here is a normalization constant that is irrelevant for the pure comparison between the various options. The individual summands are called marginal probabilities in this context.) This summation over marginal probabilities is also known as marginalization.

Applied to the example in the previous chapter, the calculation of P (family outside house = true | outside light on = true, dog barking = false) leads to the following quotient:

P (family outside house = true, outside light on = true, dog barking = false) / P (outside light on = true, dog barking = false)

where both the numerator and the denominator are obtained by summing the corresponding joint entries over all assignments of the hidden variables "stomach problems" and "dog in garden".

In the case of a Bayesian network, inference does not differ substantially from the case of the complete joint distribution, because the latter can, as explained above, be calculated by multiplying individual conditional probabilities. With the help of the joint distribution, any queries of the form P (Q | E = e) can be answered. In this respect, this important property of the complete joint distribution is not lost in the transition to Bayesian networks.

Often one is also in the situation of having set up a Bayesian network and entered certain evidence, when new knowledge flows into the so-called knowledge base. This new evidence must then be entered into the Bayesian network and propagated, so that the (conditional) probabilities of all CPTs are up to date with respect to the current evidence in the knowledge base. So here it is not about querying specific probabilities, but about an update of all probabilities of the entire network. This process is known in the literature as belief updating, and both exact algorithms and efficient approximation algorithms are known; compare, for example, [Jen01].

It should also be noted that the problem of answering general queries in an arbitrary Bayesian network (and thus all the more the belief updating of an entire network) is NP-complete. This is easy to understand when one considers that in the marginalization above one has to sum over all assignments of the hidden variables H, and depending on the cardinality of H this number can be exponential in the number of variables.

Chapter 3 Construction of Bayesian Networks

A prerequisite in the previous chapters was always that the complete joint probability distribution was already given. But where do these probabilities come from?

In the introductory example in Chapter 2.3, the required probabilities were chosen relatively arbitrarily. In general, there are two fundamentally different approaches: on the one hand the "frequentists", who hold that probabilities can only be determined from experiments; on the other hand the "subjectivists" (so-called "Bayesians"), who take the position that probabilities characterize the (subjective) beliefs of an agent and therefore need not necessarily reflect the conditions in the real world. The boundaries between these two opposing views are fluid. For example, in determining the probability that the sun will still exist tomorrow, one could argue as follows [Hum46]:

- The probability is undefined, because there has not yet been an experiment that checks the existence of the sun tomorrow (i.e. on that specific day).
- The probability is 1, because the sun existed in all experiments in the past (does the sun exist today?).
- The probability is 1 − ε, where ε is the proportion of stars in the universe that explode per day.
- The probability is (d + 1)/(d + 2), where d is the number of days the sun has already existed (after Laplace).
- The probability can be estimated from the type, age, size, and temperature of the sun, even though one has never seen another star with exactly these properties.

The first three approaches are frequentist, while the last two can be assigned to the Bayesian approach. It is interesting that even the frequentist approaches require subjective assumptions, which illustrates the fluid boundary between the two extremes.

3.1 The Bayesian approach

In this approach, named after Thomas Bayes (1702-1761), probabilities are interpreted as a measure of the degree of belief a person assigns to the occurrence of certain events (which are modeled by the respective random variables). As a result, a probability no longer represents an objective property that can be determined by physical experiments, but becomes a subjective quantity that depends on the person who determines it.

Nevertheless, this approach is often preferred because it offers the advantage, not to be underestimated, of being able to assign probabilities to events for which no repeatable random experiments can be carried out.

In most cases, the probabilities are determined (subjectively) by a group of experts, as is the case, for example, with the aforementioned Pathfinder system. Of course, with larger Bayesian networks the problem arises again that a number of probabilities exponential in the number of random variables not only has to be kept in memory but also has to be determined in the first place. Furthermore, in view of the immense number of probabilities required, it is entirely possible that the expert group has difficulties determining certain entries of the complete joint distribution.

The solution lies in the compact representation of the complete joint distribution by a Bayesian network. Instead of specifying each individual entry of the complete joint distribution, only the probabilities required for the conditional probability tables are specified. These can usually be set up by experts without great difficulty (if only because there are usually considerably fewer of them than in the case of the joint distribution) and, as shown above, they suffice to compute any entry of the complete joint distribution.

In this respect, the following phases are typically run through to construct a Bayesian network:

1. Specification of the variables: First, the random variables relevant for the domain are determined by a group of experts. Even in this step it is important to weigh a (too) high number of variables, which in the extreme case makes inference in the resulting Bayesian network impossible, against a too moderate number of variables, with which certain relationships may not be adequately explained.
2. Specification of the graph structure: This is done by inserting or omitting directed edges between the nodes set up in step 1. Typically one is guided by the causal interpretation of a directed edge between two random variables and inserts the corresponding edge if and only if the expert group sees a direct causal connection between the two variables involved. When inserting the edges, particular care must be taken that no cycles are created.
3. Specification of the conditional probability tables: The conditional probabilities of the conditional probability tables, which result from the structure set up in step 2, are determined by a group of experts. This phase often decides whether the Bayesian network can be set up as planned or has to be revised at least locally due to uncertain or missing entries in the CPTs (owing to a lack of expert knowledge).
4. Application of the Bayesian network: After applying the Bayesian network that results from the three main construction phases, the experience gained in this way can possibly be used to revise the constructed network.

Figure not included in this excerpt

Figure 3.1.1 illustrates the construction process graphically.

The following example illustrates the time required for the individual phases: to set up the Bayesian network for the Pathfinder system mentioned in the introduction, 8 hours were required to define the variables, 35 hours to set up the network topology, and a further 40 hours to define the required conditional probabilities.

Figure not included in this excerpt

Figure 3.1.1: Construction process of a Bayesian network

Although such subjective determinations of the network structure (phase 2) and of the conditional probabilities (phase 3) may seem strange, studies have shown that experts are generally quite good at setting these things up. In one study [SFB89], the probabilities determined by doctors for setting up a Bayesian network were compared with probabilities calculated on the basis of a case database acquired in practice. The result: the expert group came very close to the experimentally determined probabilities, but it was noticeable that the doctors were typically too quick to assign certain events a probability of 0.

On the other hand, one can cite enough other studies showing that even experts in the domain under consideration have difficulty specifying exact conditional probabilities, especially when the number of parent nodes is large (compare [DGa00] and [KST82]).

Often, however, one can also fall back on databases with certain training cases. These data, mostly obtained experimentally, can then be incorporated into the construction of Bayesian networks, which leads to the machine learning problem for Bayesian networks:

Definition 3.1.1 (machine learning problem for Bayesian networks): The following sets and measures are given:

1. a set of random variables X1, ..., Xn,
2. a set of training cases D = {d1, ..., dm},
3. a quality measure Q.

The aim is to find a Bayesian network B (i.e. a structure G and corresponding conditional probability tables) that optimally explains the given training cases D with respect to the quality measure Q. Each individual training case dj assigns to each variable Xi either a value from its domain or no value at all, so the training cases may also be incomplete.
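As a small illustration of item 2 of the definition, incomplete training cases can be represented, for example, by marking unobserved values explicitly. The variable names and values below are invented:

```python
# Hypothetical variables and training cases; None marks a missing
# observation, which makes a case incomplete.
VARIABLES = ["X1", "X2", "X3"]

cases = [
    {"X1": "a", "X2": "b", "X3": "a"},   # complete training case
    {"X1": "a", "X2": None, "X3": "b"},  # X2 unobserved: incomplete case
]

def is_complete(case):
    """A case is complete if every variable has an observed value."""
    return all(case[v] is not None for v in VARIABLES)
```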

The Bayesian approach is based on Bayes' theorem:

Theorem 3.1.1 (Bayes' theorem): P(A | B) = P(B | A) P(A) / P(B).

This relationship follows directly from the product rule defined in Section 2.2 and a symmetry argument:

P(A | B) P(B) = P(A, B) = P(B, A) = P(B | A) P(A). Division by P(B) yields the theorem.

Bayes' theorem is particularly useful when one wants to derive diagnostic probabilities (probabilities of the form P(Cause | Effect)) from causal ones (P(Effect | Cause)), because the latter are often easier to determine.
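As a small numerical illustration of this derivation (all numbers are invented), a diagnostic probability can be obtained from causal ones as follows:

```python
# Hypothetical causal probabilities and prior; the point is only the
# mechanics of Bayes' theorem, not the concrete numbers.
p_cause = 0.01                # prior P(Cause)
p_effect_given_cause = 0.9    # causal: P(Effect | Cause)
p_effect_given_not = 0.1      # causal: P(Effect | not Cause)

# total probability of the effect (sums over both cases of the cause)
p_effect = (p_effect_given_cause * p_cause
            + p_effect_given_not * (1 - p_cause))

# diagnostic: P(Cause | Effect) = P(Effect | Cause) * P(Cause) / P(Effect)
p_cause_given_effect = p_effect_given_cause * p_cause / p_effect
```

Despite the strong causal link (0.9), the diagnostic probability stays below 0.1 here, because the cause itself is rare.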

In machine learning, one is interested in the posterior probability distribution P(B | D) of the models, which can be calculated with Bayes' theorem from the so-called likelihood P(D | B) of the training data, the a priori probability distribution P(B), and the probability P(D) of the training data. The likelihood indicates the probability of the occurrence of the data D given the model B.

Note that the subjective factor enters here in the form of the a priori probability distribution P(B), which describes nothing other than the subjective probabilities of the possible models when no empirical data are known.

Since one typically only wants to determine the best model and does not need a whole probability distribution over all models, one obtains:

B* = argmax_B P(B | D) = argmax_B P(D | B) P(B) / P(D).

Because the (a priori) probability P(D) of the training data is merely a constant with regard to the determination of an optimal model B, this can be simplified to:

B* = argmax_B P(D | B) P(B).

This simplification is known in the literature (compare [Wit02]) as maximum a posteriori learning (MAP learning).
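A minimal sketch of MAP learning over a toy model space might look as follows. Both the likelihoods and the priors below are invented; the example merely shows how a strong prior can override a higher likelihood:

```python
import math

# Hypothetical candidate models with likelihoods P(D | B) and subjective
# prior probabilities P(B); P(D) is dropped as a constant.
likelihood = {"B1": 1e-6, "B2": 5e-6, "B3": 2e-6}
prior      = {"B1": 0.7,  "B2": 0.1,  "B3": 0.2}

def map_model(likelihood, prior):
    """argmax_B  log P(D | B) + log P(B)."""
    return max(likelihood,
               key=lambda b: math.log(likelihood[b]) + math.log(prior[b]))
```

Here B2 has the highest likelihood, but the prior pulls the MAP choice to B1: exactly the subjective factor discussed above.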

3.2 The frequentist approach

In contrast to the Bayesian approach, probabilities are seen here as a physical property of the domain under consideration and can thus, in principle, be determined with the help of random experiments that can be repeated arbitrarily often but are independent of one another. At no point are subjective assumptions made about certain probabilities. Therefore, one often finds the term "objective probability" in the literature [Wit02].

Typically, one has a database of experimentally determined training data and faces the same machine learning problem as explained in Definition 3.1.1. In contrast to the Bayesian approach, however, one avoids introducing subjective probabilities into the model. As a result, this approach is all about maximizing the likelihood (hence also maximum likelihood method) of the given training data D for the given model B. The likelihood of all training data is calculated as the product of the likelihoods of the individual training cases:

L(D | B) = P(D | B) = Π_j P(d_j | B).

An optimal model can then be described as an element of the set of models that maximizes this expression among all possible models:

B* = argmax_B L(D | B).

For technical reasons, one often passes to the so-called log-likelihood and maximizes that instead:

log L(D | B) = Σ_j log P(d_j | B).
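For a single discrete variable, likelihood and log-likelihood can be illustrated as follows. The data and the candidate distribution are invented; in this simple case, the relative frequencies are the maximum likelihood solution:

```python
import math
from collections import Counter

# Invented i.i.d. observations of one discrete variable and a candidate
# distribution ("model") over its values.
data = ["a", "a", "b", "a", "c", "b"]
model = {"a": 0.5, "b": 0.3, "c": 0.2}

def log_likelihood(data, model):
    """log L(D | B) = sum_j log P(d_j | B): the product becomes a sum."""
    return sum(math.log(model[d]) for d in data)

# For a single variable, the maximum likelihood model is simply the
# vector of relative frequencies of the observed values.
counts = Counter(data)
ml_model = {v: c / len(data) for v, c in counts.items()}
```

No other distribution over the same values achieves a higher log-likelihood on this data than the relative frequencies.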

With the frequentist approach, no subjectively determined factor as in the Bayesian approach enters at any point. Rather, the learning process determines the model under which the given training data are most probable.

Finally, it should be mentioned that maximum a posteriori learning and the maximum likelihood method converge to each other as the number of training cases increases, which is ultimately due to the decreasing influence of the subjective factor in MAP learning.

Chapter 4 Machine Learning of Bayesian Networks

In many cases, one can fall back on databases with whose help a Bayesian network can be set up according to the approaches presented in the previous chapter. Figure 4.1 illustrates the transformation of a so-called case database into a Bayesian network.

Figure not included in this excerpt

Figure 4.1: Learning Bayesian networks using databases

In the literature (compare [BKr02], [Bor00], and [Ste03]), this area falls under the term "machine learning of Bayesian networks" (using a given database) and leads to the central topic of the present work.

The construction process of a Bayesian network is similar to that shown in Figure 3.1.1, but machine learning methods are used in two places, namely when determining the network structure (phase 2) and when specifying the conditional probabilities (phase 3).

4.1 Learning situations

According to Definition 2.2.1, a Bayesian network consists of two components: on the one hand the graph topology (network structure), and on the other hand the parameters of the conditional probability tables induced by the network structure. Both components can be learned on the basis of given data. Some authors (see for example [BKr02]) also distinguish in this context between learning the global structure and learning the local structure. The global structure relates to the topology of the network, while learning the local structure involves setting the parameters of the CPTs. A further distinction must be made as to whether the available training data are complete or incomplete. Incomplete training data always (but not exclusively) arise when there are hidden variables in the network. For example, in medical diagnosis it is often the case that certain pre-existing factors and some symptoms can be observed, but the actual disease cannot. Of course, such a variable could simply be omitted, but two facts argue against this: on the one hand, one is often interested in the value of these hidden variables; on the other hand, hidden nodes can typically reduce the complexity (measured by the number of conditional probabilities required to specify the CPTs) of the Bayesian network significantly.

Depending on whether the structure of the Bayesian network to be learned is already known or specified by a group of experts or not, the following learning situations can be distinguished:

1. Known structure and complete training data: This is the simplest of all cases, because with the help of the complete training data only the conditional probability tables already predetermined by the structure need to be filled with parameters (the corresponding conditional probabilities). There are analytical solutions to this problem for both the frequentist and the Bayesian approach, and it also appears as a sub-problem in the other learning situations discussed below.

2. Known structure and incomplete training data: Here, too, the problem of learning the conditional probabilities in the CPTs already given by the specified structure arises, but the training data are incomplete. That means there are no observations for some of the variables of the Bayesian network. In principle, it is possible to discard these special training cases and to build up the CPTs from the complete training cases as in case 1. However, in this way one may discard a large proportion of valuable training data, which then does not enter into model creation. Therefore, there are various methods that also include the incomplete training cases. One of the best known is the so-called expectation maximization method; compare for example [DLR77].

3. Unknown structure and complete training data: The main problem here is the determination of the network structure; for a fixed topology, the sub-problem of specifying the parameters of the CPTs can then be solved with the analytical approaches mentioned in case 1. There are three different approaches. In the approach most frequently cited in the literature - since it is the most promising - one first defines a quality measure for the evaluation of a given network structure and then carries out a local search in the space of all possible network structures. A wide variety of search strategies are used here, although an exhaustive search is ruled out by the exponential growth in the number of different directed acyclic graphs with an increasing number of nodes. This is why heuristic methods such as greedy hill climbing are often used in order to find at least a local optimum (with regard to the quality measure) in the space of network structures. Simulated annealing is an alternative that can determine the global optimum, provided the temperature is reduced slowly enough and enough computing time is allowed.

4. Unknown structure and incomplete training data: This case represents the most complex learning situation, because not only do the network structure and the corresponding parameters of the CPTs have to be determined, but the training data are also incomplete. For this case, an extension of the expectation maximization algorithm already mentioned, the so-called structural EM algorithm, has emerged. There are variants for the frequentist as well as the Bayesian approach; they have in common the alternating improvement of the structure and of the parameters of the associated conditional probability tables (see also [Fri98]).
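The greedy hill climbing mentioned in learning situation 3 can be sketched as follows. This is a deliberately tiny stand-in: the score function merely rewards a fixed "true" edge set instead of evaluating training data, and all names are invented; a real implementation would plug in a decomposable quality measure as discussed in Section 4.2.

```python
from itertools import permutations

# Tiny stand-in for structure search: nodes, a fixed "true" edge set,
# and a toy score that rewards true edges and penalises spurious ones.
NODES = ["A", "B", "C"]
TRUE_EDGES = {("A", "B"), ("B", "C")}

def score(edges):
    # placeholder for a quality measure Q(B | D)
    return sum(1 if e in TRUE_EDGES else -1 for e in edges)

def is_acyclic(edges):
    """Kahn-style check: repeatedly remove nodes without incoming edges."""
    remaining, es = set(NODES), set(edges)
    while remaining:
        free = [n for n in remaining if all(v != n for _, v in es)]
        if not free:
            return False          # a cycle remains
        for n in free:
            remaining.discard(n)
            es = {(u, v) for (u, v) in es if u != n}
    return True

def neighbours(edges):
    """All structures reachable by one edge insertion, deletion or reversal."""
    for u, v in permutations(NODES, 2):
        e = (u, v)
        if e in edges:
            yield edges - {e}                 # delete e
            yield (edges - {e}) | {(v, u)}    # reverse e
        elif (v, u) not in edges:
            yield edges | {e}                 # insert e

def hill_climb(start=frozenset()):
    current = set(start)
    while True:
        best = max((n for n in neighbours(current) if is_acyclic(n)),
                   key=score, default=None)
        if best is None or score(best) <= score(current):
            return current                    # local optimum reached
        current = best
```

Starting from the empty graph, this search recovers TRUE_EDGES in two steps; the caching of local substructure scores discussed in Section 4.2 is omitted for brevity.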

4.2 Structural learning of Bayesian networks

Since analytical procedures already exist for learning the conditional probabilities in the conditional probability tables when the structure is known, the main focus of the present work is on learning the structure of a Bayesian network from a case database.

As already outlined in the introduction, this is a high-dimensional search problem for which a wide variety of possible solutions are currently being studied. However, one should not expect efficient exact algorithms in this area, because some sub-problems in the area of structure learning for Bayesian networks are already known to be NP-complete, for example finding the highest-rated legal network structure with the help of a scoring function (cf. [Chi96]).

A distinction is made between the following three approaches to structure learning:

- Local score metrics: The structure learning of a Bayesian network B is regarded here as an optimization problem in which a so-called quality measure Q(B | D) is to be maximized for given training data D. The decisive advantage of local score metrics is that the score of the entire network structure can be split up as a sum or product of individual local scores for certain substructures of the Bayesian network. This enables the use of local search strategies such as hill climbing or simulated annealing. Furthermore, scores of substructures that occur while traversing the search space can be cached and reused for later evaluations of similar network structures. This leads to considerable savings in computing time, which must be dispensed with in the case of global score metrics.

- Global score metrics: With this approach, too, a given network structure is assessed with the aid of a quality measure. The decisive difference from local score metrics is that the total score cannot be broken down into individual summands or factors, i.e. local substructures cannot be assessed separately and therefore cannot be reused for later, faster evaluations of similar network structures. Instead, the performance of a Bayesian network is measured by its ability to predict data sets that do not occur in the training set. Typically, the available training data are repeatedly divided into training and test sets, and a cross-validation is carried out in order to estimate the quality of an entire network from the prediction accuracy, on the test set, of the network learned from the training set. The average performance of a Bayesian network over the test sets induces a metric for the quality of a network. As in the case of local score metrics, this metric can then be used to traverse the search space, supported by a heuristic search strategy such as hill climbing or simulated annealing.

- Conditional independence tests: This approach is based on discovering (conditional) independencies between variables in the case database given for training. It is assumed that a directed acyclic graph exists which exactly reflects the (conditional) independencies in the training data. The aim is therefore to find exactly that Bayesian network whose structure encodes the conditional independencies of the distribution that generated the training data. Various tests for verifying the (conditional) independence between variables of the domain under consideration then determine whether or not edges are inserted between them. Of course, with training data obtained in practice, one cannot expect to discover exact (conditional) independencies; rather, one specifies a threshold parameter that determines how independent two variables must be before no edge is inserted between them. Once the locations of the edges have been found, they are oriented in such a way that they optimally represent the uncovered (conditional) independencies in the training data.
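As an illustration of the threshold idea (not of any specific algorithm from the literature), a marginal independence test between two discrete variables can be based, for example, on their empirical mutual information. The data and the threshold below are invented:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(X; Y) from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts
    px = Counter(x for x, _ in pairs)    # marginal counts of X
    py = Counter(y for _, y in pairs)    # marginal counts of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def independent(pairs, threshold=0.01):
    """Insert no edge between X and Y if their dependence stays below t."""
    return mutual_information(pairs) < threshold
```

A conditional test would additionally partition the samples by the values of the conditioning set and combine the resulting conditional measures.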

In the further course of the present work, procedures in the area of local score metrics are discussed in more detail. All methods for structure learning of Bayesian networks in this area have the following components in common:

- a quality measure to evaluate the quality of a given network structure
- a search strategy in the space of possible network structures

These two components are discussed in detail in the following subsections.

[...]

End of the reading sample from 151 pages