# Full text of "mit :: ai :: aim :: AIM-1570"

## See other formats

MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY A.I. Memo No. 1570 August 1996 Mean Field Theory for Sigmoid Belief Networks Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. Abstract We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. Our mean field theory provides a tractable approximation to the true probability distribution in these networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this framework on a benchmark problem in statistical pattern recognition — the classification of handwritten digits. Copyright © Massachusetts Institute of Technology, 1996 This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for this research was provided in part by the Advanced Research Projects Agency of the Department of Defense under Office of Naval Research contract N00014-94-1-0777. The authors were also supported by NSF grant CDA-9404932, ATR Research Laboratories, and Siemens Corporation. 1 Introduction Bayesian belief networks (Pearl, 1988; Lauritzen & Spiegelhalter, 1988) provide a rich graphical represen- tation of probabilistic models. The nodes in these net- works represent random variables, while the links rep- resent causal influences. These associations endow di- rected acyclic graphs (DAGs) with a precise probabilis- tic semantics. The ease of interpretation afforded by this semantics explains the growing appeal of belief networks, now widely used as models of planning, reasoning, and uncertainty. Inference and learning in belief networks are possible insofar as one can efficiently compute (or approximate) the likelihood of observed patterns of evidence (Bun- tine, 1994; Russell, Binder, Koller, & Kanazawa, 1995). There exist provably efficient algorithms for computing likelihoods in belief networks with tree or chain-like ar- chitectures. In practice, these algorithms also tend to perform well on more general sparse networks. However, for networks in which nodes have many parents, the ex- act algorithms are too slow (Jensen, Kong, & Kjaefulff, 1995). Indeed, in large networks with dense or layered connectivity, exact methods are intractable as they re- quire summing over an exponentially large number of hidden states. One approach to dealing with such networks has been to use Gibbs sampling (Pearl, 1988), a stochastic simu- lation methodology with roots in statistical mechanics (Geman & Geman, 1984). Our approach in this pa- per relies on a different tool from statistical mechanics — namely, mean field theory (Parisi, 1988). The mean field approximation is well known for probabilistic models that can be represented as undirected graphs — so-called Markov networks. For example, in Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985), mean field learn- ing rules have been shown to yield tremendous savings in time and computation over sampling-based methods (Peterson & Anderson, 1987). The main motivation for this work was to extend the mean field approximation for undirected graphical mod- els to their directed counterparts. Since belief networks can be transformed to Markov networks, and mean field theories for Markov networks are well known, it is natu- ral to ask why a new framework is required at all. The reason is that probabilistic models which have compact representations as DAGs may have unwieldy represen- tations as undirected graphs. As we shall see, avoiding this complexity and working directly on DAGs requires an extension of existing methods. In this paper we focus on sigmoid belief networks (Neal, 1992), for which the resulting mean field theory is most straightforward. These are networks of binary random variables whose local conditional distributions are based on log-linear models. We develop a mean field approximation for these networks and use it to compute a lower bound on the likelihood of evidence. Our method applies to arbitrary partial instantiations of the variables in these networks and makes no restrictions on the net- work topology. Note that once a lower bound is available, a learning procedure can maximize the lower bound; this is useful when the true likelihood itself cannot be com- puted efficiently. A similar approximation for models of continous random variables is discussed by Jaakkola et al (1995). The idea of bounding the likelihood in sigmoid belief networks was introduced in a related architecture known as the Helmholtz machine (Hinton, Dayan, Frey, & Neal 1995). A fundamental advance of this work was to es- tablish a framework for approximation that is especially conducive to learning the parameters of layered belief networks. The close connection between this idea and the mean field approximation from statistical mechan- ics, however, was not developed. In this paper we hope not only to elucidate this con- nection, but also to convey a sense of which approxima- tions are likely to generate useful lower bounds while, at the same time, remaining analytically tractable. We develop here what is perhaps the simplest such approx- imation for belief networks, noting that more sophisti- cated methods (Jaakkola & Jordan, 1996a; Saul & Jor- dan, 1995) are also available. It should be emphasized that approximations of some form are required to handle the multilayer neural networks used in statistical pattern recognition. For these networks, exact algorithms are hopelessly intractable; moreover, Gibbs sampling meth- ods are impractically slow. The organization of this paper is as follows. Section 2 introduces the problems of inference and learning in sig- moid belief networks. Section 3 contains the main contri- bution of the paper: a tractable mean field theory. Here we present the mean field approximation for sigmoid be- lief networks and derive a lower bound on the likelihood of instantiated patterns of evidence. Section 4 looks at a mean field algorithm for learning the parameters of sigmoid belief networks. For this algorithm, we give re- sults on a benchmark problem in pattern recognition — the classification of handwritten digits. Finally, section 5 presents our conclusions, as well as future issues for re- search. 2 Sigmoid Belief Networks The great virtue of belief networks is that they clearly exhibit the conditional dependencies of the underlying probability model. Consider a belief network defined over binary random variables S = (Si , S2, ■ ■ ■ , £jv)- We denote the parents of Si by pa(S;) C {Si,S2, • • . S;_i}; this is the smallest set of nodes for which P(5,-|5i,5 2 ,...,5,-_i) = P(5,-|pa(5,-)). (1) In sigmoid belief networks (Neal, 1992), the conditional distributions attached to each node are based on log- linear models. In particular, the probability that the ith node is activated is given by P(Si = l|pa(S;)) = a '^JijSj + h{ (2) where J 8 j and hi are the weights and biases in the net- work, and a(z) = — ^ (3) where p(v) = J2p(h,v) (8) Figure 1: Sigmoid function u{z) = [1 + e~ z ]~ l . If z is the sum of weighted inputs to node S, then P(S = l|z) = t(z) is the conditional probability that node S is activated. is the sigmoid function shown in Figure 1. In sigmoid belief networks, we have Jij = for Sj (fi pa(5 , 8 ); more- over, Jij = for j > i since the network's structure is that of a directed acyclic graph. The sigmoid function in eq. (2) provides a com- pact parametrization of the conditional probability distributions 1 in eq. (2) used to propagate beliefs. In particular, P(S;|pa(S;)) depends on pa(5 , 8 ) only through a sum of weighted inputs, where the weights may be viewed as the parameters in a logistic regression (Mc- Cullagh & Nelder, 1983). The conditional probability distribution for Si may be summarized as: ex P \\J2i J ij s j + h i) Si\ P(5,-|pa(5,-)) = ^-} J ~^- (4) 1 + exp |£\ JijSj + hA Note that substituting Si = 1 in eq. (4) recovers the result in eq. (2). Combining eqs. (1) and (4), we may write the joint probability distribution over the variables in the network as: P(S) Y[p(s t \MSi)) (5) T-, f ex P (Ej JijSj + hi) SA ] = III — M H~\- (6) i [ 1 + exp \J2j JijSj +hA J The denominator in eq. (6) ensures that the probability distribution is normalized to unity. We now turn to the problem of inference in sigmoid belief networks. Absorbing evidence divides the units in the belief network into two types, visible and hid- den. The visible units (or "evidence nodes") are those for which we have instantiated values; the hidden units are those for which we do not. When there is no possible ambiguity, we will use H and V to denote the subsets of hidden and visible units. Using Bayes' rule, inference is done under the conditional distribution P(H\V) P(H, V) P(V) ■ (7) H J The relation to noisy-OR models is discussed in ap- pendix A. is the likelihood of the evidence V . In principle, the like- lihood may be computed by summing over all 2' H ' config- urations of the hidden units. Unfortunately, this calcula- tion is intractable in large, densely connected networks. This intractability presents a major obstacle to learning parameters for these networks, as nearly all procedures for statistical estimation require frequent estimates of the likelihood. The calculations for exact probabilistic inference are beset by the same difficulties. Unable to compute P(V) or work directly with P(H\V), we will re- sort to an approximation from statistical physics known as mean field theory. 3 Mean Field Theory The mean field approximation appears under a multi- tude of guises in the physics literature; indeed, it is "almost as old as statistical mechanics" (Itzykson & Drouffe, 1991). Let us briefly explain how it acquired its name and why it is so ubiquitous. In the physical models described by Markov networks, the variables Si represent localized magnetic moments (e.g., at the sites of a crystal lattice), and the sums ^. JijSj+hi represent local magnetic fields. Roughly speaking, in certain cases a central limit theorem may be applied to these sums, and a useful approximation is to ignore the fluctuations in these fields and replace them by their mean value — hence the name, "mean field" theory. In some models, this is an excellent approximation; in others, a poor one. Because of its simplicity, however, it is widely used as a first step in understanding many types of physical phe- nomena. Though this explains the philological origins of mean field theory, there are in fact many ways to derive what amounts to the same approximation (Parisi, 1988). In this paper we present the formulation most appropriate for inference and learning in graphical models. In partic- ular, we view mean field theory as a principled method for approximating an intractable graphical model by a tractable one. This is done via a variational principle that chooses the parameters of the tractable model to minimize an entropic measure of error. The basic framework of mean field theory remains the same for directed graphs, though we have found it neces- sary to introduce extra mean field parameters in addition to the usual ones. As in Markov networks, one finds a set of nonlinear equations for the mean field parameters that can be solved by iteration. In practice, we have found this iteration to converge fairly quickly and to scale well to large networks. Let us now return to the problem posed at the end of the last section. There we found that for many be- lief networks, it was intractable to decompose the joint distribution as P(S) = P(H\V)P(V), where P(V) was the likelihood of the evidence V . For the purposes of probabilistic modeling, mean field theory has two main virtues. First, it provides a tractable approximation, Q(H\V) ps P(H\V), to the conditional distributions re- quired for inference. Second, it provides a lower bound on the likelihoods required for learning. Let us first consider the origin of the lower bound. Clearly, for any approximating distribution Q(H\V), we have the equality: InP(V) In Y,P(H,V) H ln£Q(ff|V0 H P(H, V) Q(H\V) (9) (10) To obtain a lower bound, we now apply Jensen's in- equality (Cover & Thomas, 1991), pushing the logarithm through the sum over hidden states and into the expec- tation: It is straightforward to verify that the difference between the left and right hand side of eq. (11) is the Kullback- Leibler divergence (Cover & Thomas, 1991): -Q(H\V) (11) KL(Q\\P) H Q{H\V)\n P(H\V) (12) Thus, the better the approximation to P(H\V), the tighter the bound on InP(V). Anticipating the connection to statistical mechanics, we will refer to Q(H\V) as the mean field distribution. It is natural to divide the calculation of the bound into two components, both of which are particular averages over this approximating distribution. These components are the mean field entropy and energy; the overall bound is given by their difference: InP(V) > J2Q(H\V)lnQ(H\V)) (13) H J2Q(H\V)\nP(H,V) \ H / Both terms have physical interpretations. The first mea- sures the amount of uncertainty in the mean-field dis- tribution and follows the standard definition of entropy. The second measures the average value 2 of — In P(H, V); the name "energy" arises from interpreting the prob- ability distributions in belief networks as Boltzmann distributions at unit temperature. In this the ergy of each network configuration is given (up to a con- stant) by minus the logarithm of its probability under 2 A similar average is performed in the E-step of an EM algorithm (Dempster, Laird, & Rubin, 1977); the difference here is that the average is performed over the mean field dis- tribution, Q(H\V), rather than the true posterior, P(H\V). For a related discussion, see Neal & Hinton (1993). Our terminology is as follows. Let S denote the degrees of freedom in a statistical mechanical system. The energy of the system, £(S), is a real- valued function of these degrees of freedom, and the Boltzmann distribution P(S) -I3£(S) Zs -I3£(S) defines a probability distribution over the possible configu- the Boltzmann distribution. In sigmoid belief networks, the energy has the form lnP(H,V) / J 'J ij ^>i 3j y j 11% Oi (14) E ln 1 + ex P X] Ji i ^i + hi as follows from eq. (6). The first two terms in this equa- tion are familiar from Markov networks with pairwise in- teractions (Hertz, Krogh, & Palmer, 1991); the last term is peculiar to sigmoid belief networks. Note that the overall energy is neither a linear function of the weights nor a polynomial function of the units. This is the price we pay in sigmoid belief networks for identifying P(H \V) as a Boltzmann distribution and the log-likelihood P(V) as its partition function. Note that this identification was made implicitly in the form of eqs. (7) and (8). The bound in eq. (11) is valid for any probability dis- tribution Q(H\V). To make use of it, however, we must choose a distribution that enables us to evaluate the right hand side of eq. (11). Consider the factorized distribu- tion Q(H\V)=l[^(l-^) 1 - s ', (15) ieH in which the binary hidden units {Si}i^n appear as in- dependent Bernoulli variables with adjustable means Hi. A mean field approximation is obtained by substituting the factorized distribution, eq. (15), for the true Boltz- mann distribution, eq. (7). It may seem that this ap- proximation replaces the rich probabilistic dependencies in P(H \V) by an impoverished assumption of complete factorizability. Though this is true to some degree, the reader should keep in mind that the values we choose for {/"ijiei? (and hence the statistics of the hidden units) will depend on the evidence V . The best approximation of the form, eq. (15), is found by choosing the mean values, {/j,i}i^H, that minimize the Kullback-Leibler divergence, A'_L(Q||_P). This is equivalent to minimizing the gap between the true log- likelihood, InP(V), and the lower bound obtained from mean field theory. The mean field bound on the log- likelihood may be calculated by substituting eq. (15) into the right hand side of eq. (11). The result of this calcu- lation is lnP(V) > Y, J H ViVj+Y, h < Hi (16) £0* 1 i Yl ■ JijSj+hi 1 + e^i (17) 'Y^\p,i\nm + (1 - fi{)\n(l -m)} where (•) indicates an expectation value over the mean field distribution, eq. (15). The terms in the first two rations of S. The parameter fi is the inverse temperature; it serves to calibrate the energy scale and will be fixed to unity in our discussion of belief networks. Finally, the sum in the denominator — known as the partition function — ensures that the Boltzmann distribution is normalized to unity. lines of eq. (16) represent the mean field energy, derived from eq. (15); those in the third represent the mean field entropy. In a slight abuse of notation, we have defined mean values //; for the visible units; these of course are set to the instantiated values //; £ {0, 1}. Note that to compute the average energy in the mean field approximation, we must find the expected value of (In [1 + e z ']}, where Z{ = ^\- JijSj + hi is the sum of weighted inputs to the ith unit in the belief network. Un- fortunately, even under the mean field assumption that the hidden units are uncorrelated, this average does not have a simple closed form. This term does not arise in the mean field theory for Markov networks with pair- wise interactions; again, it is peculiar to sigmoid belief networks. In principal, the average may be performed by enu- merating the possible states of pa(5 , 8 ). The result of this calculation, however, would be an extremely unwieldy function of the parameters in the belief network. This re- flects the fact that in general, the sigmoid belief network defined by the weights Jij has an equivalent Markov net- work with Nth order interactions and not pairwise ones. To avoid this complexity, we must develop a mean field theory that works directly on DAGs. How we handle the expected value of (In [1 + e z ']} is what distinguishes our mean field theory from previous ones. Unable to compute this term exactly, we resort to another bound. Note that for any random variable z and any real number £, we have the equality: (ln[l + e*]) = (ln[e^e-^(l + e')]> (18) = Z( z ) + (]n.[e-t* + e^-V*]). (19) We can upper bound the right hand side by applying Jensen's inequality in the opposite direction as before, pulling the logarithm outside the expectation: (ln[l + e z ]) < £(z) + In (e~ (z + e^~^ z \ (20) Setting £ = in eq. (20) gives the standard bound: (ln(l + e z )) < ln(l + e*). A tighter bound (Seung, 1995) can be obtained, however, by allowing non-zero values of £. This is illustrated in Figure 2 for the special case where z is a Gaussian distributed random variable with zero mean and unit variance. The bound in eq. (20) has two useful properties which we state here without proof: (i) the right hand side is a convex function of £; (ii) the value of £ which minimizes this function occurs in the interval £ £ [0, 1]. Thus, provided it is possible to evalu- ate eq. (20) for different values of £, the tightest bound of this form can be found by a simple one-dimensional minimization. The above bound can be put to immediate use by attaching an extra mean field parameter £; to each unit in the belief network. We can then upper bound the intractable terms in the mean field energy by In 1 + e J2, J '3 S 3+ h ' < 6 In ' Z^ JijVj + hi (24) + e (i-e,>, 0.95 0.85 exact 0.75„ L 0.2 0.4 0.6 Figure 2: Bound in eq. (19) for the case where z is nor- mally distributed with zero mean and unit variance. In this case, the exact result is (ln(l + e z )) = 0.806; the bound gives min^ < ln[e~£ -|-e2( 1- £) H = 0.818. The standard bound from Jensen's inequality occurs at £ = and gives 0.974. where Z{ = ^\- JijSj + hi. The expectations inside the logarithm can be evaluated exactly for the factorial dis- tribution, eq. (15); for example, i P -ti*i\ - P ~Uh, TT (, , „. P -tiJi. na fij + Hje~ (22) A similar result holds for (e^ 1- ^^ 1 ). Though these av- erages are tractable, we will tend not to write them out in what follows. The reader, however, should keep in mind that these averages do not present any difficulty; they are simply averages over products of independent random variables, as opposed to sums. Assembling the terms in eqs. (16) and (21) gives a lower bound InP(V) > Cy , A £*. m/j-j -5> i + X] [/■*«' ln ^ ! ' + ( 1 ~ ^i) ln(l - A*i)] ; i on the log-likelihood of the evidence V . So far we have not specified the parameters {ni}i£H an d {&}', in par- ticular, the bound in eq. (23) is valid for any choice of parameters. We naturally seek the values that maxi- mize the right hand side of eq. (23). Suppose we fix the mean values {ni}i^H an d ask for the parameters {£;} that yield the tightest possible bound. Note that the right hand side of eq. (23) does not couple terms with £i that belong to different units in the network. The min- imization over {£;} therefore reduces to N independent minimizations over the interval [0, 1]. These can be done by any number of standard methods (Press, Flannery, Teukolsky, k Vetterling, 1986). To choose the means, we set the gradients of the bound with respect to {ni}i£H equal to zero. To this Figure 3: The Markov blanket of unit Si includes its parents and children, as well as the other parents of its children. Kii ■lA<- ( " „(i-eo* (24) end, let us define the intermediate matrix: d_ where Z{ is the weighted sum of inputs to ith unit. Note that Kij is zero unless Sj is a parent of Sf, in other words, it has the same connectivity as the weight matrix Jij. Within the mean field approximation, Kij measures the parental influence of Sj on Si given the instantiated evidence V . The degree of correlation (positive or nega- tive) is measured relative to the other parents of Si. The matrix elements of K may be evaluated by ex- panding the expectations as in eq. (22); a full derivation is given in appendix B. Setting the gradient dCy/d/J-i equal to zero gives the final mean field equation: A*i J2 \- J iJH + J 3i(H -Zj) + K i (25) where c(-) is the sigmoid function. The argument of the sigmoid function may be viewed as an effective input to the ith unit in the belief network. This effective input is composed of terms from the unit's Markov blanket (Pearl, 1988), shown in Figure 3; in particular, these terms take into account the unit's internal bias, the val- ues of its parents and children, and, through the matrix Kji, the values of its children's other parents. In solving these equations by iteration, the values of the instanti- ated units are propagated throughout the entire network. An analogous propagation of information occurs in exact algorithms (Lauritzen & Spiegelhalter, 1988) to compute likelihoods in belief networks. While the factorized approximation to the true poste- rior is not exact, the mean field equations set the param- eters {ni~}i£H to values which make the approximation as accurate as possible. This in turn translates into the tightest mean field bound on the log-likelihood. The overall procedure for bounding the log-likelihood thus consists of two alternating steps: (i) update {£;} for fixed {/!{}; (ii) update {ni}i^H for fixed {&}. The first step involves N independent minimizations over the in- terval [0, 1]; the second is done by iterating the mean field equations. In practice, the steps are repeated until the mean field bound on the log-likelihood converges 4 to a desired degree of accuracy. Figure 4: Three layer belief network (2x4x6) with top- down propagation of beliefs. To model the images of handwritten digits in section 4, we used 8x24x64 net- works where units in the bottom layer encoded pixel values in 8x8 bitmaps. The quality of the bound depends on two approxi- mations: the complete factorizability of the mean field distribution, eq. (15), and the logarithm bound, eq. (20). How reliable are these approximations in belief net- works? To study this question, we performed numerical experiments on the three layer belief network shown in Figure 4. The advantage of working with such a small network (2x4x6) is that true likelihoods can be com- puted by exact enumeration. We considered the par- ticular event that all the units in the bottom layer were instantiated to zero. For this event, we compared the mean field bound on the likelihood to its true value, ob- tained by enumerating the states in the top two layers. This was done for 10000 random networks whose weights and biases were uniformly distributed between -1 and 1. Figure 5 (left) shows the histogram of the relative er- ror in log likelihood, computed as £y/ln_P(I/) — 1; for these networks, the mean relative error is 1.6%. Figure 5 (right) shows the histogram that results from assuming that all states in the bottom layer occur with equal prob- ability; in this case the relative error was computed as (ln2 -6 )/ In P(V) — 1. For this "uniform" approximation, the root mean square relative error is 22.6%. The large discrepancy between these results suggests that mean field theory can provide a useful lower bound on the likelihood in certain belief networks. Of course, what ultimately matters is the behavior of mean field theory in networks that solve meaningful problems. This is the subject of the next section. 4 Learning One attractive use of sigmoid belief networks is to perform density estimation in high dimensional input spaces. This is a problem in parameter estimation: given a set of patterns over particular units in the belief net- work, find the set of weights Jij and biases hi that assign high probability to these patterns. Clearly, the ability to compute likelihoods lies at the crux of any algorithm for learning the parameters in belief networks. Mean field algorithms provide a strategy for discov- ering appropriate values of Jij and hi without resort to It can be shown that asychronous updates of the mean field parameters lead to monotonic increases in the lower bound (just as in the case of Markov networks). mean field approximation 0.01 0.02 0.03 0.04 0.05 0.06 0.07 relative error in log-likelihood uniform approximation 4500 4000 3500 3000 2500 2000 1500 1000 500 °< 4000 1— 3500 3000 2500 2000 1500 1000 500- -1 -0.5 0.5 1 1.5 relative error in log-likelihood Figure 5: Histograms of relative error in log-likelihood over 10000 randomly generated three layer networks. On top: the relative error from the mean field approxima- tion; on bottom: the relative error if all states in the bottom layer are assumed to occur with equal probabil- ity. The log-likelihood was computed for the event that the all the nodes in the bottom layer were instantiated to zero. Gibbs sampling. Consider, for instance, the following procedure. For each pattern in the training set, solve the mean field equations for {ni,£i} and compute the associated bound on the log-likelihood, Cy . Next, adapt the weights in the belief network by gradient ascent 5 in the mean field bound, A,L Ahi dJij 3Cy dhi (26) (27) where r\ is a suitably chosen learning rate. Finally, cy- cle through the patterns in the training set, maximiz- ing their likelihoods 6 for a fixed number of iterations or until one detects the onset of overfitting (e.g., by cross- validation). The above procedure uses a lower bound on the log- likelihood as a cost function for training belief networks (Hinton, Dayan, Frey, & Neal, 1995). The fact that we have a lower bound on the log-likelihood, rather than an upper bound, is of course crucial to the success of this learning algorithm. Adjusting the weights to maxi- mize this lower bound can affect the true log-likelihood in two ways (see Figure 6). Either the true log-likelihood Expressions for the gradients of Cv are given in the ap- pendix B. Of course, one can also incorporate prior distributions over the weights and biases and maximize an approximation to the log posterior probability of the training set. -~ -___ Irue log-likelihood Figure 6: Relationship between the true log-likelihood and its lower bound during learning. One possibility (at left) is that both increase together. The other is that the true log-likelihood decreases, closing the gap between itself and the bound. The latter can be viewed as a form of regularization. Figure 7: Binary images of handwritten digits: two and five. increases, moving in the same direction as the bound, or the true log-likelihood decreases, closing the gap between these two quantities. For the purposes of maximum like- lihood estimation, the first outcome is clearly desirable; the second, though less desirable, can also be viewed in a positive light. In this case, the mean field approxi- mation is acting as a regularizer, steering the network toward simple, factorial solutions even at the expense of lower likelihood estimates. We tested this algorithm by building a maximum- likelihood classifier for images of handwritten digits. The data consisted of 11000 examples of handwritten digits [0-9] compiled by the U.S. Postal Service Office of Ad- vanced Technology. The examples were preprocessed to produce 8x8 binary images, as shown in Figure 7. For each digit, we divided the available data into a training set with 700 examples and a test set with 400 examples. We then trained a three layer network 7 (see Figure 4) on each digit, sweeping through each training set five times with learning rate r\ = 0.05. The networks had 8 units in the top layers, 24 units in the middle layer, and 64 units in the bottom layer, making them far too large to be treated with exact methods. After training, we classified the digits in each test set by the network that assigned them the highest likeli- hood. Table 1 shows the confusion matrix in which the ijth entry counts the number of times digit i was clas- sified as digit j. There were 184 errors in classification There are many possible architectures that could be cho- sen for the purpose of density estimation; we used layered networks to permit a comparison with previous benchmarks on this data set. 1 2 3 4 5 6 7 8 9 388 2 2 1 3 4 1 393 1 6 2 1 2 376 1 3 4 13 3 2 4 373 12 6 3 4 2 383 1 2 2 10 5 2 1 13 377 2 4 1 6 1 4 2 1 6 386 7 1 388 3 8 8 1 9 1 7 7 1 1 369 4 9 4 8 5 383 Table 1: Confusion matrix for digit classification. The entry in the ith row and jth column counts the number of times that digit i was classified as digit j. algorithm classification error nearest neighbor back-propagation wake-sleep mean field 6.7% 5.6% 4.8% 4.6% Table 2: Classification error rates for the data set of handwritten digits. The first three were reported by Hinton et al (1995). (out of a possible 4000), yielding an overall error rate of 4.6%. Table 2 gives the performance of various other al- gorithms on the same partition of this data set. Table 3 shows the average log-likelihood score of each network on the digits in its test set. (Note that these scores are actually lower bounds.) These scores are normalized so that a network with zero weights and biases (i.e., one in which all 8x8 patterns are equally likely) would receive a score of -1. As expected, digits with relatively sim- ple constructions (e.g., zeros, ones, and sevens) are more easily modeled than the rest. Both measures of performance — error rate and log- likelihood score — are competitive with previously pub- lished results (Hinton, Dayan, Frey, & Neal, 1995) on this data set. The success of the algorithm affirms both the strategy of maximizing a lower bound and the util- ity of the mean field approximation. Though similar results can be obtained via Gibbs sampling, this seems to require considerably more computation than meth- ods based on maximizing a lower bound (Frey, Dayan, & Hinton, 1995). 5 Discussion Endowing networks with probabilistic semantics pro- vides a unified framework for incorporating prior knowl- edge, handling missing data, and performing inference under uncertainty. Probabilistic calculations, however, can quickly become intractable, so it is important to de- velop techniques that approximate probability distribu- tions in a flexible manner. This is especially true for net- works with multilayer architectures and large numbers of hidden units. Exact algorithms and Gibbs sampling methods are not generally practical for such networks; digit log-likelihood score -0.447 1 -0.296 2 -0.636 3 -0.583 4 -0.574 5 -0.565 6 -0.515 7 -0.434 8 -0.569 9 -0.495 all -0.511 Table 3: Normalized log-likelihood score for each net- work on the digits in its test set. To obtain the raw score, multiply by 400 x 64 x In 2. The last row shows the score averaged across all digits. approximations are required. In this paper we have developed a mean field approx- imation for sigmoid belief networks. As a computational tool, our mean field theory has two main virtues: first, it provides a tractable approximation to the conditional distributions required for inference; second, it provides a lower bound on the likelihoods required for learning. The problem of computing exact likelihoods in belief networks is NP-hard (Cooper, 1990); the same is true for approximating likelihoods to within a guaranteed degree of accuracy (Dagum & Luby, 1993). It follows that one cannot establish universal guarantees for the accuracy of the mean field approximation. For certain networks, clearly, the mean field approximation is bound to fail — it cannot capture logical constraints or strong correlations between fluctuating units. Our preliminary results, how- ever, suggest that these worst-case results do not apply to all belief networks. It is worth noting, moreover, that all the above qualifications apply to Markov networks, and that in this domain, mean field methods are already well-established. The idea of bounding the likelihood in sigmoid be- lief networks was introduced in a related architecture known as the Helmholtz machine (Hinton, Dayan, Neal, & Zemel, 1995). The formalism in this paper differs in a number of respects from the Helmholtz machine. Most importantly, it enables one to compute a rigor- ous lower bound on the likelihood. This cannot be said for the wake-sleep algorithm (Frey, Hinton, & Dayan, 1995), which relies on sampling-based methods, or the heuristic approximation of Dayan et al (1995), which does not guarantee a rigorous lower bound. Also, our mean field theory — which takes the place of the "recog- nition model" of the Helmholtz machine — applies gener- ally to sigmoid belief networks with or without layered structure. Moreover, it places no restrictions on the lo- cations of visible units; they may occur anywhere within the network — an important feature for handling prob- lems with missing data. Of course, these advantages are not accrued without extra computational demands and more complicated learning rules. In recent work that builds on the theory presented here, we have begun to relax the assumption of com- plete factorizability in eq. (15). In general, one would expect more sophisticated approximations to the Boltz- mann distribution to yield tighter bounds on the log- likelihood. The challenge here is to find distributions that allow for correlations between hidden units while remaining computationally tractable. By tractable, we mean that the choice of Q(H\V) must enable one to eval- uate (or at least upper bound) the right hand side of eq. (13). Extensions of this kind include mixture mod- els (Jaakkola & Jordan, 1996) and/or partially factor- ized distributions (Saul & Jordan, 1995) that exploit the presence of tractable substructures in the original network. Our approach in this paper has been to work out the simplest mean field theory that is computation- ally tractable, but clearly better results will be obtained by tailoring the approximation to the problem at hand. A Sigmoid versus Noisy-OR The semantics of the sigmoid function are similar, but not identical, to the noisy-OR gates (Pearl, 1988) more commonly found in the belief network literature. Noisy- OR gates use the weights in the network to represent independent causal events. In this case, the probability that unit Si is activated is given by P(5,- = l|pa(5,-)) = l-n( 1 -Py)' (28) where pij is the probability that Sj = 1 causes Si = 1 in the absence of all other causal events. If we define the weights of a noisy-OR belief network by 6ij = — ln(l — Pij), it follows that p(Si\pa(Si)) = p ^OijSj (29) where p(z) = l-e~ z (30) is the noisy-OR gating function. Comparing this to the sigmoid function, eq. (3), we see that both model P(5'i|pa(5'i)) as a monotonically increasing function of a sum of weighted inputs. The main difference is that in noisy-OR networks, the weights 6ij are constrained to be positive by an underlying set of probabilities, pij. Recently, Jaakkola and Jordan (1996b) have developed a mean field approximation for noisy-OR belief networks. B Gradients Here we provide expressions for the gradients that ap- pear in eqs. (24), (26) and (27). As usual, let Z{ = ^2j JijSj +hi denote the sum of inputs into unit Si. Un- der the factorial distribution, eq. (15), we can compute the averages: -i,z, D (i-e,>.\ -e.ft n \ i ~ H+w ■UJi. (31) ,(l-(i)hi n ■ fij ;+/Uj e( 1 -^)^j 3 2) For each unit in the network, let us define the quantity / e (i-e,>.\ ( e -e,^ +e (i-e,>.)' (33) Note that <f>i lies between zero and one. With this defi- nition, we can write the matrix elements in eq. (24) as: Kii (i_^.)(i_ e -e,^) ^.(i_ e (i-e,y.,) 1 — flj + PjC t' J 'J 1 — pj + pj6 i ' l ~ ( ") J <3 . ( 34 ) The gradients in eqs. (26) and (27) are found by similar means. For the weights, we have dC v dJij -(& - w)w + y (1 - (j)i)jiPje ( -' J " Pj + Pj e -ZiJi. (35) 1 - pj + p j e( 1 -t') J 'j Likewise, for the biases, we have dC v dhi ■ Pi - <f>i (36) Finally, we note that one may obtain simpler gradients at the expense of introducing a weaker bound than eq. (20). This can be advantageous when speed of computation is more important than the quality of the bound. All the experiments in this paper used the bound in eq. (20). Acknowledgements We are especially grateful to P. Dayan, G. Hinton, B. Frey, R. Neal, and H. Seung for sharing early versions of their manuscripts and for providing many stimulat- ing discussions about this work. The paper was also improved greatly by the comments of several anony- mous reviewers. To facilitate comparisons with simi- lar methods, the results reported in this paper used im- ages that were preprocessed at the University of Toronto. The authors acknowledge support from NSF grant CDA- 9404932, ONR grant N00014-94-1-0777, ATR Research Laboratories, and Siemens Corporation. References Ackley, D., Hinton, C, & Sejnowski, T. (1985) A learn- ing algorithm for Boltzmann machines. Cognitive Sci- ence, 9, 147-169. Buntine, W. (1994) Operations for learning with graph- ical models. Journal of Artificial Intelligence Research, 2, 159-225. Cooper, G. (1990) Computational complexity of proba- bilistic inference using Bayesian belief networks. Artifi- cial Intelligence, J t 2, 393-405. Cover, T., & Thomas, J. (1991) Elements of Information Theory. New York: John Wiley & Sons. Dagum, P., & Luby, M. (1993) Approximately proba- bilistic reasoning in Bayesian belief networks is NP-hard. Artificial Intelligence, 60, 141-153. Dayan, P., Hinton, G., Neal, R., k Zemel, R. (1995) The Helmholtz machine. Neural Computation, 7, 889-904. Dempster, A., Laird, N., and Rubin, D. (1977) Max- imum likelihood from incomplete data via the EM al- gorithm. Journal of the Royal Statistical Society B39, 1-38. Frey, B., Hinton, G., k Dayan, P. (1995) Does the wake- sleep algorithm learn good density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo (eds). Advances of Neural Information Processing Systems: Proceedings of the 1995 Conference. Geman, S., k Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of im- ages. IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 6, 721-741. Hertz, J., Krogh, A., and Palmer, R. G. (1991) Intro- duction to the Theory of Neural Computation. Redwood City, CA: Addison- Wesley. Hinton, G., Dayan, P., Frey, B., k Neal, R. (1995) The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158-1161. Itzykson, C, k Drouffe, J.M. (1991). Statistical Field Theory. Cambridge: Cambridge University Press. Jaakkola, T., Saul, L., k Jordan, M. (1995) Fast learn- ing by bounding likelihoods in sigmoid-type belief net- works. In D. Touretzky, M. Mozer, and M. Hasselmo (eds). Advances of Neural Information Processing Sys- tems: Proceedings of the 1995 Conference. Jaakkola, T., k Jordan, M. (1996a) Mixture model ap- proximations for belief networks. Manuscript in prepa- ration. Jaakkola, T., k Jordan, M. (1996b) Computing upper and lower bounds on likelihoods in intractable networks. Submitted. Jensen, C. S., Kong, A., k Kjaerulff, U. (1995) Blocking Gibbs sampling in very large probabilistic expert sys- tems. International Journal of Human Computer Stud- ies. Special Issue on Real- World Applications of Uncer- tain Reasoning. Lauritzen, S., k Spiegelhalter, D. (1988). Local com- putations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157-224. McCullagh, P., k Nelder, J. A. (1983) Generalized lin- ear Models. London: Chapman and Hall. Neal, R. (1992) Connectionist learning of belief net- works. Artificial Intelligence, 56, 71-113. Neal, R., k Hinton, G. (1993) A new view of the EM algorithm that justifies incremental and other variants. Submitted for publication. Parisi, G. (1988) Statistical Field Theory. Redwood City, CA: Addison- Wesley. Pearl, J. (1988) Probabilistic Reasoning in Intel Systems. San Mateo, CA: Morgan Kaufmann. lent Peterson, C, k Anderson, J.R. (1987) A mean field the- ory learning algorithm for neural networks. Complex Systems, 1, 995-1019. Press, W. H., Flannery, B. P., Teukolsky, S.A., k Vetter- ling, W. T. (1986) Numerical Recipes. Cambrige: Cam- bridge University Press. Russell, S., Binder, J., Koller, D., k Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Proceedings of IJCAI-95. Saul, L., k Jordan, M. (1995) Exploiting tractable sub- structures in intractable networks. In D. Touretzky, M. Mozer, and M. Hasselmo (eds). Advances of Neural In- formation Processing Systems: Proceedings of the 1995 Conference. Seung, H. (1995). Annealed theories of learning. In J.- H. Oh, C. Kwon, and S. Cho, eds. Neural Networks: The Statistical Mechanics Perspective, Proceedings of the CTP-PRSRI Joint Workshop on Theoretical Physics. Singapore, World Scientific.