Skip to main content

Full text of "mit :: ai :: aim :: AIM-1570"

See other formats


MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 

A.I. Memo No. 1570 August 1996 

Mean Field Theory for Sigmoid Belief Networks 

Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan 

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. 



Abstract 

We develop a mean field theory for sigmoid belief networks based on ideas from statistical mechanics. 
Our mean field theory provides a tractable approximation to the true probability distribution in these 
networks; it also yields a lower bound on the likelihood of evidence. We demonstrate the utility of this 
framework on a benchmark problem in statistical pattern recognition — the classification of handwritten 
digits. 



Copyright © Massachusetts Institute of Technology, 1996 

This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. 
Support for this research was provided in part by the Advanced Research Projects Agency of the Department of Defense 
under Office of Naval Research contract N00014-94-1-0777. The authors were also supported by NSF grant CDA-9404932, 
ATR Research Laboratories, and Siemens Corporation. 



1 Introduction 

Bayesian belief networks (Pearl, 1988; Lauritzen & 
Spiegelhalter, 1988) provide a rich graphical represen- 
tation of probabilistic models. The nodes in these net- 
works represent random variables, while the links rep- 
resent causal influences. These associations endow di- 
rected acyclic graphs (DAGs) with a precise probabilis- 
tic semantics. The ease of interpretation afforded by this 
semantics explains the growing appeal of belief networks, 
now widely used as models of planning, reasoning, and 
uncertainty. 

Inference and learning in belief networks are possible 
insofar as one can efficiently compute (or approximate) 
the likelihood of observed patterns of evidence (Bun- 
tine, 1994; Russell, Binder, Koller, & Kanazawa, 1995). 
There exist provably efficient algorithms for computing 
likelihoods in belief networks with tree or chain-like ar- 
chitectures. In practice, these algorithms also tend to 
perform well on more general sparse networks. However, 
for networks in which nodes have many parents, the ex- 
act algorithms are too slow (Jensen, Kong, & Kjaefulff, 
1995). Indeed, in large networks with dense or layered 
connectivity, exact methods are intractable as they re- 
quire summing over an exponentially large number of 
hidden states. 

One approach to dealing with such networks has been 
to use Gibbs sampling (Pearl, 1988), a stochastic simu- 
lation methodology with roots in statistical mechanics 
(Geman & Geman, 1984). Our approach in this pa- 
per relies on a different tool from statistical mechanics — 
namely, mean field theory (Parisi, 1988). The mean field 
approximation is well known for probabilistic models 
that can be represented as undirected graphs — so-called 
Markov networks. For example, in Boltzmann machines 
(Ackley, Hinton, & Sejnowski, 1985), mean field learn- 
ing rules have been shown to yield tremendous savings 
in time and computation over sampling-based methods 
(Peterson & Anderson, 1987). 

The main motivation for this work was to extend the 
mean field approximation for undirected graphical mod- 
els to their directed counterparts. Since belief networks 
can be transformed to Markov networks, and mean field 
theories for Markov networks are well known, it is natu- 
ral to ask why a new framework is required at all. The 
reason is that probabilistic models which have compact 
representations as DAGs may have unwieldy represen- 
tations as undirected graphs. As we shall see, avoiding 
this complexity and working directly on DAGs requires 
an extension of existing methods. 

In this paper we focus on sigmoid belief networks 
(Neal, 1992), for which the resulting mean field theory 
is most straightforward. These are networks of binary 
random variables whose local conditional distributions 
are based on log-linear models. We develop a mean field 
approximation for these networks and use it to compute 
a lower bound on the likelihood of evidence. Our method 
applies to arbitrary partial instantiations of the variables 
in these networks and makes no restrictions on the net- 
work topology. Note that once a lower bound is available, 
a learning procedure can maximize the lower bound; this 
is useful when the true likelihood itself cannot be com- 



puted efficiently. A similar approximation for models of 
continous random variables is discussed by Jaakkola et 
al (1995). 

The idea of bounding the likelihood in sigmoid belief 
networks was introduced in a related architecture known 
as the Helmholtz machine (Hinton, Dayan, Frey, & Neal 
1995). A fundamental advance of this work was to es- 
tablish a framework for approximation that is especially 
conducive to learning the parameters of layered belief 
networks. The close connection between this idea and 
the mean field approximation from statistical mechan- 
ics, however, was not developed. 

In this paper we hope not only to elucidate this con- 
nection, but also to convey a sense of which approxima- 
tions are likely to generate useful lower bounds while, 
at the same time, remaining analytically tractable. We 
develop here what is perhaps the simplest such approx- 
imation for belief networks, noting that more sophisti- 
cated methods (Jaakkola & Jordan, 1996a; Saul & Jor- 
dan, 1995) are also available. It should be emphasized 
that approximations of some form are required to handle 
the multilayer neural networks used in statistical pattern 
recognition. For these networks, exact algorithms are 
hopelessly intractable; moreover, Gibbs sampling meth- 
ods are impractically slow. 

The organization of this paper is as follows. Section 2 
introduces the problems of inference and learning in sig- 
moid belief networks. Section 3 contains the main contri- 
bution of the paper: a tractable mean field theory. Here 
we present the mean field approximation for sigmoid be- 
lief networks and derive a lower bound on the likelihood 
of instantiated patterns of evidence. Section 4 looks at 
a mean field algorithm for learning the parameters of 
sigmoid belief networks. For this algorithm, we give re- 
sults on a benchmark problem in pattern recognition — 
the classification of handwritten digits. Finally, section 5 
presents our conclusions, as well as future issues for re- 
search. 

2 Sigmoid Belief Networks 

The great virtue of belief networks is that they clearly 
exhibit the conditional dependencies of the underlying 
probability model. Consider a belief network defined 
over binary random variables S = (Si , S2, ■ ■ ■ , £jv)- We 
denote the parents of Si by pa(S;) C {Si,S2, • • . S;_i}; 
this is the smallest set of nodes for which 



P(5,-|5i,5 2 ,...,5,-_i) = P(5,-|pa(5,-)). 



(1) 



In sigmoid belief networks (Neal, 1992), the conditional 
distributions attached to each node are based on log- 
linear models. In particular, the probability that the ith 
node is activated is given by 



P(Si = l|pa(S;)) = a '^JijSj + h{ 



(2) 



where J 8 j and hi are the weights and biases in the net- 
work, and 

a(z) = — ^ (3) 



where 



p(v) = J2p(h,v) 



(8) 



Figure 1: Sigmoid function u{z) = [1 + e~ z ]~ l . If z 
is the sum of weighted inputs to node S, then P(S = 
l|z) = t(z) is the conditional probability that node S is 
activated. 



is the sigmoid function shown in Figure 1. In sigmoid 
belief networks, we have Jij = for Sj (fi pa(5 , 8 ); more- 
over, Jij = for j > i since the network's structure is 
that of a directed acyclic graph. 

The sigmoid function in eq. (2) provides a com- 
pact parametrization of the conditional probability 
distributions 1 in eq. (2) used to propagate beliefs. In 
particular, P(S;|pa(S;)) depends on pa(5 , 8 ) only through 
a sum of weighted inputs, where the weights may be 
viewed as the parameters in a logistic regression (Mc- 
Cullagh & Nelder, 1983). The conditional probability 
distribution for Si may be summarized as: 

ex P \\J2i J ij s j + h i) Si\ 

P(5,-|pa(5,-)) = ^-} J ~^- (4) 

1 + exp |£\ JijSj + hA 

Note that substituting Si = 1 in eq. (4) recovers the 
result in eq. (2). Combining eqs. (1) and (4), we may 
write the joint probability distribution over the variables 
in the network as: 



P(S) 



Y[p(s t \MSi)) 



(5) 



T-, f ex P (Ej JijSj + hi) SA ] 

= III — M H~\- (6) 

i [ 1 + exp \J2j JijSj +hA J 

The denominator in eq. (6) ensures that the probability 
distribution is normalized to unity. 

We now turn to the problem of inference in sigmoid 
belief networks. Absorbing evidence divides the units 
in the belief network into two types, visible and hid- 
den. The visible units (or "evidence nodes") are those 
for which we have instantiated values; the hidden units 
are those for which we do not. When there is no possible 
ambiguity, we will use H and V to denote the subsets of 
hidden and visible units. Using Bayes' rule, inference is 
done under the conditional distribution 



P(H\V) 



P(H, V) 
P(V) ■ 



(7) 



H 



J The relation to noisy-OR models is discussed in ap- 
pendix A. 



is the likelihood of the evidence V . In principle, the like- 
lihood may be computed by summing over all 2' H ' config- 
urations of the hidden units. Unfortunately, this calcula- 
tion is intractable in large, densely connected networks. 
This intractability presents a major obstacle to learning 
parameters for these networks, as nearly all procedures 
for statistical estimation require frequent estimates of 
the likelihood. The calculations for exact probabilistic 
inference are beset by the same difficulties. Unable to 
compute P(V) or work directly with P(H\V), we will re- 
sort to an approximation from statistical physics known 
as mean field theory. 

3 Mean Field Theory 

The mean field approximation appears under a multi- 
tude of guises in the physics literature; indeed, it is 
"almost as old as statistical mechanics" (Itzykson & 
Drouffe, 1991). Let us briefly explain how it acquired 
its name and why it is so ubiquitous. In the physical 
models described by Markov networks, the variables Si 
represent localized magnetic moments (e.g., at the sites 
of a crystal lattice), and the sums ^. JijSj+hi represent 
local magnetic fields. Roughly speaking, in certain cases 
a central limit theorem may be applied to these sums, 
and a useful approximation is to ignore the fluctuations 
in these fields and replace them by their mean value — 
hence the name, "mean field" theory. In some models, 
this is an excellent approximation; in others, a poor one. 
Because of its simplicity, however, it is widely used as a 
first step in understanding many types of physical phe- 
nomena. 

Though this explains the philological origins of mean 
field theory, there are in fact many ways to derive what 
amounts to the same approximation (Parisi, 1988). In 
this paper we present the formulation most appropriate 
for inference and learning in graphical models. In partic- 
ular, we view mean field theory as a principled method 
for approximating an intractable graphical model by a 
tractable one. This is done via a variational principle 
that chooses the parameters of the tractable model to 
minimize an entropic measure of error. 

The basic framework of mean field theory remains the 
same for directed graphs, though we have found it neces- 
sary to introduce extra mean field parameters in addition 
to the usual ones. As in Markov networks, one finds a set 
of nonlinear equations for the mean field parameters that 
can be solved by iteration. In practice, we have found 
this iteration to converge fairly quickly and to scale well 
to large networks. 

Let us now return to the problem posed at the end 
of the last section. There we found that for many be- 
lief networks, it was intractable to decompose the joint 
distribution as P(S) = P(H\V)P(V), where P(V) was 
the likelihood of the evidence V . For the purposes of 
probabilistic modeling, mean field theory has two main 
virtues. First, it provides a tractable approximation, 
Q(H\V) ps P(H\V), to the conditional distributions re- 



quired for inference. Second, it provides a lower bound 
on the likelihoods required for learning. 

Let us first consider the origin of the lower bound. 
Clearly, for any approximating distribution Q(H\V), we 
have the equality: 



InP(V) 



In Y,P(H,V) 



H 



ln£Q(ff|V0 



H 



P(H, V) 



Q(H\V) 



(9) 



(10) 



To obtain a lower bound, we now apply Jensen's in- 
equality (Cover & Thomas, 1991), pushing the logarithm 
through the sum over hidden states and into the expec- 
tation: 

It is straightforward to verify that the difference between 
the left and right hand side of eq. (11) is the Kullback- 
Leibler divergence (Cover & Thomas, 1991): 

-Q(H\V) 



(11) 



KL(Q\\P) 



H 



Q{H\V)\n 



P(H\V) 



(12) 



Thus, the better the approximation to P(H\V), the 
tighter the bound on InP(V). 

Anticipating the connection to statistical mechanics, 
we will refer to Q(H\V) as the mean field distribution. 
It is natural to divide the calculation of the bound into 
two components, both of which are particular averages 
over this approximating distribution. These components 
are the mean field entropy and energy; the overall bound 
is given by their difference: 



InP(V) > 



J2Q(H\V)lnQ(H\V)) (13) 



H 



J2Q(H\V)\nP(H,V) 



\ H / 

Both terms have physical interpretations. The first mea- 
sures the amount of uncertainty in the mean-field dis- 
tribution and follows the standard definition of entropy. 
The second measures the average value 2 of — In P(H, V); 
the name "energy" arises from interpreting the prob- 
ability distributions in belief networks as Boltzmann 



distributions at unit temperature. In this 



the 



ergy of each network configuration is given (up to a con- 
stant) by minus the logarithm of its probability under 



2 A similar average is performed in the E-step of an EM 
algorithm (Dempster, Laird, & Rubin, 1977); the difference 
here is that the average is performed over the mean field dis- 
tribution, Q(H\V), rather than the true posterior, P(H\V). 
For a related discussion, see Neal & Hinton (1993). 

Our terminology is as follows. Let S denote the degrees 
of freedom in a statistical mechanical system. The energy of 
the system, £(S), is a real- valued function of these degrees 
of freedom, and the Boltzmann distribution 



P(S) 



-I3£(S) 



Zs 



-I3£(S) 



defines a probability distribution over the possible configu- 



the Boltzmann distribution. In sigmoid belief networks, 
the energy has the form 



lnP(H,V) 



/ J 'J ij ^>i 3j y j 11% Oi 



(14) 



E ln 



1 + ex P X] Ji i ^i + hi 



as follows from eq. (6). The first two terms in this equa- 
tion are familiar from Markov networks with pairwise in- 
teractions (Hertz, Krogh, & Palmer, 1991); the last term 
is peculiar to sigmoid belief networks. Note that the 
overall energy is neither a linear function of the weights 
nor a polynomial function of the units. This is the price 
we pay in sigmoid belief networks for identifying P(H \V) 
as a Boltzmann distribution and the log-likelihood P(V) 
as its partition function. Note that this identification 
was made implicitly in the form of eqs. (7) and (8). 

The bound in eq. (11) is valid for any probability dis- 
tribution Q(H\V). To make use of it, however, we must 
choose a distribution that enables us to evaluate the right 
hand side of eq. (11). Consider the factorized distribu- 
tion 

Q(H\V)=l[^(l-^) 1 - s ', (15) 

ieH 
in which the binary hidden units {Si}i^n appear as in- 
dependent Bernoulli variables with adjustable means Hi. 
A mean field approximation is obtained by substituting 
the factorized distribution, eq. (15), for the true Boltz- 
mann distribution, eq. (7). It may seem that this ap- 
proximation replaces the rich probabilistic dependencies 
in P(H \V) by an impoverished assumption of complete 
factorizability. Though this is true to some degree, the 
reader should keep in mind that the values we choose for 
{/"ijiei? (and hence the statistics of the hidden units) 
will depend on the evidence V . 

The best approximation of the form, eq. (15), is found 
by choosing the mean values, {/j,i}i^H, that minimize 
the Kullback-Leibler divergence, A'_L(Q||_P). This is 
equivalent to minimizing the gap between the true log- 
likelihood, InP(V), and the lower bound obtained from 
mean field theory. The mean field bound on the log- 
likelihood may be calculated by substituting eq. (15) into 
the right hand side of eq. (11). The result of this calcu- 
lation is 



lnP(V) > Y, J H ViVj+Y, h < 



Hi 



(16) 



£0* 



1 i Yl ■ JijSj+hi 
1 + e^i 



(17) 



'Y^\p,i\nm + (1 - fi{)\n(l -m)} 



where (•) indicates an expectation value over the mean 
field distribution, eq. (15). The terms in the first two 



rations of S. The parameter fi is the inverse temperature; it 
serves to calibrate the energy scale and will be fixed to unity 
in our discussion of belief networks. Finally, the sum in the 
denominator — known as the partition function — ensures that 
the Boltzmann distribution is normalized to unity. 



lines of eq. (16) represent the mean field energy, derived 
from eq. (15); those in the third represent the mean field 
entropy. In a slight abuse of notation, we have defined 
mean values //; for the visible units; these of course are 
set to the instantiated values //; £ {0, 1}. 

Note that to compute the average energy in the mean 
field approximation, we must find the expected value 
of (In [1 + e z ']}, where Z{ = ^\- JijSj + hi is the sum of 
weighted inputs to the ith unit in the belief network. Un- 
fortunately, even under the mean field assumption that 
the hidden units are uncorrelated, this average does not 
have a simple closed form. This term does not arise in 
the mean field theory for Markov networks with pair- 
wise interactions; again, it is peculiar to sigmoid belief 
networks. 

In principal, the average may be performed by enu- 
merating the possible states of pa(5 , 8 ). The result of this 
calculation, however, would be an extremely unwieldy 
function of the parameters in the belief network. This re- 
flects the fact that in general, the sigmoid belief network 
defined by the weights Jij has an equivalent Markov net- 
work with Nth order interactions and not pairwise ones. 
To avoid this complexity, we must develop a mean field 
theory that works directly on DAGs. 

How we handle the expected value of (In [1 + e z ']} is 
what distinguishes our mean field theory from previous 
ones. Unable to compute this term exactly, we resort 
to another bound. Note that for any random variable z 
and any real number £, we have the equality: 

(ln[l + e*]) = (ln[e^e-^(l + e')]> (18) 

= Z( z ) + (]n.[e-t* + e^-V*]). (19) 

We can upper bound the right hand side by applying 
Jensen's inequality in the opposite direction as before, 
pulling the logarithm outside the expectation: 



(ln[l + e z ]) < £(z) + In (e~ (z + e^~^ z \ 



(20) 



Setting £ = in eq. (20) gives the standard bound: 
(ln(l + e z )) < ln(l + e*). A tighter bound (Seung, 1995) 
can be obtained, however, by allowing non-zero values 
of £. This is illustrated in Figure 2 for the special case 
where z is a Gaussian distributed random variable with 
zero mean and unit variance. The bound in eq. (20) has 
two useful properties which we state here without proof: 
(i) the right hand side is a convex function of £; (ii) the 
value of £ which minimizes this function occurs in the 
interval £ £ [0, 1]. Thus, provided it is possible to evalu- 
ate eq. (20) for different values of £, the tightest bound 
of this form can be found by a simple one-dimensional 
minimization. 

The above bound can be put to immediate use by 
attaching an extra mean field parameter £; to each unit 
in the belief network. We can then upper bound the 
intractable terms in the mean field energy by 



In 



1 + e 



J2, J '3 S 3+ h ' 



< 6 



In ' 



Z^ 



JijVj + hi (24) 



+ e 



(i-e,>, 



0.95 




0.85 



exact 



0.75„ L 



0.2 



0.4 



0.6 



Figure 2: Bound in eq. (19) for the case where z is nor- 
mally distributed with zero mean and unit variance. In 
this case, the exact result is (ln(l + e z )) = 0.806; the 

bound gives min^ < ln[e~£ -|-e2( 1- £) H = 0.818. The 

standard bound from Jensen's inequality occurs at £ = 
and gives 0.974. 

where Z{ = ^\- JijSj + hi. The expectations inside the 
logarithm can be evaluated exactly for the factorial dis- 
tribution, eq. (15); for example, 

i P -ti*i\ - P ~Uh, TT (, , „. P -tiJi. 



na 



fij + Hje~ 



(22) 



A similar result holds for (e^ 1- ^^ 1 ). Though these av- 
erages are tractable, we will tend not to write them out 
in what follows. The reader, however, should keep in 
mind that these averages do not present any difficulty; 
they are simply averages over products of independent 
random variables, as opposed to sums. 

Assembling the terms in eqs. (16) and (21) gives a 
lower bound InP(V) > Cy , 



A 



£*. 



m/j-j 




-5> 

i 

+ X] [/■*«' ln ^ ! ' + ( 1 ~ ^i) ln(l - A*i)] ; 

i 

on the log-likelihood of the evidence V . So far we have 
not specified the parameters {ni}i£H an d {&}', in par- 
ticular, the bound in eq. (23) is valid for any choice of 
parameters. We naturally seek the values that maxi- 
mize the right hand side of eq. (23). Suppose we fix the 
mean values {ni}i^H an d ask for the parameters {£;} 
that yield the tightest possible bound. Note that the 
right hand side of eq. (23) does not couple terms with £i 
that belong to different units in the network. The min- 
imization over {£;} therefore reduces to N independent 
minimizations over the interval [0, 1]. These can be done 
by any number of standard methods (Press, Flannery, 
Teukolsky, k Vetterling, 1986). 

To choose the means, we set the gradients of the 
bound with respect to {ni}i£H equal to zero. To this 




Figure 3: The Markov blanket of unit Si includes its 
parents and children, as well as the other parents of its 
children. 



Kii 



■lA<- ( " 



„(i-eo* 



(24) 



end, let us define the intermediate matrix: 
d_ 

where Z{ is the weighted sum of inputs to ith unit. Note 
that Kij is zero unless Sj is a parent of Sf, in other 
words, it has the same connectivity as the weight matrix 
Jij. Within the mean field approximation, Kij measures 
the parental influence of Sj on Si given the instantiated 
evidence V . The degree of correlation (positive or nega- 
tive) is measured relative to the other parents of Si. 

The matrix elements of K may be evaluated by ex- 
panding the expectations as in eq. (22); a full derivation 
is given in appendix B. Setting the gradient dCy/d/J-i 
equal to zero gives the final mean field equation: 



A*i 



J2 \- J iJH + J 3i(H -Zj) + K i 



(25) 



where c(-) is the sigmoid function. The argument of the 
sigmoid function may be viewed as an effective input to 
the ith unit in the belief network. This effective input 
is composed of terms from the unit's Markov blanket 
(Pearl, 1988), shown in Figure 3; in particular, these 
terms take into account the unit's internal bias, the val- 
ues of its parents and children, and, through the matrix 
Kji, the values of its children's other parents. In solving 
these equations by iteration, the values of the instanti- 
ated units are propagated throughout the entire network. 
An analogous propagation of information occurs in exact 
algorithms (Lauritzen & Spiegelhalter, 1988) to compute 
likelihoods in belief networks. 

While the factorized approximation to the true poste- 
rior is not exact, the mean field equations set the param- 
eters {ni~}i£H to values which make the approximation 
as accurate as possible. This in turn translates into the 
tightest mean field bound on the log-likelihood. The 
overall procedure for bounding the log-likelihood thus 
consists of two alternating steps: (i) update {£;} for 
fixed {/!{}; (ii) update {ni}i^H for fixed {&}. The first 
step involves N independent minimizations over the in- 
terval [0, 1]; the second is done by iterating the mean 
field equations. In practice, the steps are repeated until 
the mean field bound on the log-likelihood converges 4 to 
a desired degree of accuracy. 




Figure 4: Three layer belief network (2x4x6) with top- 
down propagation of beliefs. To model the images of 
handwritten digits in section 4, we used 8x24x64 net- 
works where units in the bottom layer encoded pixel 
values in 8x8 bitmaps. 



The quality of the bound depends on two approxi- 
mations: the complete factorizability of the mean field 
distribution, eq. (15), and the logarithm bound, eq. (20). 
How reliable are these approximations in belief net- 
works? To study this question, we performed numerical 
experiments on the three layer belief network shown in 
Figure 4. The advantage of working with such a small 
network (2x4x6) is that true likelihoods can be com- 
puted by exact enumeration. We considered the par- 
ticular event that all the units in the bottom layer were 
instantiated to zero. For this event, we compared the 
mean field bound on the likelihood to its true value, ob- 
tained by enumerating the states in the top two layers. 
This was done for 10000 random networks whose weights 
and biases were uniformly distributed between -1 and 1. 
Figure 5 (left) shows the histogram of the relative er- 
ror in log likelihood, computed as £y/ln_P(I/) — 1; for 
these networks, the mean relative error is 1.6%. Figure 5 
(right) shows the histogram that results from assuming 
that all states in the bottom layer occur with equal prob- 
ability; in this case the relative error was computed as 
(ln2 -6 )/ In P(V) — 1. For this "uniform" approximation, 
the root mean square relative error is 22.6%. The large 
discrepancy between these results suggests that mean 
field theory can provide a useful lower bound on the 
likelihood in certain belief networks. Of course, what 
ultimately matters is the behavior of mean field theory 
in networks that solve meaningful problems. This is the 
subject of the next section. 

4 Learning 

One attractive use of sigmoid belief networks is to 
perform density estimation in high dimensional input 
spaces. This is a problem in parameter estimation: given 
a set of patterns over particular units in the belief net- 
work, find the set of weights Jij and biases hi that assign 
high probability to these patterns. Clearly, the ability 
to compute likelihoods lies at the crux of any algorithm 
for learning the parameters in belief networks. 

Mean field algorithms provide a strategy for discov- 
ering appropriate values of Jij and hi without resort to 



It can be shown that asychronous updates of the mean 



field parameters lead to monotonic increases in the lower 
bound (just as in the case of Markov networks). 



mean field approximation 



0.01 0.02 0.03 0.04 0.05 0.06 0.07 
relative error in log-likelihood 



uniform approximation 



4500 
4000 
3500 
3000 
2500 
2000 
1500 
1000 
500 

°< 

4000 1— 
3500 
3000 
2500 
2000 
1500 
1000 
500- 

-1 -0.5 0.5 1 1.5 

relative error in log-likelihood 

Figure 5: Histograms of relative error in log-likelihood 
over 10000 randomly generated three layer networks. On 
top: the relative error from the mean field approxima- 
tion; on bottom: the relative error if all states in the 
bottom layer are assumed to occur with equal probabil- 
ity. The log-likelihood was computed for the event that 
the all the nodes in the bottom layer were instantiated 
to zero. 



Gibbs sampling. Consider, for instance, the following 
procedure. For each pattern in the training set, solve 
the mean field equations for {ni,£i} and compute the 
associated bound on the log-likelihood, Cy . Next, adapt 
the weights in the belief network by gradient ascent 5 in 
the mean field bound, 



A,L 



Ahi 



dJij 

3Cy 

dhi 



(26) 
(27) 



where r\ is a suitably chosen learning rate. Finally, cy- 
cle through the patterns in the training set, maximiz- 
ing their likelihoods 6 for a fixed number of iterations or 
until one detects the onset of overfitting (e.g., by cross- 
validation). 

The above procedure uses a lower bound on the log- 
likelihood as a cost function for training belief networks 
(Hinton, Dayan, Frey, & Neal, 1995). The fact that we 
have a lower bound on the log-likelihood, rather than 
an upper bound, is of course crucial to the success of 
this learning algorithm. Adjusting the weights to maxi- 
mize this lower bound can affect the true log-likelihood 
in two ways (see Figure 6). Either the true log-likelihood 



Expressions for the gradients of Cv are given in the ap- 
pendix B. 

Of course, one can also incorporate prior distributions 
over the weights and biases and maximize an approximation 
to the log posterior probability of the training set. 




-~ -___ Irue log-likelihood 




Figure 6: Relationship between the true log-likelihood 
and its lower bound during learning. One possibility (at 
left) is that both increase together. The other is that 
the true log-likelihood decreases, closing the gap between 
itself and the bound. The latter can be viewed as a form 
of regularization. 


































































































































Figure 7: Binary images of handwritten digits: two and 
five. 



increases, moving in the same direction as the bound, or 
the true log-likelihood decreases, closing the gap between 
these two quantities. For the purposes of maximum like- 
lihood estimation, the first outcome is clearly desirable; 
the second, though less desirable, can also be viewed in 
a positive light. In this case, the mean field approxi- 
mation is acting as a regularizer, steering the network 
toward simple, factorial solutions even at the expense of 
lower likelihood estimates. 

We tested this algorithm by building a maximum- 
likelihood classifier for images of handwritten digits. The 
data consisted of 11000 examples of handwritten digits 
[0-9] compiled by the U.S. Postal Service Office of Ad- 
vanced Technology. The examples were preprocessed to 
produce 8x8 binary images, as shown in Figure 7. For 
each digit, we divided the available data into a training 
set with 700 examples and a test set with 400 examples. 
We then trained a three layer network 7 (see Figure 4) on 
each digit, sweeping through each training set five times 
with learning rate r\ = 0.05. The networks had 8 units 
in the top layers, 24 units in the middle layer, and 64 
units in the bottom layer, making them far too large to 
be treated with exact methods. 

After training, we classified the digits in each test set 
by the network that assigned them the highest likeli- 
hood. Table 1 shows the confusion matrix in which the 
ijth entry counts the number of times digit i was clas- 
sified as digit j. There were 184 errors in classification 



There are many possible architectures that could be cho- 
sen for the purpose of density estimation; we used layered 
networks to permit a comparison with previous benchmarks 
on this data set. 








1 


2 


3 


4 


5 


6 


7 


8 


9 





388 


2 


2 





1 


3 








4 





1 





393 











1 








6 





2 


1 


2 


376 


1 


3 





4 





13 





3 





2 


4 


373 





12 








6 


3 


4 








2 





383 





1 


2 


2 


10 


5 





2 


1 


13 





377 


2 





4 


1 


6 


1 


4 


2 





1 


6 


386 











7 





1 

















388 


3 


8 


8 


1 


9 


1 


7 





7 


1 


1 


369 


4 


9 





4 

















8 


5 


383 



Table 1: Confusion matrix for digit classification. The 
entry in the ith row and jth column counts the number 
of times that digit i was classified as digit j. 



algorithm 


classification error 


nearest neighbor 

back-propagation 

wake-sleep 

mean field 


6.7% 
5.6% 
4.8% 
4.6% 



Table 2: Classification error rates for the data set of 
handwritten digits. The first three were reported by 
Hinton et al (1995). 



(out of a possible 4000), yielding an overall error rate of 
4.6%. Table 2 gives the performance of various other al- 
gorithms on the same partition of this data set. Table 3 
shows the average log-likelihood score of each network 
on the digits in its test set. (Note that these scores are 
actually lower bounds.) These scores are normalized so 
that a network with zero weights and biases (i.e., one in 
which all 8x8 patterns are equally likely) would receive 
a score of -1. As expected, digits with relatively sim- 
ple constructions (e.g., zeros, ones, and sevens) are more 
easily modeled than the rest. 

Both measures of performance — error rate and log- 
likelihood score — are competitive with previously pub- 
lished results (Hinton, Dayan, Frey, & Neal, 1995) on 
this data set. The success of the algorithm affirms both 
the strategy of maximizing a lower bound and the util- 
ity of the mean field approximation. Though similar 
results can be obtained via Gibbs sampling, this seems 
to require considerably more computation than meth- 
ods based on maximizing a lower bound (Frey, Dayan, 
& Hinton, 1995). 

5 Discussion 

Endowing networks with probabilistic semantics pro- 
vides a unified framework for incorporating prior knowl- 
edge, handling missing data, and performing inference 
under uncertainty. Probabilistic calculations, however, 
can quickly become intractable, so it is important to de- 
velop techniques that approximate probability distribu- 
tions in a flexible manner. This is especially true for net- 
works with multilayer architectures and large numbers 
of hidden units. Exact algorithms and Gibbs sampling 
methods are not generally practical for such networks; 



digit 


log-likelihood score 





-0.447 


1 


-0.296 


2 


-0.636 


3 


-0.583 


4 


-0.574 


5 


-0.565 


6 


-0.515 


7 


-0.434 


8 


-0.569 


9 


-0.495 


all 


-0.511 



Table 3: Normalized log-likelihood score for each net- 
work on the digits in its test set. To obtain the raw 
score, multiply by 400 x 64 x In 2. The last row shows 
the score averaged across all digits. 



approximations are required. 

In this paper we have developed a mean field approx- 
imation for sigmoid belief networks. As a computational 
tool, our mean field theory has two main virtues: first, 
it provides a tractable approximation to the conditional 
distributions required for inference; second, it provides 
a lower bound on the likelihoods required for learning. 

The problem of computing exact likelihoods in belief 
networks is NP-hard (Cooper, 1990); the same is true for 
approximating likelihoods to within a guaranteed degree 
of accuracy (Dagum & Luby, 1993). It follows that one 
cannot establish universal guarantees for the accuracy 
of the mean field approximation. For certain networks, 
clearly, the mean field approximation is bound to fail — it 
cannot capture logical constraints or strong correlations 
between fluctuating units. Our preliminary results, how- 
ever, suggest that these worst-case results do not apply 
to all belief networks. It is worth noting, moreover, that 
all the above qualifications apply to Markov networks, 
and that in this domain, mean field methods are already 
well-established. 

The idea of bounding the likelihood in sigmoid be- 
lief networks was introduced in a related architecture 
known as the Helmholtz machine (Hinton, Dayan, Neal, 
& Zemel, 1995). The formalism in this paper differs 
in a number of respects from the Helmholtz machine. 
Most importantly, it enables one to compute a rigor- 
ous lower bound on the likelihood. This cannot be said 
for the wake-sleep algorithm (Frey, Hinton, & Dayan, 
1995), which relies on sampling-based methods, or the 
heuristic approximation of Dayan et al (1995), which 
does not guarantee a rigorous lower bound. Also, our 
mean field theory — which takes the place of the "recog- 
nition model" of the Helmholtz machine — applies gener- 
ally to sigmoid belief networks with or without layered 
structure. Moreover, it places no restrictions on the lo- 
cations of visible units; they may occur anywhere within 
the network — an important feature for handling prob- 
lems with missing data. Of course, these advantages are 
not accrued without extra computational demands and 
more complicated learning rules. 

In recent work that builds on the theory presented 



here, we have begun to relax the assumption of com- 
plete factorizability in eq. (15). In general, one would 
expect more sophisticated approximations to the Boltz- 
mann distribution to yield tighter bounds on the log- 
likelihood. The challenge here is to find distributions 
that allow for correlations between hidden units while 
remaining computationally tractable. By tractable, we 
mean that the choice of Q(H\V) must enable one to eval- 
uate (or at least upper bound) the right hand side of 
eq. (13). Extensions of this kind include mixture mod- 
els (Jaakkola & Jordan, 1996) and/or partially factor- 
ized distributions (Saul & Jordan, 1995) that exploit 
the presence of tractable substructures in the original 
network. Our approach in this paper has been to work 
out the simplest mean field theory that is computation- 
ally tractable, but clearly better results will be obtained 
by tailoring the approximation to the problem at hand. 

A Sigmoid versus Noisy-OR 

The semantics of the sigmoid function are similar, but 
not identical, to the noisy-OR gates (Pearl, 1988) more 
commonly found in the belief network literature. Noisy- 
OR gates use the weights in the network to represent 
independent causal events. In this case, the probability 
that unit Si is activated is given by 



P(5,- = l|pa(5,-)) = l-n( 1 -Py)' 



(28) 



where pij is the probability that Sj = 1 causes Si = 1 in 
the absence of all other causal events. If we define the 
weights of a noisy-OR belief network by 6ij = — ln(l — 
Pij), it follows that 



p(Si\pa(Si)) = p ^OijSj 



(29) 



where 

p(z) = l-e~ z (30) 

is the noisy-OR gating function. Comparing this to 
the sigmoid function, eq. (3), we see that both model 
P(5'i|pa(5'i)) as a monotonically increasing function of 
a sum of weighted inputs. The main difference is that 
in noisy-OR networks, the weights 6ij are constrained 
to be positive by an underlying set of probabilities, pij. 
Recently, Jaakkola and Jordan (1996b) have developed a 
mean field approximation for noisy-OR belief networks. 

B Gradients 

Here we provide expressions for the gradients that ap- 
pear in eqs. (24), (26) and (27). As usual, let Z{ = 
^2j JijSj +hi denote the sum of inputs into unit Si. Un- 
der the factorial distribution, eq. (15), we can compute 
the averages: 



-i,z, 



D (i-e,>.\ 



-e.ft 



n \ i ~ H+w 



■UJi. 



(31) 



,(l-(i)hi 



n 



■ fij 



;+/Uj e( 1 -^)^j 3 2) 



For each unit in the network, let us define the quantity 

/ e (i-e,>.\ 



( e -e,^ +e (i-e,>.)' 



(33) 



Note that <f>i lies between zero and one. With this defi- 
nition, we can write the matrix elements in eq. (24) as: 



Kii 



(i_^.)(i_ e -e,^) ^.(i_ e (i-e,y.,) 



1 — flj + PjC t' J 'J 1 — pj + pj6 i ' l ~ ( ") J <3 



. ( 34 ) 

The gradients in eqs. (26) and (27) are found by similar 
means. For the weights, we have 



dC v 
dJij 



-(& - w)w + y 



(1 - (j)i)jiPje ( -' J " 



Pj + Pj e 



-ZiJi. 



(35) 



1 - pj + p j e( 1 -t') J 'j 
Likewise, for the biases, we have 
dC v 



dhi 



■ Pi - <f>i 



(36) 



Finally, we note that one may obtain simpler gradients at 
the expense of introducing a weaker bound than eq. (20). 
This can be advantageous when speed of computation is 
more important than the quality of the bound. All the 
experiments in this paper used the bound in eq. (20). 

Acknowledgements 

We are especially grateful to P. Dayan, G. Hinton, B. 
Frey, R. Neal, and H. Seung for sharing early versions 
of their manuscripts and for providing many stimulat- 
ing discussions about this work. The paper was also 
improved greatly by the comments of several anony- 
mous reviewers. To facilitate comparisons with simi- 
lar methods, the results reported in this paper used im- 
ages that were preprocessed at the University of Toronto. 
The authors acknowledge support from NSF grant CDA- 
9404932, ONR grant N00014-94-1-0777, ATR Research 
Laboratories, and Siemens Corporation. 

References 

Ackley, D., Hinton, C, & Sejnowski, T. (1985) A learn- 
ing algorithm for Boltzmann machines. Cognitive Sci- 
ence, 9, 147-169. 

Buntine, W. (1994) Operations for learning with graph- 
ical models. Journal of Artificial Intelligence Research, 
2, 159-225. 

Cooper, G. (1990) Computational complexity of proba- 
bilistic inference using Bayesian belief networks. Artifi- 
cial Intelligence, J t 2, 393-405. 

Cover, T., & Thomas, J. (1991) Elements of Information 
Theory. New York: John Wiley & Sons. 

Dagum, P., & Luby, M. (1993) Approximately proba- 
bilistic reasoning in Bayesian belief networks is NP-hard. 
Artificial Intelligence, 60, 141-153. 



Dayan, P., Hinton, G., Neal, R., k Zemel, R. (1995) The 
Helmholtz machine. Neural Computation, 7, 889-904. 

Dempster, A., Laird, N., and Rubin, D. (1977) Max- 
imum likelihood from incomplete data via the EM al- 
gorithm. Journal of the Royal Statistical Society B39, 
1-38. 

Frey, B., Hinton, G., k Dayan, P. (1995) Does the wake- 
sleep algorithm learn good density estimators? In D. 
Touretzky, M. Mozer, and M. Hasselmo (eds). Advances 
of Neural Information Processing Systems: Proceedings 
of the 1995 Conference. 

Geman, S., k Geman, D. (1984) Stochastic relaxation, 
Gibbs distributions, and the Bayesian restoration of im- 
ages. IEEE Transactions on Pattern Analysis and Ma- 
chine Intelligence, 6, 721-741. 

Hertz, J., Krogh, A., and Palmer, R. G. (1991) Intro- 
duction to the Theory of Neural Computation. Redwood 
City, CA: Addison- Wesley. 

Hinton, G., Dayan, P., Frey, B., k Neal, R. (1995) The 
wake-sleep algorithm for unsupervised neural networks. 
Science, 268, 1158-1161. 

Itzykson, C, k Drouffe, J.M. (1991). Statistical Field 
Theory. Cambridge: Cambridge University Press. 

Jaakkola, T., Saul, L., k Jordan, M. (1995) Fast learn- 
ing by bounding likelihoods in sigmoid-type belief net- 
works. In D. Touretzky, M. Mozer, and M. Hasselmo 
(eds). Advances of Neural Information Processing Sys- 
tems: Proceedings of the 1995 Conference. 

Jaakkola, T., k Jordan, M. (1996a) Mixture model ap- 
proximations for belief networks. Manuscript in prepa- 
ration. 

Jaakkola, T., k Jordan, M. (1996b) Computing upper 
and lower bounds on likelihoods in intractable networks. 
Submitted. 

Jensen, C. S., Kong, A., k Kjaerulff, U. (1995) Blocking 
Gibbs sampling in very large probabilistic expert sys- 
tems. International Journal of Human Computer Stud- 
ies. Special Issue on Real- World Applications of Uncer- 
tain Reasoning. 

Lauritzen, S., k Spiegelhalter, D. (1988). Local com- 
putations with probabilities on graphical structures and 
their application to expert systems. Journal of the Royal 
Statistical Society B, 50, 157-224. 

McCullagh, P., k Nelder, J. A. (1983) Generalized lin- 
ear Models. London: Chapman and Hall. 

Neal, R. (1992) Connectionist learning of belief net- 
works. Artificial Intelligence, 56, 71-113. 

Neal, R., k Hinton, G. (1993) A new view of the EM 
algorithm that justifies incremental and other variants. 
Submitted for publication. 

Parisi, G. (1988) Statistical Field Theory. Redwood City, 



CA: Addison- Wesley. 

Pearl, J. (1988) Probabilistic Reasoning in Intel 
Systems. San Mateo, CA: Morgan Kaufmann. 



lent 



Peterson, C, k Anderson, J.R. (1987) A mean field the- 
ory learning algorithm for neural networks. Complex 
Systems, 1, 995-1019. 

Press, W. H., Flannery, B. P., Teukolsky, S.A., k Vetter- 
ling, W. T. (1986) Numerical Recipes. Cambrige: Cam- 
bridge University Press. 

Russell, S., Binder, J., Koller, D., k Kanazawa, K. 
(1995). Local learning in probabilistic networks with 
hidden variables. In Proceedings of IJCAI-95. 

Saul, L., k Jordan, M. (1995) Exploiting tractable sub- 
structures in intractable networks. In D. Touretzky, M. 
Mozer, and M. Hasselmo (eds). Advances of Neural In- 
formation Processing Systems: Proceedings of the 1995 
Conference. 

Seung, H. (1995). Annealed theories of learning. In J.- 
H. Oh, C. Kwon, and S. Cho, eds. Neural Networks: 
The Statistical Mechanics Perspective, Proceedings of 
the CTP-PRSRI Joint Workshop on Theoretical Physics. 
Singapore, World Scientific.