
MASSACHUSETTS INSTITUTE OF TECHNOLOGY 
ARTIFICIAL INTELLIGENCE LABORATORY 

and 

CENTER FOR BIOLOGICAL INFORMATION PROCESSING 

WHITAKER COLLEGE 

A.I. Memo No. 1291 April 1992 

C.B.I.P. Paper No. 

A Connection between GRBF and MLP 

Minoru Maruyama, Federico Girosi and Tomaso Poggio 

Abstract 

Both multilayer perceptrons (MLP) and Generalized Radial Basis Functions (GRBF) have good approximation properties, theoretically and experimentally. Are they related? The main point of this paper is to show that for normalized inputs, multilayer perceptron networks are radial function networks (albeit with a non-standard radial function). This provides an interpretation of the weights w as centers t of the radial function network, and therefore as equivalent to templates. This insight may be useful for practical applications, including better initialization procedures for MLP. In the remainder of the paper, we discuss the relation between the radial function that corresponds to the sigmoid for normalized inputs and well-behaved radial basis functions, such as the Gaussian. In particular, we observe that the radial function associated with the sigmoid is a good approximation to Gaussian basis functions for a range of values of the bias parameter. The implication is that an MLP network can always simulate a Gaussian GRBF network (with the same number of units but fewer parameters); the converse is true only for certain values of the bias parameter. Numerical experiments indicate that this constraint is not always satisfied in practice by MLP networks trained with backpropagation. Multiscale GRBF networks, on the other hand, can approximate MLP networks with a similar number of parameters.

© Massachusetts Institute of Technology, 1992 

This paper describes research done within the Center for Biological Information Processing, in the Department of Brain and Cognitive Sciences, and at the Artificial Intelligence Laboratory. This research is sponsored by a grant from the Office of Naval Research (ONR), Cognitive and Neural Sciences Division; by the Artificial Intelligence Center of Hughes Aircraft Corporation; by the Alfred P. Sloan Foundation; and by the National Science Foundation. Support for the A.I. Laboratory's artificial intelligence research is provided by the Advanced Research Projects Agency of the Department of Defense under Army contract DACA76-85-C-0010, and in part by ONR contract N00014-85-K-0124.



1 Introduction 

Techniques and networks for learning from examples may be considered as methods for approximating multivariate functions from sparse data (the "examples"). In recent papers we have developed one such technique that we have called regularization networks, of which Radial Basis Functions are a special case and Hyper Basis Functions are the most general form (see Poggio and Girosi, 1990, and references therein, for a review). The underlying theory is quite well developed and understood: for instance, the role of the hidden units in the corresponding network (see Fig. (1)) is easily explained. The method has also been demonstrated to work well in a number of practical applications. Another technique, which is extremely popular, is associated with multilayer perceptrons, typically used in conjunction with a version of gradient descent (for estimating the parameters) called backpropagation. In this paper, we will consider multilayer perceptrons with one hidden layer and linear output units. These networks have been used successfully in many cases of learning from examples. The underlying theory is less developed, though a few theoretical results have been obtained in recent months. In particular, it has been proved that MLP have what we call the Weierstrass property, that is, they can approximate arbitrarily well - provided enough units are available - any continuous function on a compact interval (Hornik, Stinchcombe and White, 1989; Stinchcombe and White, 1989; Carroll and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989). Regularization networks also have this property, shared by many other approximation schemes (Girosi and Poggio, 1989).

It is natural to explore the relation between the two techniques, especially because the corresponding networks have a superficially similar structure, both having one hidden layer of units as shown in Fig. (1).

The network of Fig. (1) represents the class of functions of the type

f(x) = \sum_{i=1}^{N} c_i H_i(x)   (1)

where the H_i are functions that depend on some parameters to be estimated together with the coefficients c_i. When the functions H_i are kept fixed, the function f(x) is linear in its parameters (the c_i), and the resulting approximation scheme is linear.




Figure 1: The most general network with one layer of hidden units. Here we show the two-dimensional case, in which x = (x, y). Each function H_i can depend on a set of unknown parameters that are computed during the learning phase, as well as the coefficients c_i. When H_i = \sigma(w_i \cdot x + \theta_i) the network is a multilayer perceptron; when H_i = h(x - t_i) the network is a regularization network for appropriate choices of h.



Depending on the functions H_i, different approximation schemes can be obtained. Two common choices are the following:

1. Regularization networks, which correspond to the choice:

   H_i(x) = h(\|x - t_i\|_W)

   where h is a (conditionally) positive definite function, the t_i are d-dimensional vectors called "centers", W is a d \times d matrix, and we have defined the weighted norm \| \cdot \|_W as

   \|x\|^2_W = x^T W^T W x .

   We call this scheme HyperBF. RBF is the case in which the centers coincide with the data points, and GRBF (Generalized Radial Basis Functions) is the case in which the centers t_\alpha are free parameters to be estimated but the weight matrix W is fixed and equal to the identity matrix.

   This class of approximation schemes can be formally derived, in the framework of regularization theory, from a variational principle which has a Bayesian interpretation. It includes:

   • kernel estimation techniques and Parzen windows

   • splines

   • HyperBF and RBF

2. Ridge function approximation:

   H_i(x) = h_i(x \cdot w_i + \theta_i)

   where the w_i are d-dimensional vectors called "weights", and the parameters \theta_i are bias terms. This form of approximation did not have until now any variational or Bayesian interpretation. Very recent results by Poggio and Girosi (unpublished) show that a set of ridge function approximation schemes can be derived as limits of regularization networks for an appropriate class of stabilizers. Several techniques are included in this class:



• Projection Pursuit Regression (PPR): it corresponds to an expansion of the type of eq. (1) in which all the coefficients c_i are set to 1, and

  H_i(x) = h_i(x \cdot w_i)

  where the vectors w_i are normalized (\|w_i\| = 1). The functions h_i are determined by means of some nonparametric estimation technique, in the following iterative way:

  (a) Assume that we already have k - 1 terms. Let

      r_i = y_i - \sum_{j=1}^{k-1} h_j(x_i \cdot w_j)

      be the residual of the approximation;

  (b) Search for the next term. Calculate the following sum of the residuals

      \sum_{i=1}^{N} (r_i - h_k(x_i \cdot w_k))^2

      and find the direction w_k which minimizes this sum of residuals and the corresponding function h_k.

• Flexible Fourier series: the functions h_i are all equal to the cosine (or sine) function, and therefore:

  H_i(x) = \cos(x \cdot w_i + \theta_i) .   (2)

  If we assume that the function g underlying the data has a Fourier transform \tilde{g}(s) = |\tilde{g}(s)| e^{i\theta(s)}, then

  g(x) = \Re\left( \int ds\, e^{i s \cdot x} |\tilde{g}(s)| e^{i\theta(s)} \right) = \int ds\, |\tilde{g}(s)| \cos(s \cdot x + \theta(s)) ,   (3)

  and the expansion of eq. (1) with the choice (2) looks like a cubature formula for the integral above, where the vectors w_i are the points at which the integrand is sampled. In this case the interpretation of the "weights" w_i is clear: they represent the fundamental frequency content of the underlying function. It can also be noticed that the problem of finding the optimal parameters of the parametric function is equivalent to the problem of finding the optimal sample points to use in the discretization of the integral above.

• Multilayer Perceptrons: the functions h_i are all equal to a sigmoidal function:

  H_i(x) = \sigma(x \cdot w_i + \theta_i) .   (4)

  The function \sigma is usually defined as

  \sigma(x) = \frac{1}{1 + e^{-x}}

  but other choices (such as the hyperbolic tangent) are equivalent, as long as the sigmoidal shape is maintained.

  While in the approximation by trigonometric functions the parameters w_i have a simple interpretation in terms of frequencies, in this case the meaning of the w_i is less clear. If the expansion (4) is considered from the point of view of projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985), the w_i are the "relevant directions" that are supposed to encode the most information about the function.

• Exponential sums: when the problem is one dimensional, a well known nonlinear technique consists in approximation by exponential sums (Braess, 1986; Gosselin, 1986; Hobby and Rice, 1967), and a number of results are available in this case. In more than one dimension, the natural extension corresponds to ridge function approximation with the choice:

  H_i(x) = e^{-x \cdot w_i} .   (5)

  Notice that the bias terms disappear, since they are absorbed by the coefficients.



We group the ridge function approximation schemes together because they all have the same general form: a linear combination of nonlinear functions of the scalar product of the input vector with the parameter vector.

The main difference between these two classes of approximation schemes (which, as we mentioned earlier, can both be derived from regularization in terms of two different classes of stabilizers, reflecting different prior assumptions¹) seems to be related to the use of the scalar product x \cdot w_i, instead of the weighted Euclidean distance \|x - t_i\|_W, as argument of the parametric functions h.

At first sight, these two broad classes of techniques do not seem to have any relationship. The main point of this paper is to show that in some special situations there is a close connection between these two classes of approximation schemes, and in particular between Gaussian GRBF (i.e. HyperBF networks with W = I) and MLP with sigmoidal units in the hidden layer.

2 Normalized Inputs: Ridge Functions are Radial Functions

In the case in which all the input variables are normalized, that is, they lie on the unit d-dimensional sphere, there is a simple connection between ridge and radial functions. In fact the identity

\|x - t\|^2 = \|x\|^2 + \|t\|^2 - 2\, x \cdot t   (6)

for \|x\| = 1 becomes:

\|x - t\|^2 = 1 + \|t\|^2 - 2\, x \cdot t .   (7)

We can now use identity (7) to obtain a ridge function scheme from a Radial Basis Functions scheme and vice versa.
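As a quick numerical sanity check of identity (7), the short Python snippet below (an illustrative sketch, not part of the original paper) draws a random unit-norm input and verifies that \|x - t\|^2 equals 1 + \|t\|^2 - 2 x \cdot t.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)
x /= np.linalg.norm(x)          # normalized input, ||x|| = 1
t = rng.normal(size=d)          # arbitrary center

lhs = np.sum((x - t) ** 2)                     # ||x - t||^2
rhs = 1.0 + np.dot(t, t) - 2.0 * np.dot(x, t)  # right-hand side of identity (7)
assert np.isclose(lhs, rhs)
print(lhs, rhs)
```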



¹The formulation of the learning problem in terms of regularization is satisfying from a theoretical point of view, since it establishes connections with a large body of results in the area of Bayes estimation and in the theory of approximation of multivariate functions.



• From radial basis functions to multilayer perceptrons.

Substituting identity (7) in the Radial Basis Functions expansion

f(x) = \sum_{\alpha=1}^{N} c_\alpha H(\|x - t_\alpha\|^2)   (8)

we obtain

f(x) = \sum_{\alpha=1}^{N} c_\alpha \bar{H}(x \cdot t_\alpha + \theta_\alpha) ,   (9)

where we have defined

\bar{H}(x) = H(-2x) , \qquad \theta_\alpha = -\frac{1}{2}(1 + \|t_\alpha\|^2) .   (10)

We notice that eq. (9) is the expansion corresponding to a multilayer perceptron network. The only difference is that while in the multilayer perceptron network the bias parameter \theta_\alpha is allowed to vary along the real line, in this case it is constrained to lie in the interval (-\infty, -\frac{1}{2}]. We can therefore conclude that, when inputs are normalized, given the RBF network of eq. (8) with activation function H it is always possible to define a multilayer perceptron with the same number of units and with activation function \bar{H} that computes the same function. The synaptic weights connecting the input and the hidden layer are the centers of the Radial Basis Functions expansion, and the bias parameters are uniquely determined by the synaptic weights.
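The construction of eqs. (8)-(10) is easy to check numerically. The sketch below (illustrative Python; a Gaussian H is assumed for concreteness) builds a small RBF network on the unit sphere and the corresponding "MLP-style" network whose weights are the centers and whose biases are \theta_\alpha = -\frac{1}{2}(1 + \|t_\alpha\|^2), and verifies that the two compute the same function.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 6
H = lambda u: np.exp(-u)        # radial activation H(||x - t||^2); Gaussian chosen for concreteness
H_bar = lambda u: H(-2.0 * u)   # bar{H}(x) = H(-2x), eq. (10)

t = rng.normal(size=(N, d))                    # centers
c = rng.normal(size=N)                         # coefficients
theta = -0.5 * (1.0 + np.sum(t**2, axis=1))    # biases, eq. (10)

x = rng.normal(size=d)
x /= np.linalg.norm(x)                         # normalized input

f_rbf = c @ H(np.sum((x - t) ** 2, axis=1))    # eq. (8)
f_mlp = c @ H_bar(t @ x + theta)               # eq. (9): weights = centers
assert np.isclose(f_rbf, f_mlp)
print(f_rbf, f_mlp)
```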

• From multilayer perceptrons to radial basis functions.

In the previous case we have seen that Radial Basis Functions can be simulated by a multilayer perceptron. This is not surprising, since a Radial Basis Functions unit has one parameter less than a multilayer perceptron unit. For the same reason we cannot expect to simulate a multilayer perceptron unit with d inputs in terms of a Radial Basis Functions unit with the same number of inputs. However, this may be possible if we add to the Radial Basis Functions units a dummy input, so that the coordinate of the center corresponding to the dummy variable gives the missing parameter. Let us consider the multilayer perceptron expansion:

f(x) = \sum_{\alpha=1}^{N} c_\alpha \sigma(x \cdot w_\alpha + \theta_\alpha) .   (11)

Using identity (7) we obtain:

f(x) = \sum_{\alpha=1}^{N} c_\alpha \sigma\left(-\frac{1}{2}\|x - w_\alpha\|^2 + d_\alpha\right) ,   (12)

where we have defined

d_\alpha = \frac{1}{2}(1 + \|w_\alpha\|^2) + \theta_\alpha .

We now rewrite the expansion as a Radial Basis Functions expansion in d + 1 dimensions, in which one of the input variables is kept fixed and equal to some number (1, for simplicity). Introducing the (d + 1)-dimensional vectors

\tilde{x} = (x, 1) , \qquad \tilde{w}_\alpha = (w_\alpha, v_\alpha) ,

we can rewrite the expansion (12) as

f(x) = \sum_{\alpha=1}^{N} c_\alpha \sigma\left(-\frac{1}{2}\|\tilde{x} - \tilde{w}_\alpha\|^2 + \frac{1}{2}(1 - v_\alpha)^2 + d_\alpha\right) .   (13)

Expansion (13) becomes a Radial Basis Functions expansion if one of the two following conditions is satisfied:

1. There exists v_\alpha such that

   \frac{1}{2}(1 - v_\alpha)^2 = -d_\alpha ;   (14)

2. There exist functions g_1 and g_2 such that

   \sigma\left(-\frac{1}{2}x + y\right) = g_1(x)\, g_2(y) .   (15)

In the first case, in fact, the multilayer perceptron expansion becomes

f(x) = \sum_{\alpha=1}^{N} c_\alpha h(\|\tilde{x} - \tilde{w}_\alpha\|^2)   (16)

where we have defined the function h(x) = \sigma(-\frac{1}{2}x). In the second case the multilayer perceptron expansion becomes

f(x) = \sum_{\alpha=1}^{N} c'_\alpha\, g_1(\|\tilde{x} - \tilde{w}_\alpha\|^2) ,   (17)

where c'_\alpha = c_\alpha\, g_2(d_\alpha + \frac{1}{2}(1 - v_\alpha)^2).

Remarks

In case (1), a solution to eq. (14) does not exist unless the condition

\frac{1}{2}(1 + \|w_\alpha\|^2) + \theta_\alpha \le 0

is satisfied. This condition on the weights w_\alpha and on the bias parameter \theta_\alpha defines a subclass of multilayer perceptrons that can be simulated by a Radial Basis Function network with the same number of units and a dummy input, and therefore the same number of parameters.

In case (2), evaluating eq. (15) first at y = 0 and then at x = 0 we obtain

g_1(x) = \frac{\sigma(-\frac{1}{2}x)}{g_2(0)} , \qquad g_2(y) = \frac{\sigma(y)}{g_1(0)} ,

and after some algebra

\sigma(x + y) = \frac{\sigma(x)\sigma(y)}{\sigma(0)} .   (18)

It is well known that this functional equation has only one class of solutions, given by

\sigma(x) = c\, e^x , \qquad c \in R .

Therefore radial units can simulate arbitrary multilayer perceptron units only in the case in which the activation function of the multilayer perceptron network is an exponential function. In this case the corresponding Radial Basis Functions network is a Gaussian network. In fact, setting \sigma(x) = e^x in eq. (13) we obtain the expansion

f(x) = \sum_{\alpha=1}^{N} c'_\alpha \exp\left(-\frac{1}{2}\|x - w_\alpha\|^2\right) .

Notice also that in this case there is no need of a dummy input, since the bias term can be dropped by a redefinition of the coefficients, due to the property (18) of the exponential function.
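The converse construction of eqs. (11)-(16) can also be checked numerically when condition (14) has a solution, that is, when \frac{1}{2}(1 + \|w_\alpha\|^2) + \theta_\alpha \le 0. The Python sketch below (illustrative, not from the paper) adds the dummy input, solves (14) for v_\alpha, and verifies that the radial units h(\|\tilde{x} - \tilde{w}_\alpha\|^2), with h(u) = \sigma(-u/2), reproduce the sigmoidal units.

```python
import numpy as np

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
h = lambda u: sigma(-0.5 * u)                 # radial activation of eq. (16)

rng = np.random.default_rng(2)
d, N = 3, 5
w = rng.normal(size=(N, d))
# biases chosen so that d_alpha <= 0, i.e. the condition of case (1) holds
theta = -0.5 * (1.0 + np.sum(w**2, axis=1)) - rng.uniform(0.1, 1.0, size=N)
d_alpha = 0.5 * (1.0 + np.sum(w**2, axis=1)) + theta
assert np.all(d_alpha <= 0)

v = 1.0 - np.sqrt(-2.0 * d_alpha)             # one solution of eq. (14)

x = rng.normal(size=d); x /= np.linalg.norm(x)
x_tilde = np.append(x, 1.0)                   # dummy input fixed to 1
w_tilde = np.hstack([w, v[:, None]])          # (d+1)-dimensional centers

mlp_units = sigma(w @ x + theta)                          # units of eq. (11)
rbf_units = h(np.sum((x_tilde - w_tilde) ** 2, axis=1))   # units of eq. (16)
assert np.allclose(mlp_units, rbf_units)
print(mlp_units)
```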

3 Sigmoid MLPs and Gaussian GRBFs

In this section, we compare the sigmoid and the Gaussian function for normalized input vectors \|x\| = 1. Under these conditions the sigmoidal function becomes a radial function, which can approximate any Gaussian function well enough within a certain range of the bias parameter \theta.

For normalized inputs, any ridge function is equivalent to an appropriate radial function. Consider the sigmoid function given by

\sigma(w \cdot x + \theta) = \frac{1}{1 + \exp(-(w \cdot x + \theta))} \qquad (w \in R^d, \ \theta \in R) .

The corresponding radial function, parametrized by \lambda \in R, is:

s\,\sigma(w \cdot x + \theta) = \frac{s}{1 + \exp(-(w \cdot x + \theta))} = \frac{s}{1 + C(\lambda) \exp(\lambda \|x - t\|^2)} \qquad (s \in R, \ C(\lambda) > 0)   (19)

where we have defined:

t = \frac{w}{2\lambda} , \qquad C(\lambda) = \exp(-(\theta + \lambda(1 + \|t\|^2))) .   (20)
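Identity (19)-(20) is easy to verify numerically: for a normalized input, the sigmoid of w \cdot x + \theta coincides with the radial expression for any choice of \lambda. Below is a minimal Python sketch (illustrative only, with s = 1) that checks this for a few values of \lambda.

```python
import numpy as np

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(3)
d = 4
w = rng.normal(size=d)
theta = rng.normal()
x = rng.normal(size=d); x /= np.linalg.norm(x)   # ||x|| = 1

for lam in (0.3, 1.0, np.linalg.norm(w) / 2):
    t = w / (2.0 * lam)                                          # eq. (20)
    C = np.exp(-(theta + lam * (1.0 + np.dot(t, t))))            # eq. (20)
    radial = 1.0 / (1.0 + C * np.exp(lam * np.sum((x - t)**2)))  # eq. (19)
    assert np.isclose(radial, sigma(np.dot(w, x) + theta))
print("identity (19) verified")
```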

3.1 Sigmoid units can approximate Gaussian units

The existence of the free parameter \lambda indicates that all the elements of the one-parameter family given by (19) are equivalent to each other for normalized inputs. To compare Gaussians with the radial functions associated with sigmoids, we should measure the discrepancy between the two functions. For the purpose of the comparison, we first take the closest function to a Gaussian among all the elements of the one-parameter family defined above. It turns out that the radial function (19) approximates the Gaussian function well if C(\lambda) \gg 1 holds (see figures 2 and 3). According to this observation, adopting C(\lambda) as a measure of closeness to the Gaussian, we consider the function whose parameter \lambda^* corresponds to the maximum of C(\lambda):

C(\lambda^*) = \max_\lambda C(\lambda) .

Solving the equation dC/d\lambda = 0, we get:

\lambda^* = \|w\|/2 .

Substituting \lambda = \|w\|/2, we obtain the following radial function R^d \to R, which has its center on the unit sphere S^{d-1}:

s\,\sigma(w \cdot x + \theta) = \frac{s}{1 + C(\theta, \|w\|) \exp\left(\frac{\|w\|}{2} \left\|x - \frac{w}{\|w\|}\right\|^2\right)}   (21)

C(\theta, \|w\|) = \exp(-(\theta + \|w\|)) .
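The optimal parameter \lambda^* = \|w\|/2 can be confirmed by inspecting C(\lambda) = \exp(-(\theta + \lambda + \|w\|^2/(4\lambda))) on a grid, as in the short sketch below (illustrative Python, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=3)
theta = -2.0
norm_w = np.linalg.norm(w)

# C(lambda), eq. (20) with t = w / (2 * lambda)
C = lambda lam: np.exp(-(theta + lam + norm_w**2 / (4.0 * lam)))

lams = np.linspace(1e-3, 10.0, 200001)
lam_star = lams[np.argmax(C(lams))]
print(lam_star, norm_w / 2)          # the grid maximum sits at ||w|| / 2
assert abs(lam_star - norm_w / 2) < 1e-3
```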



Figure 2: The radial function f(r) = 1/(1 + Ce^{r^2}) associated with sigmoids (see eq. 19), for 3 different values of C: C = 5, 10^{-1}, 10^{-3}.




Figure 3: The Gaussian function f(r) = e^{-r^2} and the radial functions f(r) = \frac{1+C}{1+Ce^{r^2}} for 3 different values of C: C = 1, 10^{-1}, 10^{-3}. The role of the scale factor 1 + C is to enforce the value at the origin to be 1, to ease the comparison with the Gaussian.



As shown in figures (2) and (3), the radial function (21) is quite similar to a Gaussian function if C satisfies C \gg 1, or equivalently, if the bias parameter satisfies

\theta < -\|w\| .   (22)

This implies that an MLP network can always simulate any Gaussian GRBF network well enough, if both networks consist of the same number of units (of course the MLP network has one more free parameter per hidden unit than a GRBF network), since \theta can always be chosen to satisfy (22).
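The role of condition (22) can be illustrated numerically: when C \gg 1 the radial function 1/(1 + C e^{r^2}) is, up to a scale factor absorbed by the output coefficient, close to the Gaussian e^{-r^2}, while for small C it is not. A rough Python illustration (our own, with arbitrarily chosen values of C) follows.

```python
import numpy as np

r = np.linspace(0.0, 2.0, 400)
g = np.exp(-r**2)                         # target Gaussian shape

for C in (100.0, 0.01):                   # C >> 1 versus C << 1 (arbitrary illustrative values)
    f = 1.0 / (1.0 + C * np.exp(r**2))    # radial function associated with the sigmoid
    f_scaled = f / f[0] * g[0]            # rescale so both curves agree at r = 0
    err = np.max(np.abs(f_scaled - g))
    print(f"C = {C:>7}: max deviation from the Gaussian = {err:.3f}")
```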

3.2 The Gaussian Radial Basis Function on the sphere

Let us first consider a Gaussian function for normalized inputs. When the d-dimensional input vectors x \in R^d are normalized as \|x\| = 1, for any Gaussian function the following relation holds:

c \exp(-\mu^2 \|x - t\|^2) = c'(\lambda) \exp(-\mu'(\lambda)^2 \|x - t'(\lambda)\|^2)

where \lambda \in R (\lambda \neq 0) is an arbitrary parameter and

c'(\lambda) = c \exp\left(-\mu^2 (1 - \lambda)\left(1 - \frac{\|t\|^2}{\lambda}\right)\right) , \qquad \mu'(\lambda)^2 = \lambda \mu^2 , \qquad t'(\lambda) = \frac{t}{\lambda} .

This shows that the representation c \exp(-\mu^2 \|x - t\|^2) is redundant for normalized inputs. For example, for any Gaussian function c \exp(-\mu^2 \|x - t\|^2), the following types of Gaussians, which are given by setting \lambda = \|t\| and \lambda = \mu^{-2}, respectively, are equivalent for normalized inputs:

c \exp(-\mu^2 \|x - t\|^2) = c' \exp\left(-\mu^2 \|t\| \left\|x - \frac{t}{\|t\|}\right\|^2\right) \qquad (c' = c \exp(-\mu^2 (1 - \|t\|)^2))

= c'' \exp(-\|x - \mu^2 t\|^2) \qquad (c'' = c \exp((1 - \mu^2)(1 - \mu^2 \|t\|^2))) .

The above indicates that the total number of free parameters of a Gaussian for normalized inputs is d + 1, and that we may use either Gaussians with normalized centers,

c \exp(-\mu^2 \|x - t\|^2) , \qquad \|t\| = 1 ,

or those given by

c \exp(-\|x - t\|^2) ,

as basis functions for normalized inputs.
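The redundancy of the Gaussian representation on the sphere can again be checked numerically. The sketch below (illustrative Python) verifies, for a random normalized input, that c \exp(-\mu^2 \|x - t\|^2) equals c'(\lambda) \exp(-\lambda \mu^2 \|x - t/\lambda\|^2) with c'(\lambda) as defined above.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4
x = rng.normal(size=d); x /= np.linalg.norm(x)   # normalized input
t = rng.normal(size=d)
c, mu = 1.7, 0.8

lhs = c * np.exp(-mu**2 * np.sum((x - t) ** 2))

for lam in (0.5, np.linalg.norm(t), mu**-2):     # arbitrary, lambda = ||t||, lambda = mu^-2
    c_prime = c * np.exp(-mu**2 * (1.0 - lam) * (1.0 - np.dot(t, t) / lam))
    rhs = c_prime * np.exp(-lam * mu**2 * np.sum((x - t / lam) ** 2))
    assert np.isclose(lhs, rhs)
print("Gaussian redundancy on the sphere verified")
```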

3.3 Can Gaussian units approximate sigmoid units?

From the above observation, we see that MLP can be more flexible than Radial Basis Functions for normalized inputs if the total number of units is the same, and therefore the total number of parameters is larger for MLPs. This is based upon a one-to-one comparison of the units of the networks. We expect that it may be possible to approximate the radial function (21) using a set of m Gaussians with the same center:

\frac{1}{1 + C e^{r^2}} \approx \sum_{j=1}^{m} a_j e^{-b_j r^2} \qquad (b_j > 0) .   (23)

Figure (4) shows the experimental results of approximating the radial function for C = 0.01, 0.0001 using 3 Gaussians for each radial function (sigmoid). The results imply the possibility of approximating a sigmoid by a set of Gaussians. The number of parameters of a sigmoid is d + 2. However, d + 2m - 1 parameters are required for the Gaussians to approximate a sigmoid according to (23) (thus in the experiments, the total number of parameters of a sigmoid and of a set of Gaussians are d + 2 and d + 5, respectively). In this subsection, we consider the possibility of approximating a sigmoid by a set of Gaussians with a similar number of parameters.
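A fit of the kind used for Figure 4 can be reproduced along the following lines. The Python sketch below (our own illustration, using scipy's nonlinear least squares rather than whatever procedure was actually used) fits the radial function 1/(1 + C e^{r^2}) with a sum of 3 Gaussians with free amplitudes a_j and free positive scales b_j, as in eq. (23); the initial guess is an assumption.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss_sum(r, a1, a2, a3, b1, b2, b3):
    # superposition of 3 Gaussians with the same (zero) center, eq. (23)
    return a1 * np.exp(-b1 * r**2) + a2 * np.exp(-b2 * r**2) + a3 * np.exp(-b3 * r**2)

C = 0.01                                   # one of the values used in Figure 4
r = np.linspace(0.0, 2.0, 400)
target = 1.0 / (1.0 + C * np.exp(r**2))    # radial function associated with the sigmoid

p0 = [0.3, 0.3, 0.3, 0.5, 2.0, 8.0]        # rough initial guess (assumption)
lower = [-np.inf] * 3 + [1e-6] * 3         # keep the scales b_j positive
popt, _ = curve_fit(gauss_sum, r, target, p0=p0, bounds=(lower, [np.inf] * 6))

rel_err = np.sum((gauss_sum(r, *popt) - target) ** 2) / np.sum(target**2)
print("fitted (a_j, b_j):", np.round(popt, 4), " relative L2 error:", rel_err)
```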

Figure 4: Approximation of the radial function f(r) = 1/(1 + C\exp(r^2)) (plot 1) by a superposition of 3 Gaussians (plot 2): (a) C = 0.01, (b) C = 0.0001.

3.3.1 Approximation by Gaussians with Constant Scales

In (Poggio and Girosi, 1990a), we have extended our learning theory based on regularization and proposed the use of radial basis functions at multiple scales. To explore the possibility of approximating MLP by Multiscale GRBF with a similar number of parameters, we consider the following basis function with constant scales \{B_j\}_{j=1}^{m}:



\phi_\alpha(x) = \sum_{j=1}^{m} a_{\alpha j}\, e^{-B_j \|x - t_\alpha\|^2} , \qquad (\|t_\alpha\| = 1)   (24)

which approximates the radial function (21),

\frac{1}{1 + C \exp\left(\frac{\|w\|}{2} \|x - t_\alpha\|^2\right)} .

In this case, the above multiscale function \phi_\alpha has d + m - 1 parameters. Since the center t_\alpha and the input x are normalized, the condition

0 \le \|x - t_\alpha\| \le 2

is satisfied. Therefore the coefficients \{a_j\} are determined by solving the minimization problem:

\int_0^2 \left( \frac{1}{1 + C e^{\frac{\|w\|}{2} r^2}} - \sum_{j=1}^{m} a_j e^{-B_j r^2} \right)^2 dr \;\rightarrow\; \min ,

and are given by the solution of the following linear system:

U a = v , \qquad U_{ij} = \int_0^2 e^{-(B_i + B_j) r^2}\, dr , \qquad v_i = \int_0^2 \frac{e^{-B_i r^2}}{1 + C e^{\frac{\|w\|}{2} r^2}}\, dr .

In Figures 5, 6 and 7 we show experimental results of the approximation error for m = 3, 4, 5 (the number of parameters of each scheme is thus d + 2, d + 3, d + 4, respectively). The approximation errors are evaluated as:

\epsilon = \frac{\displaystyle \int_0^2 \left( \frac{1}{1 + C e^{\frac{\|w\|}{2} r^2}} - \sum_{j=1}^{m} a_j e^{-B_j r^2} \right)^2 dr}{\displaystyle \int_0^2 \frac{dr}{\left(1 + C e^{\frac{\|w\|}{2} r^2}\right)^2}} .

In the experiments, the scales \{B_j\} are given by the results of the experiment of eq. (23). The results show that, if the length of the weight vector is not too large, the sigmoid for normalized inputs can be well approximated by a superposition of Gaussians with the same number of parameters.
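With the scales B_j held fixed, the optimal coefficients of eq. (24) follow from the linear system U a = v above. A small Python sketch (illustrative; the values of C, \|w\| and the scales below are arbitrary choices, not those of the original experiments) computes the integrals by a simple Riemann sum and reports the relative error \epsilon.

```python
import numpy as np

C, norm_w = 0.01, 2.0                      # assumed values for illustration
B = np.array([0.5, 2.0, 8.0])              # fixed scales B_j (arbitrary choice)

r = np.linspace(0.0, 2.0, 4001)
dr = r[1] - r[0]
target = 1.0 / (1.0 + C * np.exp(0.5 * norm_w * r**2))   # radial function (21) as a function of r
basis = np.exp(-np.outer(B, r**2))                        # rows: e^{-B_i r^2}

# Riemann-sum approximations of U_ij and v_i
U = (basis[:, None, :] * basis[None, :, :]).sum(axis=-1) * dr
v = (basis * target).sum(axis=-1) * dr
a = np.linalg.solve(U, v)

approx = a @ basis
eps = ((approx - target) ** 2).sum() / (target**2).sum()
print("coefficients a:", np.round(a, 4), " relative error eps:", eps)
```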



Figure 5: Approximation of the radial function by Multiscale Gaussians: 3 scales.



Figure 6: Approximation of the radial function by Multiscale Gaussians: 4 scales.



Figure 7: Approximation of the radial function by Multiscale Gaussians: 5 scales.



3.3.2 Approximation by Gaussians with Fixed Ratio of Scales

In the previous subsection, we have shown that a superposition of Gaussians can approximate sigmoid units very well, for a certain range of parameters, with the same (or a similar) number of parameters. In this subsection we consider, as another possibility for approximating any sigmoid with a Multiscale GRBF, the following basis function:

\phi_\alpha(x) = \sum_{j=1}^{3} a_{\alpha j}\, e^{-\mu_\alpha B_j \|x - t_\alpha\|^2} , \qquad (\|t_\alpha\| = 1)   (25)

where the ratios of the scales B_j (j = 1, 2, 3) are constants. The total number of parameters of this basis function is d + 3. In Figure (8) we show some results of approximation experiments. In the experiments, the \{B_j\} are given by the results of the experiment of eq. (23) as before, and the \{a_j\} are determined so that

\int_0^\infty \left( \frac{1}{1 + C e^{r^2}} - \sum_{j=1}^{3} a_j e^{-B_j r^2} \right)^2 dr \;\rightarrow\; \min .

The optimal \{a_j\} are obtained by solving the linear equations:

U a = v , \qquad U_{ij} = \int_0^\infty e^{-(B_i + B_j) r^2}\, dr , \qquad v_i = \int_0^\infty \frac{e^{-B_i r^2}}{1 + C e^{r^2}}\, dr .

Table (1) shows the approximation errors given by:

\epsilon = \frac{\displaystyle \int_0^\infty \left( \frac{1}{1 + C e^{r^2}} - \sum_{j=1}^{3} a_j e^{-B_j r^2} \right)^2 dr}{\displaystyle \int_0^\infty \frac{dr}{(1 + C e^{r^2})^2}} .



Figure 8: Approximation of the radial function f(r) = 1/(1 + C\exp(r^2)) (plot 1) by a superposition of 3 Gaussians with a fixed ratio of scales (plot 2): (a) C = 0.1, (b) C = 0.001, (c) C = 0.00001.

C | 1       | 0.1     | 0.01   | 0.001  | 0.0001 | 0.00001
ε | 0.00165 | 0.00243 | 0.0171 | 0.0204 | 0.0162 | 0.0307

Table 1: The relative approximation error ε obtained approximating a sigmoid with a multiscale GRBF in which the ratio of the variances of the Gaussians is kept fixed.

4 Experiments

In the previous section, we have shown that sigmoid MLPs can always simulate any Gaussian GRBF network when both networks consist of the same number of units (sigmoids and Gaussians). The converse is also true if the MLP's bias parameters are restricted to a certain range, given by (22). To investigate whether this constraint is always satisfied in practice, numerical experiments were carried out.

Even if the original inputs are not normalized, it is always possible to obtain normalized inputs by adding one more dimension, as follows:

x = (x_1, \ldots, x_d) , \qquad |x_i| \le \epsilon

x \rightarrow x' = (x'_1, \ldots, x'_d, x'_{d+1}) , \qquad x'_i = \frac{x_i}{\epsilon \sqrt{d}} \ (i = 1, \ldots, d) , \qquad x'_{d+1} = \sqrt{1 - \sum_{i=1}^{d} x_i'^2} .

In the experiments, we tried to approximate one dimensional functions, which are mapped onto a (hemi)circle using the technique shown above as:

F(x, y) = F(x) , \qquad |x| \le 1 , \qquad y = \sqrt{1 - x^2} .
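A possible implementation of this normalization trick is sketched below in Python (our reading of the construction; the exact scaling factor is an assumption, but any scaling that keeps the sum of the squared coordinates at most 1 before appending the extra coordinate works).

```python
import numpy as np

def normalize_inputs(X, eps):
    """Map points with |x_i| <= eps onto the unit sphere in d+1 dimensions."""
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    Xs = X / (eps * np.sqrt(d))                   # assumed scaling; keeps sum_i x_i'^2 <= 1
    extra = np.sqrt(1.0 - np.sum(Xs**2, axis=1))  # x'_{d+1}
    return np.hstack([Xs, extra[:, None]])

# one-dimensional example, as in the experiments: x -> (x, sqrt(1 - x^2))
x = np.linspace(-1.0, 1.0, 5)[:, None]
Xn = normalize_inputs(x, eps=1.0)
print(Xn)
print(np.linalg.norm(Xn, axis=1))                 # all ones
```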
The functions used in our experiments are as follows (Fig. (9)):

• F_1 = x

• F_2 = e^{-x^2} \cos(\frac{3}{2}\pi x)

• F_3 = e^{-x^2} \cos(\frac{5}{2}\pi x)

• F_4 = \frac{\sin(3\pi x)}{3\pi x}

• F_5 = e^{-x} \sin(5x)

They were approximated with the two-dimensional MLP (sigmoids), constrained MLP, GRBF (Gaussians), and Multiscale GRBF schemes given below.



Figure 9: The test functions used in our experiments: F_1 = x (plot 1), F_2 = e^{-x^2}\cos(\frac{3}{2}\pi x) (plot 2), F_3 = e^{-x^2}\cos(\frac{5}{2}\pi x) (plot 3), F_4 = \frac{\sin(3\pi x)}{3\pi x} (plot 4), F_5 = e^{-x}\sin(5x) (plot 5).



• MLP (sigmoids)

  f(x, y) = \sum_{\alpha=1}^{K} c_\alpha\, \sigma(w_\alpha \cdot x + \theta_\alpha) , \qquad \sigma(u) = \frac{1}{1 + e^{-u}}

• Constrained MLP (sigmoids)

  f(x, y) = \sum_{\alpha=1}^{K} c_\alpha\, \sigma(w_\alpha \cdot x + \theta_\alpha) , \qquad \theta_\alpha = -\left(\frac{1}{4}\|w_\alpha\|^2 + 1\right)

  so that

  C(\theta_\alpha, \|w_\alpha\|) = \exp\left(\frac{1}{4}(\|w_\alpha\| - 2)^2\right) \ge 1

• GRBF (Gaussians)

  f(x, y) = \sum_{\alpha=1}^{K} c_\alpha \exp(-\|x - t_\alpha\|^2)

• Multiscale GRBF

  f(x, y) = \sum_{\alpha=1}^{K} \sum_{j=1}^{m} c_{\alpha j} \exp(-B_j \|x - t_\alpha\|^2)

Approximation performances were compared according to the normalized L_2 and L_∞ errors, measured on both the training set and the evaluation set, as follows:

L_2 = \frac{\sum_{i=1}^{N} (F(x_i) - f(x_i))^2}{\sum_{i=1}^{N} F(x_i)^2} , \qquad L_∞ = \frac{\max_{i=1}^{N} |F(x_i) - f(x_i)|}{\max_{i=1}^{N} |F(x_i)|}

L'_2 = \frac{\sum_{p=1}^{M} (F(x_p) - f(x_p))^2}{\sum_{p=1}^{M} F(x_p)^2} , \qquad L'_∞ = \frac{\max_{p=1}^{M} |F(x_p) - f(x_p)|}{\max_{p=1}^{M} |F(x_p)|}

where \{x_i\}_{i=1}^{N} and \{x_p\}_{p=1}^{M} are the training set and the evaluation set, respectively. In our experiments, these sets are randomly chosen and the numbers of points in the training set and evaluation set are N = 20 and M = 100, respectively.
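The normalized error measures above translate directly into code. The following Python sketch (illustrative only; the target and predictor below are made up) computes L_2 and L_∞ on a training set; the same function applied to the evaluation set gives L'_2 and L'_∞.

```python
import numpy as np

def normalized_errors(F_true, F_pred):
    """Normalized L2 and L_infinity errors, as defined above."""
    F_true = np.asarray(F_true, dtype=float)
    F_pred = np.asarray(F_pred, dtype=float)
    l2 = np.sum((F_true - F_pred) ** 2) / np.sum(F_true**2)
    linf = np.max(np.abs(F_true - F_pred)) / np.max(np.abs(F_true))
    return l2, linf

# toy usage with a hypothetical approximation of F_1 = x (for illustration only)
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, 20)          # N = 20 training points
F = lambda x: x                               # target F_1 = x
f = lambda x: x + 0.01 * np.sin(5 * x)        # a made-up approximation
L2, Linf = normalized_errors(F(x_train), f(x_train))
print(L2, Linf)
```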



In Table 2 we report the training and test errors, both in the L_2 and L_∞ norms, obtained by applying the techniques described above to the set of test functions F_1, ..., F_5. In Table 3 we report the values of C(θ_α, ||w_α||) and ||w_α|| for the hidden units of the MLP network. These results show that condition (22) is not always satisfied. To see how the target functions were approximated by the MLP (sigmoids), we show in Figs. (10), (11), (12), (13) and (14) the solutions obtained in our experiments.

5 Conclusions

This paper explores the relationship between MLPs and GRBF. Its main point is that under the condition of normalized inputs the two representations are very similar, though not identical. This is somewhat surprising, especially since MLPs and RBF can be regarded as representative of two different ways of approximating functions, based respectively on scalar products and on Euclidean distances.

For normalized d-dimensional input vectors, sigmoidal MLPs can always approximate essentially arbitrarily well a given GRBF (which has M fewer parameters, where M is the number of hidden units). The converse is true only for a certain range of the bias parameter in the sigmoidal unit. We have verified experimentally that in MLPs trained with backpropagation the bias parameters do not always respect this condition. Therefore MLPs generated by backpropagation cannot in general be approximated arbitrarily well by GRBFs with the same number of hidden units (and d fewer parameters). GRBFs that have 3 Gaussians per center, and therefore the same number of parameters as an MLP network with the same number of hidden units, yield a reasonable approximation of a given sigmoid MLP but, again, not for all parameter values. Within the framework of this paper, there seems to be no simple answer to the question of what is the relation between MLPs and radial HyperBF with full or diagonal W (a HyperBF network with diagonal W has the same number of parameters as an MLP network with the same number of hidden units). All this implies that MLP networks are - for normalized inputs - more powerful than GRBF (and of course than RBF) networks with the same number of hidden units. Notice that the property of being more powerful is not necessarily an advantage here, since the number of parameters is larger (the parameters to be learned are 1 per hidden unit in the case of RBFs, d + 1 in the case of GRBF and d + 2 in the case of MLPs) and actual performance should be considered (see Maruyama et al., 1991, for an experimental comparison between different approximation schemes).


Function | Scheme          | Units | Parameters | L_∞     | L_2          | L'_∞    | L'_2
F_1      | MLP             | 2     | 8          | 0.00632 | 2.0 x 10^-5  | 0.00853 | 2.2 x 10^-5
F_1      | Constrained MLP | 2     | 6          | 0.00558 | 3.3 x 10^-5  | 0.00567 | 3.3 x 10^-5
F_1      | GRBF            | 2     | 6          | 0.0111  | 9.3 x 10^-5  | 0.0513  | 0.000212
F_1      | Multiscale GRBF | 2-2   | 12         | 0.00724 | 5.2 x 10^-5  | 0.0243  | 8.0 x 10^-5
F_2      | MLP             | 3     | 12         | 0.00786 | 2.6 x 10^-5  | 0.00895 | 5.3 x 10^-5
F_2      | Constrained MLP | 3     | 9          | 0.00764 | 5.7 x 10^-5  | 0.00966 | 8.8 x 10^-5
F_2      | GRBF            | 3     | 9          | 0.0122  | 1.00 x 10^-4 | 0.0144  | 1.30 x 10^-4
F_2      | Multiscale GRBF | 2-2   | 12         | 0.00579 | 1.6 x 10^-5  | 0.00740 | 2.4 x 10^-5
F_3      | MLP             | 3     | 12         | 0.0106  | 5.4 x 10^-5  | 0.0236  | 7.6 x 10^-5
F_3      | Constrained MLP | 3     | 9          | 0.0935  | 0.0102       | 0.158   | 0.0159
F_3      | Constrained MLP | 4     | 12         | 0.0268  | 0.000257     | 0.134   | 0.00105
F_3      | GRBF            | 3     | 9          | 0.0588  | 0.00323      | 0.0645  | 0.00484
F_3      | GRBF            | 4     | 12         | 0.0675  | 0.00351      | 0.0684  | 0.00503
F_3      | Multiscale GRBF | 2-2   | 12         | 0.0122  | 7.5 x 10^-5  | 0.00941 | 6.4 x 10^-5
F_3      | Multiscale GRBF | 3-2   | 18         | 0.0157  | 9.9 x 10^-5  | 0.140   | 9.2 x 10^-6
F_4      | MLP             | 3     | 12         | 0.0130  | 0.000130     | 0.0153  | 0.000312
F_4      | Constrained MLP | 3     | 9          | 0.340   | 0.182        | 0.392   | 0.231
F_4      | Constrained MLP | 4     | 12         | 0.181   | 0.0512       | 0.212   | 0.0689
F_4      | GRBF            | 3     | 9          | 0.325   | 0.167        | 0.365   | 0.209
F_4      | GRBF            | 4     | 12         | 0.122   | 0.0283       | 0.281   | 0.0391
F_4      | Multiscale GRBF | 2-2   | 12         | 0.103   | 0.00725      | 0.157   | 0.0114
F_4      | Multiscale GRBF | 3-2   | 18         | 0.00877 | 9.0 x 10^-5  | 0.00862 | 9.1 x 10^-5
F_5      | MLP             | 3     | 12         | 0.00874 | 9.9 x 10^-5  | 0.0579  | 0.000719
F_5      | Constrained MLP | 3     | 9          | 0.0513  | 0.00307      | 0.151   | 0.00743
F_5      | Constrained MLP | 4     | 12         | 0.0375  | 0.00420      | 0.102   | 0.00810
F_5      | GRBF            | 3     | 9          | 0.200   | 0.0740       | 0.334   | 0.0917
F_5      | GRBF            | 4     | 12         | 0.0668  | 0.0127       | 0.0872  | 0.0169
F_5      | Multiscale GRBF | 2-2   | 12         | 0.0260  | 0.00112      | 0.0775  | 0.00210
F_5      | Multiscale GRBF | 3-2   | 18         | 0.00821 | 9.3 x 10^-5  | 0.0661  | 0.000689

Table 2: Training and test errors, both in the L_2 and L_∞ norms, for the set of test functions F_1, ..., F_5.



Figure 10: Approximation of the function F_1 by an MLP with 2 hidden units: (plot 1) F_1 = x, (plot 2) basis 1, (plot 3) basis 2, (plot 4) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 4).

Figure 11: Approximation of the function F_2 by an MLP with 3 hidden units: (plot 1) F_2 = e^{-x^2}\cos(\frac{3}{2}\pi x), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 5).

Figure 12: Approximation of the function F_3 by an MLP with 3 hidden units: (plot 1) F_3 = e^{-x^2}\cos(\frac{5}{2}\pi x), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 6).

Figure 13: Approximation of the function F_4 by an MLP with 3 hidden units: (plot 1) F_4 = \frac{\sin(3\pi x)}{3\pi x}, (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 6).

Figure 14: Approximation of the function F_5 by an MLP with 3 hidden units: (plot 1) F_5 = e^{-x}\sin(5x), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 6).





     | (C(θ_1, ||w_1||), ||w_1||) | (C(θ_2, ||w_2||), ||w_2||) | (C(θ_3, ||w_3||), ||w_3||)
F_1  | (0.0893, 2.03)             | (0.0509, 2.14)             |
F_2  | (7.67, 6.30)               | (0.829, 0.492)             | (1.4 x 10^-5, 7.54)
F_3  | (11.4, 10.5)               | (0.00371, 0.504)           | (1.15 x 10^-4, 4.17)
F_4  | (7.48, 8.01)               | (1.60, 8.63)               | (0.0271, 7.70)
F_5  | (1.25, 2.37)               | (0.182, 6.63)              | (1.2 x 10^-5, 7.64)

Table 3: The values of C(θ_α, ||w_α||) and ||w_α|| corresponding to the hidden units of the MLP network (α = 1, 2, 3).




How critical is the condition of normalized inputs for the above arguments? Mathematically there is no simple way of extending them to inputs that are not normalized. We have already mentioned the very recent results of Poggio and Girosi, which show that HyperBF and some set of ridge function approximation schemes are regularization networks corresponding to two different classes of stabilizers (and therefore different priors in the associated Bayesian interpretation). In the context of the arguments of this paper, it is interesting to remark that normalized inputs have been used in several well-known applications, with good reported performances. NETtalk is one example. In NETtalk the input layer consists of 7 groups of units and each group contains 29 units (i.e. the number of units in the input layer is 203). Each group corresponds to a letter presented to NETtalk, and each unit represents a character of the alphabet (including period, blank, etc.). Therefore, when input letters are presented, only one unit in each group has the value "1" and the others have "0":

x = (\underbrace{0, \ldots, 1, \ldots, 0}_{\text{group 1}}, \ldots, \underbrace{0, \ldots, 1, \ldots, 0}_{\text{group 7}}) .

Clearly, \|x\|^2 = \mathrm{const} = 7 always.
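The constant norm of NETtalk-style inputs is immediate to check: with one active unit per group, the squared norm of the input vector equals the number of groups. A short Python illustration (our own) follows.

```python
import numpy as np

groups, units_per_group = 7, 29
letters = np.random.default_rng(0).integers(0, units_per_group, size=groups)
x = np.zeros(groups * units_per_group)                   # 203 input units
x[np.arange(groups) * units_per_group + letters] = 1.0   # one "1" per group
print(np.dot(x, x))   # always 7, independently of the letters presented
```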

For normalized inputs it seems, therefore, that there is a strict relation, almost an equivalence, between the vector of weights w in an MLP network and the vector of centers in the associated GRBF network. This fact has several potentially interesting consequences. The first one is that it may be possible to efficiently initialize the weights of an MLP. In particular, if we have a sufficient number of units (the same as the data size), we can design an optimal initial network, utilizing the properties of RBF and of the sigmoid-Gaussian quasi-equivalence. Several fast learning algorithms have been proposed for radial (basis) functions. They include Moody and Darken's method (Moody and Darken, 1989), based on the k-means algorithm, and Kohonen's LVQ. Our arguments imply that it should be possible to exploit these algorithms also for determining an initial structure of an MLP network.
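As a concrete (and purely illustrative) reading of this suggestion, one could cluster the training inputs, use the cluster centers as the MLP weight vectors w_α, and set the biases as in eq. (10), θ_α = -\frac{1}{2}(1 + \|w_\alpha\|^2), which also satisfies condition (22) in spirit. The sketch below uses scikit-learn's k-means; it is one possible initialization scheme consistent with the arguments above, not the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_mlp_from_centers(X, n_hidden, random_state=0):
    """Initialize MLP first-layer weights from k-means centers (illustrative sketch)."""
    centers = KMeans(n_clusters=n_hidden, n_init=10,
                     random_state=random_state).fit(X).cluster_centers_
    weights = centers                                    # w_alpha = t_alpha
    biases = -0.5 * (1.0 + np.sum(centers**2, axis=1))   # theta_alpha as in eq. (10)
    return weights, biases

# usage on normalized inputs
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W, b = init_mlp_from_centers(X, n_hidden=5)
print(W.shape, b.shape)
```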

The second point has to do with the common interpretation of the weights. T. Sejnowski (Sejnowski and Rosenberg, 1987) writes: "In NETtalk, the intermediate code was semidistributed - around 15% of the hidden units were used to represent each letter-to-sound correspondence. The vowels and the consonants were fairly well segregated, arguing for local coding at a gross population level (something seen in the brain) but distributed coding at the level of single units (also observed in the brain)." Our results seem to suggest that the "intermediate code" may often be a collection of appropriate "templates", in particular some of the examples.

References 

[1] D. Braess. Nonlinear Approximation Theory. Springer-Verlag, Berlin, 1986.

[2] S.M. Carroll and B.W. Dickinson. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, pages I-607-I-611, Washington D.C., June 1989. IEEE TAB Neural Network Committee.

[3] G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals, 2(4):303-314, 1989.

[4] D.L. Donoho and I.M. Johnstone. Projection-based approximation and a duality with kernel methods. The Annals of Statistics, 17(1):58-106, 1989.

[5] R.L. Eubank. Spline Smoothing and Nonparametric Regression, volume 90 of Statistics, Textbooks and Monographs. Marcel Dekker, Basel, 1988.

[6] J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817-823, 1981.

[7] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183-192, 1989.

[8] F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63:169-176, 1990.

[9] R.P. Gosselin. Nonlinear approximation using rapidly increasing functions. Constr. Approx., 2:253-261, 1986.

[10] R. Hecht-Nielsen. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pages I-593-I-605, Washington D.C., June 1989. IEEE TAB Neural Network Committee.

[11] C.R. Hobby and J.R. Rice. Approximation from a curve of functions. Arch. Rat. Mech. Anal., 27:91-106, 1967.

[12] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.

[13] P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435-475, 1985.

[14] C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approx., 2:11-22, 1986.

[15] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.

[16] T. Poggio and F. Girosi. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1989.

[17] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.

[18] T. Poggio and F. Girosi. HyperBF: A powerful approximation technique for learning. In Patrick H. Winston and Sarah A. Shellard, editors, Artificial Intelligence at MIT: Expanding Frontiers, Vol. 1. M.I.T. Press, Cambridge, MA, 1990a.

[19] T. Poggio and F. Girosi. A theory of networks for learning. Science, 247:978-982, 1990a.

[20] M.J.D. Powell. Radial basis functions for multivariable interpolation: a review. In J.C. Mason and M.G. Cox, editors, Algorithms for Approximation. Clarendon Press, Oxford, 1987.

[21] J.R. Rice. The Approximation of Functions, Vol. 1. Addison-Wesley, Reading, MA, 1964.

[22] J.R. Rice. The Approximation of Functions, Vol. 2. Addison-Wesley, Reading, MA, 1969.

[23] L.L. Schumaker. Spline Functions: Basic Theory. John Wiley and Sons, New York, 1981.

[24] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987.

[25] M. Stinchcombe and H. White. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, pages I-607-I-611, Washington D.C., June 1989. IEEE TAB Neural Network Committee.

[26] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.