
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
and
CENTER FOR BIOLOGICAL INFORMATION PROCESSING, WHITAKER COLLEGE

A.I. Memo No. 1291, April 1992
C.B.I.P. Paper No.

# A Connection between GRBF and MLP

Minoru Maruyama, Federico Girosi and Tomaso Poggio

## Abstract

Both multilayer perceptrons (MLP) and Generalized Radial Basis Functions (GRBF) have good approximation properties, theoretically and experimentally. Are they related? The main point of this paper is to show that for normalized inputs, multilayer perceptron networks are radial function networks (albeit with a non-standard radial function). This provides an interpretation of the weights w as centers t of the radial function network, and therefore as equivalent to templates. This insight may be useful for practical applications, including better initialization procedures for MLP. In the remainder of the paper, we discuss the relation between the radial functions that correspond to the sigmoid for normalized inputs and well-behaved radial basis functions, such as the Gaussian. In particular, we observe that the radial function associated with the sigmoid is an activation function that is a good approximation to Gaussian basis functions for a range of values of the bias parameter. The implication is that a MLP network can always simulate a Gaussian GRBF network (with the same number of units but fewer parameters); the converse is true only for certain values of the bias parameter. Numerical experiments indicate that this constraint is not always satisfied in practice by MLP networks trained with backpropagation. Multiscale GRBF networks, on the other hand, can approximate MLP networks with a similar number of parameters.

© Massachusetts Institute of Technology, 1992

This paper describes research done within the Center for Biological Information Processing, in the Department of Brain and Cognitive Sciences, and at the Artificial Intelligence Laboratory.
This research is sponsored by a grant from the Office of Naval Research (ONR), Cognitive and Neural Sciences Division; by the Artificial Intelligence Center of Hughes Aircraft Corporation; by the Alfred P. Sloan Foundation; and by the National Science Foundation. Support for the A.I. Laboratory's artificial intelligence research is provided by the Advanced Research Projects Agency of the Department of Defense under Army contract DACA76-85-C-0010, and in part by ONR contract N00014-85-K-0124.

## 1 Introduction

Techniques and networks for learning from examples may be considered as methods for approximating multivariate functions from sparse data (the "examples"). In recent papers we have developed one such technique, which we have called regularization networks, of which Radial Basis Functions are a special case and Hyper Basis Functions are the most general form (see Poggio and Girosi, 1990, and references therein for a review). The underlying theory is quite well developed and understood: for instance, the role of the hidden units in the corresponding network (see Fig. (1)) is easily explained. The method has also been demonstrated to work well in a number of practical applications. Another technique, which is extremely popular, is associated with multilayer perceptrons, typically used in conjunction with a version of gradient descent (for estimating the parameters) called backpropagation. In this paper we will consider multilayer perceptrons with one hidden layer and linear output units. These networks have been used successfully in many cases of learning from examples. The underlying theory is less developed, though a few theoretical results have been obtained in recent months.
In particular, it has been proved that MLP have what we call the Weierstrass property, that is, they can approximate arbitrarily well, provided enough units are available, any continuous function on a compact set (Hornik, Stinchcombe and White, 1989; Stinchcombe and White, 1989; Carroll and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989). Regularization networks also have this property, which is shared by many other approximation schemes (Girosi and Poggio, 1989). It is natural to explore the relation between the two techniques, especially because the corresponding networks have a superficially similar structure, both having one hidden layer of units, as shown by Fig. (1). The network of Fig. (1) represents the class of functions of the type

f(x) = Σ_{i=1}^{N} c_i H_i(x)   (1)

where the H_i are functions that depend on some parameters to be estimated together with the coefficients c_i. When the functions H_i are kept fixed, the function f(x) is linear in its parameters (the c_i), and the resulting approximation scheme is linear.

Figure 1: The most general network with one layer of hidden units. Here we show the two-dimensional case, in which x = (x, y). Each function H_i can depend on a set of unknown parameters, which are computed during the learning phase, as well as on the coefficients c_i. When H_i = σ(w_i · x + θ_i) the network is a multilayer perceptron; when H_i = h(x − t_i) the network is a regularization network, for appropriate choices of h.

Depending on the function H_i, different approximation schemes can be obtained. Two common choices are the following:

1. Regularization networks, corresponding to the choice:

H_i(x) = h(||x − t_i||_W)

where h is a (conditionally) positive definite function, the t_i are d-dimensional vectors called "centers", W is a d × d matrix, and we have defined the weighted norm || · ||_W as ||x||²_W = xᵀ Wᵀ W x. We call this scheme HyperBF.
RBF is the case in which the centers coincide with the data points, and GRBF (Generalized Radial Basis Functions) is the case in which the centers t_α are free parameters to be estimated but the weight matrix W is fixed and equal to the identity matrix. This class of approximation schemes can be formally derived, in the framework of regularization theory, from a variational principle which has a Bayesian interpretation. It includes:

- kernel estimation techniques and Parzen windows
- splines
- HyperBF and RBF

2. Ridge function approximation:

H_i(x) = h_i(x · w_i + θ_i)

where the w_i are d-dimensional vectors called "weights", and the parameters θ_i are bias terms. This form of approximation did not have until now any variational or Bayesian interpretation. Very recent results by Poggio and Girosi (unpublished) show that a set of ridge function approximation schemes can be derived as limits of regularization networks for an appropriate class of stabilizers. Several techniques are included in this class:

- Projection Pursuit Regression (PPR): it corresponds to an expansion of the type of eq. (1) in which all the coefficients c_i are set to 1, and H_i(x) = h_i(x · w_i), where the vectors w_i are normalized (||w_i|| = 1). The functions h_i are determined by means of some nonparametric estimation technique, in the following iterative way:

(a) Assume that we already have k − 1 terms, and let

r_i = y_i − Σ_{j=1}^{k−1} h_j(x_i · w_j)

be the residual of the approximation;

(b) To find the next term, consider the following sum of the residuals

Σ_{i=1}^{N} (r_i − g_k(x_i · w_k))²

and find the direction w_k which minimizes it, together with the corresponding function h_k = g_k.

- Flexible Fourier series: the functions h_i are all equal to the cosine (or sine) function, and therefore:

H_i(x) = cos(x · w_i + θ_i) .   (2)

If we assume that the function g underlying the data has a Fourier transform g̃(s) = |g̃(s)| e^{iθ(s)}, then
g(x) = ℜ( ∫ ds e^{i s·x} |g̃(s)| e^{iθ(s)} ) = ∫ ds |g̃(s)| cos(s·x + θ(s)) ,   (3)

and the expansion of eq. (1) with the choice (2) looks like a cubature formula for the integral above, where the vectors w_i are the points at which the integrand is sampled. In this case the interpretation of the "weights" w_i is clear: they represent the fundamental frequency content of the underlying function. It can also be noticed that the problem of finding the optimal parameters of the parametric function is equivalent to the problem of finding the optimal sample points to use in the discretization of the integral above.

- Multilayer perceptrons: the functions h_i are all equal to a sigmoidal function:

H_i(x) = σ(x · w_i + θ_i) .   (4)

The function σ is usually defined as σ(x) = 1/(1 + e^{−x}), but other choices (such as the hyperbolic tangent) are equivalent, as long as the sigmoidal shape is maintained. While in the approximation by trigonometric functions the parameters w_i have a simple interpretation in terms of frequencies, in this case the meaning of the w_i is less clear. If the expansion (4) is considered from the point of view of projection pursuit regression (Friedman and Stuetzle, 1981; Huber, 1985), the w_i are the "relevant directions" that are supposed to encode the most information about the function.

- Exponential sums: when the problem is one dimensional, a well known nonlinear technique consists in approximation by exponential sums (Braess, 1986; Gosselin, 1986; Hobby and Rice, 1967), and a number of results are available in this case. In more than one dimension, the natural extension corresponds to ridge function approximation with the choice:

H_i(x) = e^{x·w_i} .   (5)

Notice that the bias terms disappear, since they are absorbed by the coefficients.

We group the ridge function approximation schemes together because they all have the same general form: a linear combination of nonlinear functions of the scalar product of the input vector with a parameter vector.
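Both classes instantiate eq. (1); only the form of the hidden unit H changes. A minimal Python sketch (ours; the function names are illustrative, not from the paper) makes the shared structure explicit:

```python
import math

def ridge_unit(x, w, theta):
    """A ridge unit: a nonlinear function of the scalar product x . w plus
    a bias (here the sigmoid of eq. (4))."""
    return 1.0 / (1.0 + math.exp(-(sum(a * b for a, b in zip(x, w)) + theta)))

def radial_unit(x, t):
    """A radial unit: a nonlinear function of the squared distance
    ||x - t||^2 from a center t (here a Gaussian)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, t)))

def network(x, coeffs, units):
    """Eq. (1): f(x) = sum_i c_i H_i(x), for any mix of hidden units."""
    return sum(c * H(x) for c, H in zip(coeffs, units))
```

A regularization network and a multilayer perceptron differ only in which hidden units are passed to `network`.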
The main difference between these two classes of approximation schemes (which, as we mentioned earlier, can both be derived from regularization in terms of two different classes of stabilizers, reflecting different prior assumptions¹) seems to be related to the use of the scalar product x · w_i instead of the weighted Euclidean distance ||x − t_i||_W as argument of the parametric functions h. At first sight, these two broad classes of techniques do not seem to have any relationship. The main point of this paper is to show that in some special situations there is a close connection between these two classes of approximation schemes, and in particular between Gaussian GRBF (i.e. HyperBF networks with W = I) and MLP with sigmoidal units in the hidden layer.

¹ The formulation of the learning problem in terms of regularization is satisfying from a theoretical point of view, since it establishes connections with a large body of results in the area of Bayes estimation and in the theory of approximation of multivariate functions.

## 2 Normalized Inputs: Ridge Functions are Radial Functions

In the case in which all the input variables are normalized, that is, they lie on the unit d-dimensional sphere, there is a simple connection between ridge and radial functions. In fact the identity

||x − t||² = ||x||² + ||t||² − 2 x·t   (6)

for ||x|| = 1 becomes:

||x − t||² = 1 + ||t||² − 2 x·t .   (7)

We can now use identity (7) to obtain a ridge function scheme from a Radial Basis Functions scheme and vice versa.

- From radial basis functions to multilayer perceptrons. Substituting identity (7) in the Radial Basis Functions expansion

f(x) = Σ_{α=1}^{N} c_α H(||x − t_α||²)   (8)

we obtain

f(x) = Σ_{α=1}^{N} c_α H̃(x · t_α + θ_α) ,   (9)

where we have defined

H̃(x) = H(−2x) ,  θ_α = −½ (1 + ||t_α||²) .   (10)

We notice that eq. (9) is the expansion corresponding to a multilayer perceptron network.
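This correspondence is easy to check numerically. A sketch (ours), using a Gaussian as the radial activation H, verifies that for a unit-norm input the radial unit H(||x − t||²) and the ridge unit H̃(x·t + θ) of eqs. (9)-(10) compute the same value:

```python
import math, random

random.seed(0)

def H(r2):
    """A radial activation; a Gaussian is used here as an example."""
    return math.exp(-r2)

def H_tilde(z):
    """The associated ridge activation, H~(z) = H(-2z), as in eq. (10)."""
    return H(-2.0 * z)

d = 5
t = [random.gauss(0, 1) for _ in range(d)]      # center of the radial unit
theta = -0.5 * (1.0 + sum(v * v for v in t))    # bias fixed by eq. (10)

# a random input on the unit sphere
x = [random.gauss(0, 1) for _ in range(d)]
nrm = math.sqrt(sum(v * v for v in x))
x = [v / nrm for v in x]

radial = H(sum((a - b) ** 2 for a, b in zip(x, t)))
ridge = H_tilde(sum(a * b for a, b in zip(x, t)) + theta)
assert abs(radial - ridge) < 1e-9   # the two units compute the same value
```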
The only difference is that while in the multilayer perceptron network the bias parameter θ_α is allowed to vary along the real line, in this case it is constrained to lie in the interval (−∞, −½]. We can therefore conclude that, when inputs are normalized, given the RBF network of eq. (8) with activation function H it is always possible to define a multilayer perceptron with the same number of units and with activation function H̃ that computes the same function. The synaptic weights connecting the input and the hidden layer are the centers of the Radial Basis Functions expansion, and the bias parameters are uniquely determined by the synaptic weights.

- From multilayer perceptrons to radial basis functions. In the previous case we have seen that Radial Basis Functions can be simulated by a multilayer perceptron. This is not surprising, since a Radial Basis Functions unit has one parameter less than a multilayer perceptron unit. For the same reason we cannot expect to simulate a multilayer perceptron unit with d inputs by a Radial Basis Functions unit with the same number of inputs. However, this may be possible if we add to the Radial Basis Functions unit a dummy input, so that the coordinate of the center corresponding to the dummy variable provides the missing parameter. Let us consider the multilayer perceptron expansion:

f(x) = Σ_{α=1}^{N} c_α σ(x · w_α + θ_α) .   (11)

Using identity (7) we obtain:

f(x) = Σ_{α=1}^{N} c_α σ(−½ ||x − w_α||² + d_α) ,   (12)

where we have defined

d_α = ½ (1 + ||w_α||²) + θ_α .

We now rewrite the expansion as a Radial Basis Functions expansion in d + 1 dimensions, in which one of the input variables is kept fixed and equal to some number (1, for simplicity). Introducing the d + 1 dimensional vectors

x̃ = (x, 1) ,  w̃_α = (w_α, v_α) ,

we can rewrite the expansion (12) as

f(x) = Σ_{α=1}^{N} c_α σ(−½ ||x̃ − w̃_α||² + ½ (1 − v_α)² + d_α) .   (13)

Expansion (13) becomes a Radial Basis Functions expansion if one of the two following conditions is satisfied:

1.
There exists v_α such that

½ (1 − v_α)² = −d_α ;   (14)

2. There exist functions g₁ and g₂ such that

σ(−½ x + y) = g₁(x) g₂(y) .   (15)

In the first case, in fact, the multilayer perceptron expansion becomes

f(x) = Σ_{α=1}^{N} c_α h(||x̃ − w̃_α||²)   (16)

where we have defined the function h(x) = σ(−½ x). In the second case the multilayer perceptron expansion becomes

f(x) = Σ_{α=1}^{N} c′_α g₁(||x̃ − w̃_α||²) ,   (17)

where c′_α = c_α g₂(d_α + ½ (1 − v_α)²).

Remarks

In case (1), a solution to eq. (14) does not exist unless the condition

½ (1 + ||w_α||²) + θ_α ≤ 0

is satisfied. This condition on the weights w_α and on the bias parameter θ_α defines a subclass of multilayer perceptrons that can be simulated by a Radial Basis Functions network with the same number of units and a dummy input, and therefore the same number of parameters.

In case (2), evaluating eq. (15) first at y = 0 and then at x = 0 we obtain

g₁(x) = σ(−½ x)/g₂(0) ,  g₂(y) = σ(y)/g₁(0) ,

and after some algebra

σ(x + y) = σ(x) σ(y)/σ(0) .   (18)

It is well known that this functional equation has only one class of solutions, given by

σ(x) = c e^{λx} ,  ∀ c, λ ∈ R .

Therefore radial units can simulate arbitrary multilayer perceptron units only in the case in which the activation function of the multilayer perceptron network is an exponential function. In this case the corresponding Radial Basis Functions network is a Gaussian network. In fact, setting σ(x) = eˣ in eq. (13) we obtain the expansion

f(x) = Σ_{α=1}^{N} c̃_α e^{−½ ||x − w_α||²} .

Notice also that in this case there is no need of a dummy input, since the bias term can be dropped by a redefinition of the coefficients, due to property (18) of the exponential function.

## 3 Sigmoid MLPs and Gaussian GRBFs

In this section we compare the sigmoid and the Gaussian function for normalized input vectors ||x|| = 1. Under these conditions the sigmoidal function becomes a radial function, which can approximate any Gaussian function well enough within a certain range of the bias parameter θ.
For normalized inputs, any ridge function is equivalent to an appropriate radial function. Consider the sigmoid function

σ(w · x + θ) = 1/(1 + exp(−(w · x + θ))) ,  w ∈ R^d, θ ∈ R .

The corresponding one-parameter family of radial functions, parametrized by λ ∈ R (λ > 0), is:

σ(w · x + θ) = 1/(1 + C(λ) exp(λ ||x − t||²)) ,  C(λ) > 0 ,   (19)

where we have defined:

t = w/(2λ) ,  C(λ) = exp(−(θ + λ (1 + ||t||²))) .   (20)

### 3.1 Sigmoid units can approximate Gaussian units

The existence of the free parameter λ indicates that the elements of the one-parameter family given by (19) are all equivalent to each other for normalized inputs. To compare Gaussians and the radial functions associated with sigmoids, we should measure the discrepancy between the two functions. For the purpose of the comparison, we first take the function closest to a Gaussian among all the elements of the one-parameter family defined above. It turns out that the radial function (19) approximates the Gaussian function well if C(λ) ≫ 1 holds (see figures 2 and 3). According to this observation, adopting C(λ) as a measure of closeness to the Gaussian, we consider the function whose parameter λ* corresponds to the maximum of C(λ):

C(λ*) = max_λ C(λ) .

Solving the equation dC/dλ = 0, we get:

λ* = ||w||/2 .

Substituting λ = ||w||/2, we obtain the following radial function R^d → R, which has its center on the unit sphere S^{d−1}:

σ(w · x + θ) = 1/(1 + C(θ, ||w||) exp(½ ||w|| ||x − w/||w|| ||²)) ,   (21)

C(θ, ||w||) = exp(−(θ + ||w||)) .

Figure 2: The radial function f(r) = 1/(1 + C e^{r²}) associated with sigmoids (see eq. 19), for 3 different values of C: C = 5, 10⁻¹, 10⁻³.

Figure 3: The Gaussian function f(r) = e^{−r²} and the radial functions f(r) = (1 + C)/(1 + C e^{r²}) for 3 different values of C: C = 1, 10⁻¹, 10⁻³. The role of the scale factor 1 + C is to force the value at the origin to be 1, to ease the comparison with the Gaussian.
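Eqs. (19)-(20) can be verified numerically. The following sketch (ours) checks that, for a unit-norm input, the sigmoid unit equals the radial form for several values of λ, including λ* = ||w||/2:

```python
import math, random

random.seed(1)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

d = 4
w = [random.gauss(0, 1) for _ in range(d)]
theta = random.gauss(0, 1)
x = [random.gauss(0, 1) for _ in range(d)]
nrm = math.sqrt(sum(v * v for v in x))
x = [v / nrm for v in x]                  # normalized input, ||x|| = 1

dot = sum(a * b for a, b in zip(x, w))
w2 = sum(v * v for v in w)

for lam in (0.3, 1.0, math.sqrt(w2) / 2.0):   # any lambda > 0 works
    t = [v / (2.0 * lam) for v in w]          # center t = w / (2 lambda)
    C = math.exp(-(theta + lam * (1.0 + sum(v * v for v in t))))  # eq. (20)
    r2 = sum((a - b) ** 2 for a, b in zip(x, t))
    radial = 1.0 / (1.0 + C * math.exp(lam * r2))   # eq. (19)
    assert abs(radial - sigmoid(dot + theta)) < 1e-9
```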
As shown in figures (2) and (3), the radial function (21) is quite similar to a Gaussian function if C satisfies C ≫ 1, or equivalently, if the bias parameter satisfies

θ < −||w|| .   (22)

This implies that a MLP network can always simulate any Gaussian GRBF network well enough if both networks consist of the same number of units (of course the MLP network has one more free parameter per hidden unit than a GRBF network), since θ can always be chosen to satisfy (22).

### 3.2 The Gaussian Radial Basis Function on the sphere

Let us first consider a Gaussian function for normalized inputs. When the d-dimensional input vectors x ∈ R^d are normalized as ||x|| = 1, for any Gaussian function the following relation holds:

c exp(−μ² ||x − t||²) = c′(λ) exp(−μ′(λ)² ||x − t′(λ)||²)

where λ ∈ R (λ ≠ 0) is an arbitrary parameter and

c′(λ) = c exp(−μ² (1 − λ)(1 − ||t||²/λ)) ,  μ′(λ) = λ^{1/2} μ ,  t′(λ) = t/λ .

This shows that the representation c exp(−μ² ||x − t||²) has redundancy for normalized inputs. For example, for any Gaussian function c exp(−μ² ||x − t||²), the following types of Gaussians, which are given by setting λ = ||t|| and λ = μ⁻², respectively, are equivalent for normalized inputs:

c exp(−μ² ||x − t||²) = c′ exp(−μ² ||t|| ||x − t/||t|| ||²)   (c′ = c exp(−μ² (1 − ||t||)²))

c exp(−μ² ||x − t||²) = c″ exp(−||x − μ² t||²)   (c″ = c exp((1 − μ²)(1 − ||μ t||²))) .

The above indicates that the total number of free parameters of a Gaussian for normalized inputs is d + 1, and that we may use either Gaussians with normalized centers,

c exp(−μ² ||x − t||²) ,  ||t|| = 1 ,

or those given by

c exp(−||x − t||²)

as basis functions for normalized inputs.

### 3.3 Can Gaussian units approximate sigmoid units?

From the above observation, we see that MLP can be more flexible than Radial Basis Functions for normalized inputs if the total number of units is the same, since the total number of parameters is then larger for MLPs. This is based upon a one-to-one comparison of the individual units of the two networks.
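As an aside, the redundancy relation of subsection 3.2 is easy to check numerically. A sketch (ours) verifies that rescaling the center by λ, with c and μ adjusted as above, leaves the Gaussian unchanged on the unit sphere:

```python
import math, random

random.seed(2)

d = 3
t = [random.gauss(0, 1) for _ in range(d)]    # Gaussian center
mu2 = 0.7                                     # mu^2
c = 1.3
x = [random.gauss(0, 1) for _ in range(d)]
nrm = math.sqrt(sum(v * v for v in x))
x = [v / nrm for v in x]                      # normalized input, ||x|| = 1

t2 = sum(v * v for v in t)
lhs = c * math.exp(-mu2 * sum((a - b) ** 2 for a, b in zip(x, t)))

for lam in (0.5, 2.0, math.sqrt(t2)):         # any lambda != 0
    c_p = c * math.exp(-mu2 * (1.0 - lam) * (1.0 - t2 / lam))
    t_p = [v / lam for v in t]                # rescaled center t / lambda
    rhs = c_p * math.exp(-lam * mu2 * sum((a - b) ** 2 for a, b in zip(x, t_p)))
    assert abs(lhs - rhs) < 1e-9
```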
We expect that it may be possible to approximate the radial function (21) using a set of m Gaussians with the same center:

1/(1 + C e^{½||w|| r²}) ≈ Σ_{j=1}^{m} a_j e^{−b_j r²}   (b_j > 0) .   (23)

Figure (4) shows the experimental results of approximating the cases C = 0.01 and C = 0.0001 using 3 Gaussians for each radial function (sigmoid). The results imply the possibility of approximating a sigmoid by a set of Gaussians. The number of parameters of a sigmoid is d + 2. However, d + 2m − 1 parameters are required for the Gaussians to approximate a sigmoid according to (23) (thus in the experiments, the total numbers of parameters of a sigmoid and of a set of Gaussians are d + 2 and d + 5, respectively). In this subsection, we consider the possibility of approximating a sigmoid by a set of Gaussians with a similar number of parameters.

Figure 4: Approximation of the radial function f(r) = 1/(1 + C exp(r²)) (plot 1) by superposition of 3 Gaussians (plot 2): (a) C = 0.01, (b) C = 0.0001.

#### 3.3.1 Approximation by Gaussians with Constant Scales

In (Poggio and Girosi, 1990a) we have extended our learning theory based on regularization and proposed the use of radial basis functions at multiple scales. To explore the possibility of approximating MLP by multiscale GRBF with a similar number of parameters, we consider the following basis function with constant scales {B_j}_{j=1}^{m}:

φ_α = Σ_{j=1}^{m} a_{αj} e^{−B_j ||x − t_α||²} ,  ||t_α|| = 1 ,   (24)

which approximates the radial function (21)

1/(1 + C exp(½ ||w|| ||x − t_α||²)) .

In this case, the multiscale function φ_α has d + m − 1 parameters. Since the center t_α and the input x are normalized, the condition

0 ≤ ||x − t_α|| ≤ 2

is satisfied. Therefore the coefficients {a_j} are determined by solving the minimization problem:

∫₀² ( 1/(1 + C e^{½||w|| r²}) − Σ_{j=1}^{m} a_j e^{−B_j r²} )² dr → min ,

and are given by the solution of the following linear system:

U a = v ,  U_{ij} = ∫₀² e^{−(B_i + B_j) r²} dr ,  v_i = ∫₀² e^{−B_i r²} / (1 + C e^{½||w|| r²}) dr .

In Fig.
5, 6 and 7 we show experimental results on the approximation error for m = 3, 4, 5 (the number of parameters of each scheme is thus d + 2, d + 3, d + 4, respectively). The approximation errors are evaluated as:

ε = ∫₀² ( 1/(1 + C e^{½||w|| r²}) − Σ_{j=1}^{m} a_j e^{−B_j r²} )² dr / ∫₀² dr/(1 + C e^{½||w|| r²})² .

In the experiments, the scales {B_j} are given by the results of experiment (23). The results show that, if the length of the weight vector is not too large, a sigmoid for normalized inputs can be well approximated by a superposition of Gaussians with the same number of parameters.

Figure 5: Approximation of the radial function by multiscale Gaussians: 3 scales.

Figure 6: Approximation of the radial function by multiscale Gaussians: 4 scales.

Figure 7: Approximation of the radial function by multiscale Gaussians: 5 scales.

#### 3.3.2 Approximation by Gaussians with a Fixed Ratio of Scales

In the previous subsection we have shown that a superposition of Gaussians can approximate sigmoid units very well, for a certain range of parameters, with the same (or a similar) number of parameters. In this subsection we consider, as another possibility of approximating any sigmoid with multiscale GRBF, the following basis function:

φ_α = Σ_{j=1}^{3} a_{αj} e^{−μ_α B_j ||x − t_α||²} ,  ||t_α|| = 1 ,   (25)

where the ratios of the scales B_j (j = 1, 2, 3) are constants. The total number of parameters of this basis function is d + 3. In Figure (8) we show some results of approximation experiments.
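In both variants the coefficients come from the same least-squares computation, the linear system U a = v built from the integrals above. A sketch of that computation (ours), with the exponent written simply as r² (absorbing the factor ½||w|| into the scales) and illustrative scale values:

```python
import math

def fit_multiscale(C, B, n=2000):
    """Least-squares coefficients a_j so that sum_j a_j exp(-B_j r^2)
    approximates f(r) = 1/(1 + C exp(r^2)) on [0, 2], by solving U a = v.
    Uses trapezoidal quadrature; the scales B are assumed given."""
    f = lambda r: 1.0 / (1.0 + C * math.exp(r * r))
    h = 2.0 / n
    rs = [k * h for k in range(n + 1)]
    wts = [h * (0.5 if k in (0, n) else 1.0) for k in range(n + 1)]
    m = len(B)
    U = [[sum(w * math.exp(-(B[i] + B[j]) * r * r) for r, w in zip(rs, wts))
          for j in range(m)] for i in range(m)]
    v = [sum(w * math.exp(-B[i] * r * r) * f(r) for r, w in zip(rs, wts))
         for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        p = max(range(col, m), key=lambda row: abs(U[row][col]))
        U[col], U[p] = U[p], U[col]
        v[col], v[p] = v[p], v[col]
        for row in range(col + 1, m):
            fct = U[row][col] / U[col][col]
            U[row] = [u - fct * uc for u, uc in zip(U[row], U[col])]
            v[row] -= fct * v[col]
    a = [0.0] * m
    for row in range(m - 1, -1, -1):
        a[row] = (v[row] - sum(U[row][j] * a[j] for j in range(row + 1, m))) / U[row][row]
    return a

# example: coefficients for C = 0.01 with three assumed scales
a = fit_multiscale(0.01, [0.5, 1.0, 2.0])
```

The scale values 0.5, 1.0, 2.0 are placeholders for illustration, not the values used in the paper's experiments.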
In the experiments, {B_j} are given from the results of experiment (23) as before, and the coefficients {a_j} are determined so that

∫₀² ( 1/(1 + C e^{r²}) − Σ_{j=1}^{3} a_j e^{−B_j r²} )² dr → min .

The optimal {a_j} are obtained by solving the linear equations:

U a = v .

Table (1) shows the resulting approximation errors, evaluated as in the previous subsection.

Figure 8: Approximation of the radial function f(r) = 1/(1 + C exp(r²)) (plot 1) by superposition of 3 Gaussians with fixed ratio of scales (plot 2): (a) C = 0.1, (b) C = 0.001, (c) C = 0.00001.

| C | 1 | 0.1 | 0.01 | 0.001 | 0.0001 | 0.00001 |
|---|---|---|---|---|---|---|
| ε | 0.00165 | 0.00243 | 0.0171 | 0.0204 | 0.0162 | 0.0307 |

Table 1: The relative approximation error ε obtained approximating a sigmoid with a multiscale GRBF in which the ratios of the variances of the Gaussians are kept fixed.

## 4 Experiments

In the previous section we have shown that sigmoid MLPs can always simulate any Gaussian GRBF network when both networks consist of the same number of units (sigmoids and Gaussians). The converse is also true if the MLP's bias parameters are restricted to the range given by (22). To investigate whether this constraint is always satisfied in practice, numerical experiments were carried out.

Even if the original inputs are not normalized, it is always possible to obtain normalized inputs by adding one more dimension, as follows:

x = (x₁, ..., x_d) ,  |x_i| ≤ B ,

x → x′ = (x′₁, ..., x′_d, x′_{d+1}) ,  x′_i = x_i/(B √d)  (i = 1, ..., d) ,  x′_{d+1} = √(1 − Σ_{i=1}^{d} x′_i²) .

In the experiments, we tried to approximate one-dimensional functions which are mapped onto a (hemi)circle using the technique shown above:

F(x, y) = F(x) ,  |x| ≤ 1 ,  y = √(1 − x²) .

The functions used in our experiments are the following (Fig. (9)):

- F₁ = x
- F₂ = e^{−x²} cos(3πx/2)
- F₃ = e^{−x²} cos(5πx/2)
- F₄ = sin(3πx)/(3πx)
- F₅ = e^{−x} sin(5x)

They were approximated with two-dimensional MLP (sigmoids), constrained MLP, GRBF (Gaussians), and multiscale Gaussians, given below.
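The lifting of inputs onto the unit sphere described above can be sketched as follows (ours; the scaling constant 1/(B√d) is one concrete choice that keeps the sum of the first d squared coordinates at most 1):

```python
import math

def lift_to_sphere(x, bound):
    """Map x in R^d with |x_i| <= bound onto the unit sphere in R^(d+1):
    scale the coordinates, then append one extra coordinate that restores
    unit norm."""
    d = len(x)
    xp = [v / (bound * math.sqrt(d)) for v in x]
    xp.append(math.sqrt(max(0.0, 1.0 - sum(v * v for v in xp))))
    return xp
```

For d = 1 and bound = 1 this reduces to the map x → (x, √(1 − x²)) onto the hemicircle used in the experiments.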
Figure 9: The test functions used in our experiments: F₁ = x (plot 1), F₂ = e^{−x²} cos(3πx/2) (plot 2), F₃ = e^{−x²} cos(5πx/2) (plot 3), F₄ = sin(3πx)/(3πx) (plot 4), F₅ = e^{−x} sin(5x) (plot 5).

- MLP (sigmoids):

f(x, y) = Σ_{α=1}^{K} c_α σ(w_α · x + θ_α) ,  σ(u) = 1/(1 + e^{−u}) .

- Constrained MLP (sigmoids):

f(x, y) = Σ_{α=1}^{K} c_α σ(w_α · x + θ_α) ,  θ_α = −¼ (||w_α||² + 4) ,

so that

C(θ_α, ||w_α||) = exp(¼ (||w_α|| − 2)²) ≥ 1 .

- GRBF (Gaussians):

f(x, y) = Σ_{α=1}^{K} c_α exp(−||x − t_α||²) .

- Multiscale GRBF:

f(x, y) = Σ_{α=1}^{K} Σ_{j=1}^{m} c_{αj} exp(−b_{αj} ||x − t_α||²) .

Approximation performances were compared according to the normalized L₂ and L∞ errors, measured on both the training set and the evaluation set, as follows:

L₂ = Σ_{i=1}^{N} (F(x_i) − f(x_i))² / Σ_{i=1}^{N} F(x_i)² ,  L∞ = max_{i=1,...,N} |F(x_i) − f(x_i)| / max_{i=1,...,N} |F(x_i)| ,

L′₂ = Σ_{p=1}^{M} (F(x_p) − f(x_p))² / Σ_{p=1}^{M} F(x_p)² ,  L′∞ = max_{p=1,...,M} |F(x_p) − f(x_p)| / max_{p=1,...,M} |F(x_p)| ,

where {x_i}_{i=1}^{N} and {x_p}_{p=1}^{M} are the training set and the evaluation set, respectively. In our experiments these sets are randomly chosen, and the numbers of points in the training set and evaluation set are N = 20 and M = 100, respectively.

In Table 2 we report the training and test errors, both in the L₂ and L∞ norms, that we obtained applying the techniques described above to the set of test functions F₁, ..., F₅. In Table 3 we report the values of C(θ_α, ||w_α||) and ||w_α|| for the hidden units of the MLP network. These results show that condition (22) is not always satisfied. To see how the target functions were approximated by the MLP (sigmoids), we show in figures (10), (11), (12), (13) and (14) the solutions obtained in our experiments.

## 5 Conclusions

This paper explores the relationship between MLPs and GRBF. Its main point is that under the condition of normalized inputs the two representations are very similar, though not identical.
This is somewhat surprising, especially since MLPs and RBF can be regarded as representative of two different ways of approximating functions, based respectively on scalar products and on Euclidean distances. For normalized d-dimensional input vectors, sigmoidal MLPs can always approximate essentially arbitrarily well a given GRBF (which has M fewer parameters, where M is the number of hidden units). The converse is true only for a certain range of the bias parameter of the sigmoidal unit. We have verified experimentally that in MLPs trained with backpropagation the bias parameters do not always respect this condition. Therefore MLPs generated by backpropagation cannot in general be approximated arbitrarily well by GRBFs with the same number of hidden units (and M fewer parameters). GRBFs that have 3 Gaussians per center, and therefore the same number of parameters as a MLP network with the same number of hidden units, yield a reasonable approximation of a given sigmoid MLP but, again, not for all parameter values. Within the framework of this paper, there seems to be no simple answer to the question of what is the relation between MLPs and radial HyperBF with full or diagonal W (a HyperBF network with diagonal W has the same number of parameters as a MLP network with the same number of hidden units). All this implies that MLP networks are, for normalized inputs, more powerful than GRBF (and of course than RBF) networks with the same number of hidden units.
Notice that the property of being more powerful is not necessarily an advantage here, since the number of parameters is larger (the parameters to be learned are 1 per hidden unit in the case of RBF, d + 1 in the case of GRBF, and d + 2 in the case of MLP), and actual performance should be considered (see Maruyama et al., 1991, for an experimental comparison between different approximation schemes).

| Function | Scheme | Units | Parameters | L∞ | L₂ | L′∞ | L′₂ |
|---|---|---|---|---|---|---|---|
| F₁ | MLP | 2 | 8 | 0.00632 | 2.0×10⁻⁵ | 0.00853 | 2.2×10⁻⁵ |
| | Constrained MLP | 2 | 6 | 0.00558 | 3.3×10⁻⁵ | 0.00567 | 3.3×10⁻⁵ |
| | GRBF | 2 | 6 | 0.0111 | 9.3×10⁻⁵ | 0.0513 | 0.000212 |
| | Multiscale GRBF | 2-2 | 12 | 0.00724 | 5.2×10⁻⁵ | 0.0243 | 8.0×10⁻⁵ |
| F₂ | MLP | 3 | 12 | 0.00786 | 2.6×10⁻⁵ | 0.00895 | 5.3×10⁻⁵ |
| | Constrained MLP | 3 | 9 | 0.00764 | 5.7×10⁻⁵ | 0.00966 | 8.8×10⁻⁵ |
| | GRBF | 3 | 9 | 0.0122 | 1.00×10⁻⁴ | 0.0144 | 1.30×10⁻⁴ |
| | Multiscale GRBF | 2-2 | 12 | 0.00579 | 1.6×10⁻⁵ | 0.00740 | 2.4×10⁻⁵ |
| F₃ | MLP | 3 | 12 | 0.0106 | 5.4×10⁻⁵ | 0.0236 | 7.6×10⁻⁵ |
| | Constrained MLP | 3 | 9 | 0.0935 | 0.0102 | 0.158 | 0.0159 |
| | | 4 | 12 | 0.0268 | 0.000257 | 0.134 | 0.00105 |
| | GRBF | 3 | 9 | 0.0588 | 0.00323 | 0.0645 | 0.00484 |
| | | 4 | 12 | 0.0675 | 0.00351 | 0.0684 | 0.00503 |
| | Multiscale GRBF | 2-2 | 12 | 0.0122 | 7.5×10⁻⁵ | 0.00941 | 6.4×10⁻⁵ |
| | | 3-2 | 18 | 0.0157 | 9.9×10⁻⁵ | 0.140 | 9.2×10⁻⁶ |
| F₄ | MLP | 3 | 12 | 0.0130 | 0.000130 | 0.0153 | 0.000312 |
| | Constrained MLP | 3 | 9 | 0.340 | 0.182 | 0.392 | 0.231 |
| | | 4 | 12 | 0.181 | 0.0512 | 0.212 | 0.0689 |
| | GRBF | 3 | 9 | 0.325 | 0.167 | 0.365 | 0.209 |
| | | 4 | 12 | 0.122 | 0.0283 | 0.281 | 0.0391 |
| | Multiscale GRBF | 2-2 | 12 | 0.103 | 0.00725 | 0.157 | 0.0114 |
| | | 3-2 | 18 | 0.00877 | 9.0×10⁻⁵ | 0.00862 | 9.1×10⁻⁵ |
| F₅ | MLP | 3 | 12 | 0.00874 | 9.9×10⁻⁵ | 0.0579 | 0.000719 |
| | Constrained MLP | 3 | 9 | 0.0513 | 0.00307 | 0.151 | 0.00743 |
| | | 4 | 12 | 0.0375 | 0.00420 | 0.102 | 0.00810 |
| | GRBF | 3 | 9 | 0.200 | 0.0740 | 0.334 | 0.0917 |
| | | 4 | 12 | 0.0668 | 0.0127 | 0.0872 | 0.0169 |
| | Multiscale GRBF | 2-2 | 12 | 0.0260 | 0.00112 | 0.0775 | 0.00210 |
| | | 3-2 | 18 | 0.00821 | 9.3×10⁻⁵ | 0.0661 | 0.000689 |

Table 2: Training errors (L∞, L₂) and test errors (L′∞, L′₂), for the set of test functions F₁, ..., F₅.

Figure 10: Approximation of the function F₁ by a MLP with 2 hidden units: (plot 1) F₁ = x, (plot 2) basis 1, (plot 3) basis 2, (plot 4) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 4).

Figure 11: Approximation of the function F₂ by a MLP with 3 hidden units: (plot 1) F₂ = e^{−x²} cos(3πx/2), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 5).

Figure 12: Approximation of the function F₃ by a MLP with 3 hidden units: (plot 1) F₃ = e^{−x²} cos(5πx/2), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 6).

Figure 13: Approximation of the function F₄ by a MLP with 3 hidden units: (plot 1) F₄ = sin(3πx)/(3πx), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 6).

Figure 14: Approximation of the function F₅ by a MLP with 3 hidden units: (plot 1) F₅ = e^{−x} sin(5x), (plot 2) basis 1, (plot 3) basis 2, (plot 4) basis 3, (plot 5) basis 1 + basis 2, (plot 6) result. Notice the complete overlap between the original function (plot 1) and the MLP result (plot 6).

| | (C(θ₁, ‖w₁‖), ‖w₁‖) | (C(θ₂, ‖w₂‖), ‖w₂‖) | (C(θ₃, ‖w₃‖), ‖w₃‖) |
|---|---|---|---|
| F₁ | (0.0893, 2.03) | (0.0509, 2.14) | |
| F₂ | (7.67, 6.30) | (0.829, 0.492) | (1.4×10⁻⁵, 7.54) |
| F₃ | (11.4, 10.5) | (0.00371, 0.504) | (1.15×10⁻⁴, 4.17) |
| F₄ | (7.48, 8.01) | (1.60, 8.63) | (0.0271, 7.70) |
| F₅ | (1.25, 2.37) | (0.182, 6.63) | (1.2×10⁻⁵, 7.64) |

Table 3: The values of C(θ_α, ‖w_α‖) and ‖w_α‖ corresponding to the hidden units of the MLP network (α = 1, 2, 3).

How critical is the condition of normalized inputs for the above arguments? Mathematically there is no simple way of extending them to inputs that are not normalized. We have already mentioned the very recent results of Poggio and Girosi, which show that HyperBF and some set of ridge function approximation schemes are regularization networks corresponding to two different classes of stabilizers (and therefore different priors in the associated Bayesian interpretation).

In the context of the arguments of this paper, it is interesting to remark that normalized inputs have been used in several well-known applications, with good reported performances. NETtalk is one example. In NETtalk the input layer consists of 7 groups of units and each group contains 29 units (i.e. the number of units in the input layer is 203). Each group corresponds to a letter presented to NETtalk, and each unit represents a character of the alphabet (including period, blank, etc.). Therefore, when input letters are presented, only one unit among those of each group has the value 1 and the others have 0:

x = (0, ..., 1, ..., 0, ..., 0, ..., 1, ..., 0)
     (group 1)            (group 7)

Clearly, ||x||² = const = 7 always.

For normalized inputs it seems therefore that there is a strict relation, almost an equivalence, between the vector of the weights w in a MLP network and the vector of the centers in the associated GRBF network. This fact has several potentially interesting consequences. The first one is that it may be possible to efficiently initialize the weights of a MLP. In particular, if we have a sufficient number of units (the same as the data size), we can design an optimal initial network, utilizing the properties of RBF and of the sigmoid-Gaussian quasi-equivalence.
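For instance, the quasi-equivalence suggests the following initialization sketch (ours, not an algorithm from the paper; `scale` and `logC` are hypothetical knobs): place each hidden unit's weight vector along a unit-norm template t_α and pick the bias from eq. (21) so that C ≫ 1, making the unit an approximately Gaussian bump centered at t_α:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def mlp_init_from_centers(centers, scale, logC):
    """Hypothetical initializer: turn unit-norm centers t (templates, e.g.
    cluster centers of the data on the sphere) into initial MLP weights and
    biases via eq. (21): w = scale * t and theta = -(scale + logC), so that
    C(theta, ||w||) = exp(logC) >> 1 for logC > 0."""
    return [([scale * ti for ti in t], -(scale + logC)) for t in centers]

# sanity check: such a unit behaves like a Gaussian bump around its template.
t = [1.0, 0.0]
(w, theta), = mlp_init_from_centers([t], scale=4.0, logC=6.0)
unit = lambda x: sigmoid(sum(a * b for a, b in zip(x, w)) + theta)
# on the unit circle, the ratio to the peak value should follow
# exp(-lam * r^2) with lam = ||w||/2 = 2 and r^2 = ||x - t||^2 = 2 at (0, 1)
ratio = unit([0.0, 1.0]) / unit(t)
assert abs(ratio - math.exp(-4.0)) < 1e-3
```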
Several fast learning algorithms have been proposed for radial (basis) functions. They include Moody and Darken's method (Moody and Darken, 1989), based on the k-means algorithm, and Kohonen's LVQ. Our arguments imply that it should be possible to exploit these algorithms also for determining an initial structure of a MLP network.

The second point has to do with the common interpretation of the weights. T. Sejnowski (Sejnowski and Rosenberg, 1987) writes: "In NETtalk, the intermediate code was semidistributed - around 15% of the hidden units were used to represent each letter-to-sound correspondence. The vowels and the consonants were fairly well segregated, arguing for local coding at a gross population level (something seen in the brain) but distributed coding at the level of single units (also observed in the brain)." Our results seem to suggest that the "intermediate code" may often be a collection of appropriate "templates", in particular some of the examples.

References

[1] D. Braess. Nonlinear Approximation Theory. Springer-Verlag, Berlin, 1986.

[2] S.M. Carroll and B.W. Dickinson. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, pages I-607-I-611, Washington D.C., June 1989. IEEE TAB Neural Network Committee.

[3] G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Control Systems Signals, 2(4):303-314, 1989.

[4] D.L. Donoho and I.M. Johnstone. Projection-based approximation and a duality with kernel methods. The Annals of Statistics, 17(1):58-106, 1989.

[5] R.L. Eubank. Spline Smoothing and Nonparametric Regression, volume 90 of Statistics, textbooks and monographs. Marcel Dekker, Basel, 1988.

[6] J.H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817-823, 1981.

[7] K. Funahashi. On the approximate realization of continuous mappings by neural networks.
Neural Networks, 2:183-192, 1989.

[8] F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cybernetics, 63:169-176, 1990.

[9] R.P. Gosselin. Nonlinear approximation using rapidly increasing functions. Constr. Approx., 2:253-261, 1986.

[10] R. Hecht-Nielsen. Theory of the backpropagation neural network. In Proceedings of the International Joint Conference on Neural Networks, pages I-593-I-605, Washington D.C., June 1989. IEEE TAB Neural Network Committee.

[11] C.R. Hobby and J.R. Rice. Approximation from a curve of functions. Arch. Rat. Mech. Anal., 27:91-106, 1967.

[12] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.

[13] P.J. Huber. Projection pursuit. The Annals of Statistics, 13(2):435-475, 1985.

[14] C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approx., 2:11-22, 1986.

[15] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.

[16] T. Poggio and F. Girosi. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1989.

[17] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.

[18] T. Poggio and F. Girosi. HyperBF: A powerful approximation technique for learning. In Patrick H. Winston and Sarah A. Shellard, editors, Artificial Intelligence at MIT: Expanding Frontiers, Vol. 1. M.I.T. Press, Cambridge, MA, 1990.

[19] T. Poggio and F. Girosi. A theory of networks for learning. Science, 247:978-982, 1990.

[20] M.J.D. Powell. Radial basis functions for multivariable interpolation: a review. In J.C. Mason and M.G. Cox, editors, Algorithms for Approximation. Clarendon Press, Oxford, 1987.

[21] J.R. Rice.
The approximation of functions, Vol. 1. Addison-Wesley, Reading, MA, 1964.

[22] J.R. Rice. The approximation of functions, Vol. 2. Addison-Wesley, Reading, MA, 1969.

[23] L.L. Schumaker. Spline functions: basic theory. John Wiley and Sons, New York, 1981.

[24] T.J. Sejnowski and C.R. Rosenberg. Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168, 1987.

[25] M. Stinchcombe and H. White. Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions. In Proceedings of the International Joint Conference on Neural Networks, pages I-607-I-611, Washington D.C., June 1989. IEEE TAB Neural Network Committee.

[26] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.