The Loss Surface of Multilayer Networks

Kushal Shah
7 min read · Mar 13, 2024


Artificial Neural Networks [ANNs] have become essential ingredients of most modern Artificial Intelligence algorithms. What we call Deep Learning and Artificial Intelligence nowadays is basically about taking a suitable Neural Network architecture and training it on a large amount of suitable data so that it can make good predictions. While this process has led to remarkable results, we understand very little about the training process itself. The culprit is the non-convexity of the loss surface of a typical ANN (also called a Feedforward Network or Multilayer Network)! The good part is that, surprisingly, this ignorance does not hamper our ability to use ANNs for making predictions.

[Figure source: https://ics.uci.edu/~xhx/courses/CS206/]

Both Linear and Logistic Regression are “nice” algorithms in the sense that they lead to convex loss functions, which can be easily minimised using gradient descent. Linear Regression is even nicer and provides a closed-form solution for the free parameters through a matrix formulation (a short sketch of this is shown below). But in the case of ANNs, we get a non-convex loss function which can have multiple local minima, which may or may not be equivalent. So when we do gradient descent, we are likely to get stuck in one of the local minima and never reach the global minimum! The question is: why are ANNs still so popular?
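To make that “niceness” concrete, here is a minimal sketch of the closed-form least-squares solution for Linear Regression. The synthetic data and the use of NumPy’s least-squares routine are my own choices for illustration.

import numpy as np

# Synthetic data for illustration: y = 3x + 2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0*x + 2.0 + 0.1*rng.normal(size=100)

# Design matrix with a bias column; the normal-equation solution
# w = (X^T X)^{-1} X^T y is computed here via least squares for stability
X = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("slope, intercept:", w)  # close to (3, 2)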

Parameter Interaction in ANNs

To see why the ANN loss function is non-convex, consider the simple ANN above with a single input node, a single hidden node, and a single output node. The activation function of the hidden node is taken to be linear, so the input-output relationship is a simple linear relation: the output is the product of the two weights times the input (w2·w1·x in the code below). Even this simple ANN leads to a non-convex loss surface, since the output depends only on the product of the two free parameters, and so one could choose one of the values essentially arbitrarily as long as the product stays the same. This interaction between parameters in an ANN with even a single hidden layer leads to non-convexity.
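As a quick illustration of this interaction, a minimal sketch (the numbers are made up): two very different weight settings with the same product produce exactly the same outputs, and hence the same loss.

import numpy as np

x = np.linspace(0, 1, 5)  # a few arbitrary input values

# Two different (w1, w2) pairs with the same product w1*w2 = 10
out_a = 2.0*(5.0*x)    # w1 = 5.0, w2 = 2.0
out_b = 100.0*(0.1*x)  # w1 = 0.1, w2 = 100.0

print(np.allclose(out_a, out_b))  # True: identical outputs, hence identical loss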

Of course, if we remove the hidden layer, then the network reduces to the case of a single neuron, which leads to the convex loss surface of Logistic Regression. So the point is that all we need is a single hidden layer to cause the loss function to become non-convex. And if this hidden layer has a nonlinear activation function, the overall loss function can become non-convex in a variety of ways.

To see that the above simple neural network has a non-convex loss function, we can write a short piece of Python code and plot the loss function in the vicinity of the origin.

# IMPORT THE REQUIRED PACKAGES

import numpy as np
from random import random
import plotly.graph_objs as go

# INITIALISE THE INPUT, OUTPUT VARIABLES
# CREATE A GRID OF THE PARAMETER VALUES

x_list = [0.01*i for i in range(100)]
y_list = [100.0*x + 100.0*random() for x in x_list]

w1_list = np.array([-5 + i*0.1 for i in range(100)])
w2_list = w1_list.copy()

# COMPUTE THE LOSS SURFACE VALUE ON THE PARAMETER VALUE GRID

loss_matrix = np.zeros((len(w1_list), len(w2_list)))
for i in range(len(w1_list)):
    w1 = w1_list[i]

    for j in range(len(w2_list)):
        w2 = w2_list[j]

        # Sum of squared errors for the model y_hat = w2*w1*x
        loss = 0.0
        for k in range(len(x_list)):
            x = x_list[k]
            y = y_list[k]
            loss += (y - w2*w1*x)**2

        # Average over the data points (mean squared error)
        loss_matrix[i, j] = loss/len(x_list)

# PLOT THE SURFACE USING PLOTLY

surface = go.Surface(x=w1_list, y=w2_list, z=loss_matrix)

layout = go.Layout(
    scene=dict(
        xaxis=dict(title='w1'),
        yaxis=dict(title='w2'),
        zaxis=dict(title='loss'),
    )
)

fig = go.Figure(data=[surface], layout=layout)
fig.show()

The output of this code is shown in the figure below, and we can clearly see that there is a saddle point at the origin. A saddle point is a point where the first derivative of the loss function with respect to every weight parameter is zero, but which is neither a local minimum nor a local maximum. In the graph below, we can see that the loss function has a zero first derivative in both directions at the origin, but it curves upwards in one direction and downwards in the other, which is what makes it a saddle point. A convex function cannot have a saddle point. So we now have a clear demonstration that even a simple neural network with a single hidden node can have a non-convex loss surface.
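The saddle point can also be verified analytically. For this model the loss is L(w1, w2) = mean over k of (y_k − w1·w2·x_k)². At the origin both pure second derivatives vanish, while the mixed second derivative is −2·mean(x·y), so the Hessian there is [[0, c], [c, 0]] with eigenvalues ±|c|: one positive and one negative, which is exactly a saddle. A short check, regenerating the same kind of data as in the code above:

import numpy as np
from random import random

# Same form of synthetic data as in the code above
x = np.array([0.01*i for i in range(100)])
y = 100.0*x + 100.0*np.array([random() for _ in x])

# Hessian of L(w1, w2) = mean((y - w1*w2*x)**2) evaluated at (0, 0):
# d2L/dw1^2 = d2L/dw2^2 = 0 there, and d2L/dw1dw2 = -2*mean(x*y)
c = -2.0*np.mean(x*y)
H = np.array([[0.0, c], [c, 0.0]])
print(np.linalg.eigvalsh(H))  # one negative and one positive eigenvalue -> saddle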

Symmetry of the parameters

Another source of non-convexity in ANNs is symmetry between the parameters. If we take a dense ANN, i.e. a network in which every node of a layer is connected to every node of the previous layer, then swapping the parameters corresponding to suitably chosen edges (equivalently, permuting the hidden nodes within a layer) makes no difference to the final calculation, thanks to the symmetry of the dense ANN. Hence, if a certain combination of parameters is known to lead to a local minimum, there will be many other combinations, obtained simply by suitable permutations, that also lead to local minima, as the sketch below shows.
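A small sketch of this permutation symmetry (the network size, weights, and ReLU activation here are arbitrary choices for illustration): permuting the hidden units, together with the matching rows of the first weight matrix and columns of the second, leaves the network output unchanged.

import numpy as np

rng = np.random.default_rng(0)

# A dense network with 3 inputs, 4 hidden units (ReLU), and 1 output
W1 = rng.normal(size=(4, 3))   # hidden x input
b1 = rng.normal(size=4)
W2 = rng.normal(size=(1, 4))   # output x hidden

def forward(x, W1, b1, W2):
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    return W2 @ h

x = rng.normal(size=3)
perm = [2, 0, 3, 1]  # an arbitrary reordering of the hidden units

# Permute the hidden units consistently in both layers
out_original = forward(x, W1, b1, W2)
out_permuted = forward(x, W1[perm], b1[perm], W2[:, perm])
print(np.allclose(out_original, out_permuted))  # True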

In this case, one can argue that all of these local minima are equivalent and give the same loss value as the global minimum. While this may be true in some cases, we cannot say that it holds for all ANNs. In general, the shape of the loss surface of multilayer ANNs is not well understood and is still an active area of research, as very well explained in the 2015 paper "The Loss Surfaces of Multilayer Networks" co-authored by Yann LeCun.

What happens if we don’t reach global minima?

Conventional wisdom says that if we have an optimization problem, then we must reach its global minimum, since that is the best possible solution to the given problem. However, the situation is quite different in the case of Deep Learning, since reaching the global minimum of the training loss can lead to overfitting, which in turn leads to poor generalisation. Several regularisation techniques in Machine Learning are designed specifically to prevent the algorithm from reaching this global minimum in order to avoid overfitting.
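As one concrete example, L2 regularisation (weight decay) simply adds a penalty on the size of the weights to the data loss, so the minimiser of the regularised objective is no longer the global minimum of the training loss alone. A minimal sketch (the function and its arguments are my own illustration, not tied to any particular library):

import numpy as np

def l2_regularised_loss(w, X, y, lam=0.01):
    # Mean squared error on the data plus an L2 (weight decay) penalty;
    # lam controls how strongly large weights are discouraged
    preds = X @ w
    mse = np.mean((y - preds)**2)
    return mse + lam*np.sum(w**2)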

For this reason, we don't really care whether our optimization algorithm has reached the global minimum or not. In fact, we would like to avoid reaching that global minimum because of overfitting concerns. What we really need to care about is whether both our training and testing accuracies are good enough.

Is the loss function shape irrelevant?

Not really! Even though we do not care to find the global minimum during the ANN training process, the actual shape of the loss function does impact model performance.

As it turns out, if we take a shallow ANN, its loss function has several local minima which can have widely different performance on the train and test sets. This is also why ANNs earned a bad reputation in their early days in the 1980s and 1990s: shallow networks had very poor prediction properties, and researchers did not have the data or computational resources to train larger networks.

As the availability of both data and computational power increased, it was realised that larger ANNs may still have multiple local minima, but the performance of each of these local minima on the test set is usually very similar. Also, these local minima are relatively easy to find using methods like Stochastic Gradient Descent [SGD]. In other words, even though these larger ANNs have numerous local minima, this does not bother ML researchers, since it is usually enough to reach any one of them without worrying about how many other local minima there are or how far they are from the global minimum. Remember, these are all empirical results, and we still do not have a strong theoretical understanding of the ANN loss surface.
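To make the SGD picture concrete, here is a minimal sketch (my own toy implementation of the same w2·w1·x model, with a smaller synthetic dataset so the step size stays simple) of stochastic gradient descent settling into one of the equivalent minima of that non-convex loss:

import numpy as np

rng = np.random.default_rng(0)

# A small toy dataset: y is roughly 4*x, so any (w1, w2) with w1*w2 close to 4
# is a good solution for the model y_hat = w2*w1*x
x_data = rng.uniform(0, 1, size=200)
y_data = 4.0*x_data + 0.1*rng.normal(size=200)

w1, w2 = 0.5, 0.5   # arbitrary starting point (not the saddle at the origin)
lr = 0.05

for epoch in range(20):
    for k in rng.permutation(len(x_data)):  # one sample at a time, in random order
        x, y = x_data[k], y_data[k]
        residual = y - w2*w1*x
        grad_w1 = -2.0*residual*w2*x
        grad_w2 = -2.0*residual*w1*x
        w1 -= lr*grad_w1
        w2 -= lr*grad_w2

# w1*w2 ends up close to 4; how the product is split between w1 and w2
# depends on the initialisation
print(w1, w2, w1*w2)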

Is the ANN training process a trivial affair?

The previous section can give the impression that training a large ANN is a trivial affair, since it is easy to reach a local minimum and we do not care to find the global minimum. However, it is important to note that these local minima can also lead to overfitting, especially if the network is large. Hence, one has to be careful when training such networks on a given dataset.

Also, while Stochastic Gradient Descent [SGD] is in general very powerful for finding local minima during the ANN training process, the standard version can have several limitations for certain kinds of datasets and networks. To address these issues, ML researchers have proposed numerous variations like RMSprop, Adam, etc., which have been found to be very useful. More on this in another blog!

Summary

Training ANNs is a complex affair due to the non-convexity of their loss function, but for large enough networks, SGD and its variants have generally been found to converge to one of the numerous local minima with reasonably good performance. Unlike “nice” ML algorithms like Linear or Logistic Regression, in the case of ANNs we do not care to reach the global minimum, since doing so can lead to overfitting.
