Motivation for Multiple Layers
In two points:
- Escape the bound of linear separability
- Much (much much…) more effective in practice
Demotivation: The Universal Approximation Theorem
It has been mathematically shown that one hidden layer with an arbitrarily large number of hidden nodes is sufficient to solve any problem. So, in theory, we never really need more than two layers (one hidden, one output), because we can approximate any smooth functional mapping using a linear combination of localised sigmoidal functions. Of course, in practice this rarely helps, since an “arbitrarily large number of nodes” is usually an infeasible number. Thus, we rely on the depth of networks instead of their width. More information on why this works can be found in this nice but challenging read: Why does deep and cheap learning work so well?
Motivation for the Back-propagation algorithm
The forward phase of an MLP is the same as in the normal Perceptron, so not much changes in this respect. Going backwards, though, is tricky. We know the error in the output layer and which node in the output layer was wrong, but what about the hidden layers? Which weights in the hidden layers were wrong, and by how much? What error do we propagate back?
This is where the infamous Back-propagation of the error comes into play. The goal is to send the error backwards through the network and update the weights based on the final predictions.
The error function for Perceptron was simply the sum of errors of each neuron
\(\sum_{i=1}^{N} (y_i - t_i)\)
The problem with this error function is that if we have one positive and one negative error, their sum will be 0. To get around this, all errors need to have the same sign, and the most effective way is using sum-of-squares error:
\(\frac{1}{2}\sum_{i=1}^{N} (y_i - t_i)^2\)
Differentiating a function gives us its gradient, which tells us the direction in which the function increases and decreases the most. When we differentiate the error function, we can therefore tell in which direction the error decreases and increases. Since we want to minimise the error, we follow the path down to the lowest point (gradient descent). The only things that can change during training are the weights, so this is what we differentiate with respect to.
Another problem then arises. The activation function used in the Perceptron (the threshold / step function) is discontinuous, meaning it cannot be differentiated. So we need an activation function that still has a threshold controlling whether the neuron fires or not, but that can also be differentiated. For this reason, we use a family of functions called sigmoids, which also have nice derivative forms. Another popular sigmoid-like function is tanh, with the difference that its output lies in [-1, 1] instead of [0, 1]. This way, whenever we change the weights, we always do it in the direction where the error decreases.
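As a small illustration (a minimal NumPy sketch, not part of the original notes), the logistic sigmoid and tanh together with their derivatives, which is what makes gradient-based weight updates possible:

```python
import numpy as np

def sigmoid(h):
    """Logistic sigmoid: squashes the net input h into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-h))

def sigmoid_prime(h):
    """Derivative of the sigmoid, conveniently expressed via its own output."""
    s = sigmoid(h)
    return s * (1.0 - s)

def tanh_prime(h):
    """Derivative of tanh, whose output lies in (-1, 1)."""
    return 1.0 - np.tanh(h) ** 2
```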
High-level approach to Backprop
- Forward: An input vector is passed into the input nodes and fed forward through the network. The inputs and the first-layer weights are used to decide whether the hidden nodes fire or not. The outputs of the hidden neurons and the second-layer weights are used to decide whether the output neurons fire or not.
- Error: The error is computed as the sum-of-squares difference between the network outputs and the targets.
- Backward: This error is then fed backwards through the network and the weights are updated based on its gradient (see the sketch after this list):
- Update the output-layer weights based on the predictions and the inputs to the output layer
- Update the hidden-layer weights based on the inputs and the hidden outputs
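A minimal sketch of one such forward/backward pass for a single-hidden-layer MLP with sigmoid units and sum-of-squares error (biases are left out, and the array shapes and learning rate \(\eta\) are my own assumptions, purely for illustration):

```python
import numpy as np

def train_step(x, t, W1, W2, eta=0.1):
    """One forward + backward pass for a one-hidden-layer MLP with sigmoid units."""
    sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))

    # Forward: inputs and first-layer weights give the hidden activations,
    # hidden activations and second-layer weights give the outputs.
    hidden = sigmoid(x @ W1)                  # shape (n_hidden,)
    y = sigmoid(hidden @ W2)                  # shape (n_out,)

    # Error at the output layer: derivative of 1/2 * sum (y - t)^2 through the sigmoid.
    delta_out = (y - t) * y * (1.0 - y)

    # Backward: propagate the error to the hidden layer through the second-layer weights.
    delta_hidden = (delta_out @ W2.T) * hidden * (1.0 - hidden)

    # Update output-layer weights from the hidden outputs,
    # and hidden-layer weights from the inputs.
    W2 -= eta * np.outer(hidden, delta_out)
    W1 -= eta * np.outer(x, delta_hidden)
    return y, W1, W2
```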
Perceptron vs Delta Rule vs Generalized Delta Rule
- PRC is for thresholded units (computes the error of the output from a step function) whereas DR is for linear units as well (using the linear combination of the inputs)
- PRC is guaranteed to converge to a solution if the data are linearly separable
- DR will converge towards a solution in the limit. It also doesn’t have the limitation of linearly separable data, but it doesn’t guarantee an exact solution
- In MLP, PRC doesn’t work because it requires knowing in which direction the weights should be changed to come closer to the solution (i.e. knowing the inputs and targets of each layer)
- In MLP, DR doesn’t work because it requires knowing the error before we threshold, which is only possible in the last layer
Thus, in MLP we use the Generalized DR. Choose a cost / error function \(\epsilon = \sum (t-y)^2\) and then minimise it using steepest descent by calculating its gradient with respect to the weights, \(\frac{\partial\epsilon}{\partial w_{ij}}\).
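Written out for a sigmoidal output unit \(j\) with the sum-of-squares error above (the factor of ½ cancels the 2 coming from the square), the chain rule gives

\[
\frac{\partial \epsilon}{\partial w_{ij}} = \frac{\partial \epsilon}{\partial y_j}\,\frac{\partial y_j}{\partial h_j}\,\frac{\partial h_j}{\partial w_{ij}} = (y_j - t_j)\, g'(h_j)\, x_i,
\]

where \(h_j = \sum_i w_{ij} x_i\) is the net input to node \(j\) and \(g\) is the activation function. The steepest-descent update then moves against this gradient: \(\Delta w_{ij} = -\eta\,(y_j - t_j)\, g'(h_j)\, x_i\).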
Initializing weights
The initial values should be small, because the sigmoid saturates when its output gets close to 0 or 1; we want to start somewhere in between, where the gradient is most active (the top of the bell-shaped derivative). We should also take into consideration that a neuron’s activation receives input through all of its incoming weights, so a common heuristic is to scale the initial range by the fan-in, e.g. drawing weights uniformly from \([-1/\sqrt{n}, 1/\sqrt{n}]\), where \(n\) is the number of inputs to the node.
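A small sketch of that heuristic (the uniform range is an assumption on my part; other fan-in-based schemes work just as well):

```python
import numpy as np

def init_weights(n_in, n_out, rng=None):
    """Small uniform weights scaled by the fan-in, so the net input to each node
    starts near 0, where the sigmoid's gradient is largest."""
    rng = rng or np.random.default_rng()
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```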
Approaches to different problems
MLP is designed for batch learning: present all of the training examples, compute the average sum-of-squares error, and use this to update the weights. This means that the weights are changed only once per iteration. It also means that the weights move in the direction that most of the inputs want them to move, rather than being pulled around by each input individually. The batch method gives a more accurate estimate of the error gradient and thus converges to a local minimum more quickly; it makes a better estimate of the steepest-descent direction. On the other hand, sequential (per-example) updates are less likely to get stuck in local minima.
- Selecting the number of inputs: It depends on the number of features/dimensions and on whether a feature/dimension has more than one value.
- Selecting the number of outputs: If it’s regression and we want to predict just the next value, one node is enough. If it’s classification, it can be either a one-hot encoding (one node per class) or a binary representation (with n nodes we can represent 2^n classes).
- For binary classification: the sigmoid function in the output node works fine, since its output lies between 0 and 1 and can simply be thresholded.
- For classification problems with multiple classes: The output neurons work as a one-hot encoding (1-of-N output encoding). For this, the most commonly used function is soft-max. It rescales the outputs by calculating the exponential of the input to each neuron and dividing by the sum of the exponentials of the inputs to all of the neurons, so that the activations sum to 1 and all lie between 0 and 1 (see the sketch after this list). Hard-max would be the same without the scaling. Another, less practical and less popular, way would be to use linear output nodes with specific thresholds on the activation function.
- For regression problems: we need a continuous range. One way is to replace only the output nodes with linear nodes that just sum the weighted input, i.e. y = g(h) = h.
- For sequences: We can translate sequences so that multiple sensors correspond to a sequence that can then be classified (a multivariate time series). Then the number of inputs of the NN is: number of dimensions/features × size of the sequence. For example, to classify a sequence of length 5 (5 readings per day, used to classify the mood of the day) based on 3 features (heart rate, blood pressure, body temperature), the number of inputs would be 3 × 5 = 15.
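A minimal soft-max sketch (NumPy; subtracting the maximum is a standard numerical-stability trick, not something the notes prescribe):

```python
import numpy as np

def softmax(h):
    """Rescale the net inputs h so all outputs lie in (0, 1) and sum to 1."""
    e = np.exp(h - np.max(h))      # subtract the max for numerical stability
    return e / np.sum(e)

softmax(np.array([2.0, 1.0, 0.1]))  # -> approx. [0.66, 0.24, 0.10]
```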
Local Minima
The whole power of the neural network lies in the learning rule, which drives the network error to a minimum value through gradient descent. By following the slope of the error downhill, the plan is to reach its bottom. But the error surface may have multiple slopes and thus multiple local minima, so the network is only guaranteed to reach a local minimum. Which minimum we end up in depends on where we start in the error space: start close to the global minimum and you will end up there; start close to a local one and it is going to be difficult to find the global one.
Problems of backprop!
- Can get stuck in local minima
- Vanishing (or exploding) gradients
- Slow convergence
- Many parameters to tune
- Bad scaling for large problems
Optimization!
This is where the most fun is! Using a simple MLP approach with none of the methods below can be very unstable, so many different steps and combinations of these approaches can be implemented to lead to better and faster training.
Mini batches
Split the training set into random batches and estimate the descent direction based on one of them at each step. By shuffling the data every time, we hope that the optimisation has a chance to escape local minima thanks to the variety of directions suggested by each mini-batch.
Stochastic Gradient Descent
Use just one piece of the data at each iteration of the algorithm to calculate the gradient, and pick this piece uniformly at random from the training set. It is often used when the training set is very large and it is expensive to use the whole dataset to estimate the gradient.
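A sketch covering both ideas (mini-batches and pure SGD), written for a linear model with sum-of-squares error so the gradient is explicit; the same loop applies to an MLP, with the gradient coming from back-propagation. The batch size and learning rate are assumed values for illustration:

```python
import numpy as np

def sgd_epoch(X, T, w, eta=0.01, batch_size=1, rng=None):
    """One epoch of (mini-batch) stochastic gradient descent.
    batch_size=1 gives plain SGD; batch_size=len(X) recovers full-batch descent."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(X))                 # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        y = X[idx] @ w                              # predictions for this mini-batch
        grad = X[idx].T @ (y - T[idx]) / len(idx)   # gradient of the mean squared error
        w = w - eta * grad
    return w
```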
Momentum
One way to reduce the risk of getting stuck in small local minima is to use the same idea as momentum in physics, when a ball rolls down a hill. This way, the updates will on average follow the correct direction without being held back by small local minima. With momentum, it is usually possible to use a smaller learning rate, which makes the learning more stable. The addition to the learning rule is \(+\,\alpha \Delta w_{t-1}\), i.e. the momentum \(\alpha\) (typically 0.9) multiplied by the value of the previous update. Momentum can also be viewed as a variable learning rate.
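A sketch of the update (the function signature is my own; only the \(\alpha \Delta w_{t-1}\) term comes from the notes):

```python
def momentum_update(w, grad, prev_update, eta=0.01, alpha=0.9):
    """Gradient step with momentum: keep a fraction alpha of the previous update
    so the weights keep 'rolling' through small local minima."""
    update = -eta * grad + alpha * prev_update
    return w + update, update
```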
Learning rate decay
Start with a large learning rate so that large-scale changes are made at the beginning, when the weights are random, and decay it over time so that later updates only fine-tune the solution.
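One common schedule, as a sketch (inverse-time decay; the exact schedule is a design choice, not prescribed by the notes):

```python
def decayed_learning_rate(eta0, epoch, decay=0.01):
    """Inverse-time decay: large steps early on, finer steps later."""
    return eta0 / (1.0 + decay * epoch)
```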
Cyclical Learning rates
Instead of using an ensemble method to train multiple models, you can get multiple instances of one model by using different learning rates during training. You start with a high learning rate, decrease it to \(\eta_{min,1}\), then increase it again to \(\eta_{max,1}\), then start decreasing it again towards \(\eta_{min,2}\), and so on. The weights of the model at each time the learning rate reaches an \(\eta_{min}\) can correspond to one model.
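One way to realise such a schedule, as a sketch (a cosine schedule with restarts; the triangular schedule from Smith’s cyclical-learning-rate paper is another option, and the cycle length and bounds here are assumed values):

```python
import numpy as np

def cyclical_lr(step, eta_max=0.1, eta_min=0.001, cycle_len=2000):
    """Cosine schedule with restarts: each cycle starts at eta_max, anneals down
    to eta_min, then jumps back up. The weights at each eta_min can be kept as
    one 'snapshot' model, as described above."""
    t = step % cycle_len                     # position within the current cycle
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / cycle_len))
```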
Regularisation
An important concept to prevent overfitting and control gradients. Penalise learning by adding a penalty term to the error function. This leads to a smoother solution. Prior knowledge can be embedded in the type of regularisation. Moreover, small weights are better, since they lead to a network that is closer to linear (being close to zero, they are in the region where the sigmoid increases roughly linearly), and only those weights that are essential to the non-linear learning should be large. Thus, the goal is to favour small weights. The most popular approach is weight decay with an L1 (lasso regression) or L2 (ridge regression) penalty.
Another option is to add random noise to the training examples so the model learns a more general representation.
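For the weight-decay approach above, a sketch of L2 decay folded into the gradient step (the penalty strength lam is an assumed hyperparameter):

```python
def weight_decay_step(w, grad, eta=0.01, lam=1e-4):
    """Gradient step with an L2 (ridge) penalty: the extra lam * w term pulls
    every weight towards zero, favouring small, near-linear solutions."""
    return w - eta * (grad + lam * w)
```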
Early Stopping
We start training the network and then, after each epoch, we use the validation set to estimate how well the model is generalising. When the validation error starts to increase, it means the network is starting to overfit, and it is time to stop training.
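A sketch of the idea (the `train_epoch` and `val_error` callbacks are placeholders, and the “patience” window that tolerates a few noisy epochs is my own addition):

```python
def train_with_early_stopping(train_epoch, val_error, max_epochs=1000, patience=5):
    """Stop once the validation error has not improved for `patience` epochs."""
    best, epochs_since_best = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()                       # one pass over the training set
        err = val_error()                   # generalisation estimate on the validation set
        if err < best:
            best, epochs_since_best = err, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                       # validation error keeps rising: overfitting
    return best
```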
Dropout
During training, turn off each neuron with probability 1 − p (i.e. keep it with probability p). Then, during testing, we need to compensate by scaling the activations by p.
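A sketch of that convention (keep-probability p with scaling at test time; “inverted” dropout, which scales during training instead, is the other common variant):

```python
import numpy as np

def dropout_forward(activations, p=0.5, training=True, rng=None):
    """Keep each neuron with probability p during training; at test time keep
    everything but scale by p so the expected activation matches training."""
    if training:
        rng = rng or np.random.default_rng()
        mask = rng.random(activations.shape) < p    # True = neuron stays on
        return activations * mask
    return activations * p
```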
The Bayesian Framework
Bayesian Inference: Using a prior distribution over the parameters of the model, we calculate a posterior distribution which captures the evidence for these parameters. Based on the prior distribution of the weights, the likelihood distribution, and the evidence of the model, we calculate a posterior distribution of the weights using Maximum A Posteriori estimation for the weights. The model with the highest evidence then has a good balance between maximum likelihood and low complexity (satisfying the regularising term).
High-level approach
- Initialize model parameters alpha and beta
- Initialize weights based on prior distributions
- Train NN with gradient descent to minimise M(w)
- Update parameters alpha and beta every few iterations
But why?
- Intuitive interpretation of regularization
- No need for validation data for parameter selection
- Computationally effective to find high number of parameters
- Confidence intervals for predictions
- Compare different models only through training data
- Rules for network growing / pruning
- Provide objective framework to deal with complexity issues
Ensemble Methods
The idea is to combine many weak learners, each of which gets slightly different results on the dataset, so that putting them together gives significantly better results than any of them achieves independently. How you combine them is crucial to getting good results. One simple way is majority voting.
- Simple averaging
- Generalized committee with weighted combinations
- AdaBoost
Bagging (Bootstrap AGGregatING): Combining classifiers
Bootstrapping is a sampling technique where a sample is drawn uniformly and with replacement from the dataset, meaning we may get a sample multiple times and miss some other ones. We create multiple datasets, each one using bootstrapping, which have the same size as the original dataset and then train one model per dataset. The benefit of doing that is to create multiple learners that will perform slightly differently. The bagging method then is to just feed the same input to each classifier and get the output based on majority voting. When using bagging for regression, it is common to take the median of the outputs, since the mean is heavily affected by outliers. This method is called bragging (robust bagging).
Subagging is another variant of bagging where, instead of creating new datasets with the same size as the original one using the bootstrapping technique, we make smaller datasets (usually half the size) by subsampling without replacement.
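A sketch of the two sampling ingredients and the voting step (assuming non-negative integer class labels; the helper functions are my own, not from the sources):

```python
import numpy as np

def bootstrap_indices(n, rng=None):
    """Bagging: draw n indices uniformly *with* replacement, so some points
    appear several times and others are left out."""
    rng = rng or np.random.default_rng()
    return rng.integers(0, n, size=n)

def subsample_indices(n, rng=None):
    """Subagging: draw n/2 indices *without* replacement."""
    rng = rng or np.random.default_rng()
    return rng.choice(n, size=n // 2, replace=False)

def majority_vote(predictions):
    """Combine predictions (rows = models, columns = samples) by majority vote."""
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])
```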
Boosting
Take a collection of many very poor learners (each just a bit better than chance) and put them together in order to create one strong learner. The principal boosting algorithm is AdaBoost. The main idea this algorithm is based on is to give specific weights to datapoints depending on how difficult previous classifiers found it to get them correct. These weights are then combined with the input during training and fed to the next classifier.
High-level approach
- Initially, all weights have the same value 1/N (where N = #datapoints)
- At each iteration (sketched in code after this list):
  - compute the error ε as the sum of the weights of the misclassified points
  - update the weights of the incorrect examples only, by multiplying them with (1 − ε) / ε
  - normalise the whole set so it sums to 1 (this way the importance of correctly classified datapoints is reduced)
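A sketch of that weight update (the function and example values are my own, for illustration):

```python
import numpy as np

def adaboost_weight_update(weights, misclassified):
    """One AdaBoost round, following the steps above: the error is the total
    weight of the misclassified points; their weights are scaled by
    (1 - error) / error and the whole set is renormalised to sum to 1."""
    weights = np.asarray(weights, dtype=float).copy()
    wrong = np.asarray(misclassified, dtype=bool)
    error = weights[wrong].sum()
    weights[wrong] *= (1.0 - error) / error
    return weights / weights.sum()

# Usage: start from uniform weights 1/N and update after each weak learner.
w = np.full(5, 1 / 5)
w = adaboost_weight_update(w, [False, True, False, False, True])
```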
Comparison Bagging vs Boosting
- Bagging ensures different classifiers see different data
- Boosting gives different importance to specific datapoints
- Bagging takes a majority vote
- Boosting takes a weighted vote
Static approaches that do not account for the input: Ensemble averaging + bagging + Boosting
Approaches dependent on the input: Mixture of experts + Hierarchical mixtures
Don’t forget!
- Due to the randomness of the weight initialisation, there always needs to be a statistical analysis (e.g. over several runs) to establish the performance of the model.
- Watch out for imbalanced datasets! In classification, if 90% of the training data belongs to one class, then a network that only ever outputs that class will be correct 90% of the time. To tackle this, you may need to discard data from the over-represented class. Or, in binary classification, you can use novelty detection, meaning train only with the under-represented class (negative) and then classify whatever doesn’t look like it as the other class (positive).
Sources
- The Bayesian Framework https://authors.library.caltech.edu/13793/1/MACnc92b.pdf
- Lecture 1-4 “Multi-layer perceptrons: learning and generalisation”, slides from Artificial Neural Networks and Deep Architectures DD2437 @ KTH, Spring 2019
- Marsland, Stephen. Machine Learning: An Algorithmic Perspective. Chapman and Hall/CRC, 2011.
- Ruder, Sebastian. “An overview of gradient descent optimization algorithms.” arXiv preprint arXiv:1609.04747 (2016).
- Smith, Leslie N. “Cyclical learning rates for training neural networks.” 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017.