Motivation for Hopfield Networks
The Hebbian rule, the rule that motivated ideas about how brain neurons learn, broadly states that neurons that fire together, wire together, which implies a symmetry. Common feed-forward neural networks (MLPs), however, have a directed flow of information: the value of neuron B depends on the value of neuron A, but not the other way around, so the connection between them is asymmetric. Hopfield networks are based exactly on this Hebbian principle by having symmetric weights. This feature turns out to be exceptionally useful for the task of building an associative memory, a.k.a. content-addressable memory.
Inside a Hopfield network
This type of memory works by learning a set of patterns; when a similar pattern is later presented, it reproduces the stored pattern that resembles it the most. This is useful not only for recognizing already seen objects but also for completing or correcting missing or corrupted patterns, known as pattern completion and pattern denoising respectively.
How it works
- A Hopfield network consists of a set of simple McCulloch and Pitts neurons that are fully connected with each other through symmetric weights.
- The diagonal of the weight matrix, which would represent a neuron's connection to itself, is set to zero because we do not want self-connections.
- The neurons are binary (bistable), outputting either -1 or 1. Using -1 and 1 instead of 0 and 1 is more convenient because the product of two activations directly indicates whether the weight should be increased (positive sign when the neurons behave the same way) or decreased (negative sign when they behave in opposite ways).
- There can be any number of neurons, but they form a grid rather than a layer, since every neuron is connected to every other neuron. The weight matrix is therefore square, with w_ij = w_ji.
Training (Hebbian Learning rule)
w_ij = (1/N) ∑_n s_i(n) s_j(n), where the sum runs over the N patterns to be stored
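As a concrete illustration, here is a minimal NumPy sketch of this training step (the function name train_hopfield and the one-pattern-per-row layout are my own choices, not taken from the sources): the weight matrix is built from the sum of outer products of the stored ±1 patterns, scaled by 1/N, with the diagonal zeroed to remove self-connections.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: w_ij = (1/N) * sum_n s_i(n) s_j(n).

    patterns: array of shape (N, d) with entries in {-1, +1};
    N is the number of stored patterns, d the number of neurons.
    """
    N, d = patterns.shape
    # Sum of outer products over all patterns, scaled by 1/N.
    W = patterns.T @ patterns / N
    # No self-connections: zero the diagonal.
    np.fill_diagonal(W, 0.0)
    return W  # symmetric by construction: W[i, j] == W[j, i]
```

The resulting matrix is square and symmetric, matching the structural points listed above.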
Recall (Finding an attractor)
- Synchronous Update: The first way is to force all neurons to update their values simultaneously with a matrix multiplication at each time step.
Iterate until s(t) = s(t-1): s_i(t) = sign(∑_j w_ij s_j(t-1))
- Asynchronous Update: The second way is to update the neurons one at a time, each update depending on the current values of all the other neurons. The neurons can be picked randomly or in a fixed sequence, so many iterations are usually needed before the network settles into a steady state.
s_i = sign(∑_j w_ij s_j)
We can either use the sign function, which acts like a step function (returning 1 if its input is greater than 0 and -1 otherwise), or we can use a bias node with a constant value (+1 or -1) and treat it as in a simple MLP. In the training rule above, N is the number of patterns to learn and s_i(n) is the activation of neuron i for input pattern n; using 1/N (which takes the place of the learning rate) keeps the maximum size of the weights independent of N. Both update schemes are sketched in the code below.
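Here is a minimal sketch of both recall schemes, under the same conventions as the training sketch above (the function names and stopping criteria are illustrative choices, not taken from the sources).

```python
import numpy as np

def recall_sync(W, state, max_steps=100):
    """Synchronous recall: all neurons are updated at once in each step."""
    s = state.copy()
    for _ in range(max_steps):
        s_new = np.sign(W @ s)
        s_new[s_new == 0] = -1           # sign convention: zero net input maps to -1
        if np.array_equal(s_new, s):     # s(t) == s(t-1): stable state reached
            break
        s = s_new
    return s

def recall_async(W, state, max_sweeps=100, seed=0):
    """Asynchronous recall: neurons are updated one at a time, in random order."""
    rng = np.random.default_rng(seed)
    s = state.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):
            s_i = 1 if W[i] @ s > 0 else -1   # sign of the weighted input to neuron i
            if s_i != s[i]:
                s[i], changed = s_i, True
        if not changed:                  # a full sweep with no flips: converged
            break
    return s
```

For example, after W = train_hopfield(patterns), calling recall_async(W, noisy_pattern) on a corrupted copy of a stored pattern should return the clean version, provided the cue falls within that pattern's basin of attraction.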
Important notes!
- Asynchronous update always converges to a local minimum.
- Synchronous update might not be able to converge (it can end up oscillating between states instead of settling).
- Attractors: Given an input, if the network is able to stabilize (the states of the neurons stop changing), then this state (pattern) is an attractor of the network. The inverse of a pattern is also an attractor.
- Energy Function: The idea of the energy of a system (the network in our case) is taken from physics, with the concept that a system with low energy is a stable system. Similarly, in a network, if many neurons disagree with what the weights connecting them suggest, the energy is high, indicating a significant mismatch between the current state of the network and the stored patterns. When the network is stable, the energy is at a minimum, so the attractors can be seen as local minima of the energy function, E = -1/2 ∑_ij w_ij s_i s_j. The basin of attraction is the hollow area around an attractor; a memory cue that falls inside it makes the network converge to that attractor. In discrete Hopfield networks, the energy space is discrete. The energy function is especially useful when comparing the stability of models and their rate of convergence (a sketch of this computation follows this list).
- Capacity: The more patterns we try to store in the model, the more local minima we create. It turns out (after some math) that the capacity of a Hopfield network is about N ≈ 0.138d patterns, where d is the number of neurons in the network. (Small note: this number does not count the inverses of the stored patterns.) For example, a network of 100 neurons can store roughly 13-14 patterns. However, this is the maximum capacity; in practice fewer patterns can be stored, because the bound assumes the patterns are orthogonal, which is not always the case. Intuitively, if we try to store patterns that look alike (e.g. different faces), we are not utilizing the whole weight space and it becomes harder for the network to distinguish between the different patterns.
- Sparse and orthogonal inputs promote larger capacity.
- Continuous Hopfield Networks: To let Hopfield networks handle continuous values (e.g. grayscale images), we need to change the activation function so that it outputs values between -1 and 1. The most commonly used function for this job in Hopfield networks is the hyperbolic tangent. The original Boltzmann machines are very similar in spirit.
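To make the energy-function note above concrete, here is a minimal sketch of the (bias-free) energy computation, using the same ±1 state vector and weight matrix conventions as the earlier sketches.

```python
def energy(W, s):
    """Hopfield energy E = -1/2 * sum_ij w_ij s_i s_j (no bias term).

    W is the (NumPy) weight matrix and s a +/-1 state vector.
    Each asynchronous neuron flip can only keep or lower this value,
    which is why asynchronous recall settles into a local minimum.
    """
    return -0.5 * float(s @ W @ s)
```

Comparing energy(W, s) before and after recall, or across different networks, gives the kind of stability comparison mentioned above.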
Common problems
- Corruption of individual bits
- Lack of encoded memory / very small basin of attraction
- Spurious attractors: Unwanted attractors (patterns) that were never explicitly stored in the weight matrix but are nonetheless stable, i.e. local minima that we do not want. They usually have higher energy and smaller basins of attraction than the stored patterns.
Sources:
- Lecture 8, “Hopfield networks and introduction to stochastic networks”, slides from Artificial Neural Networks and Deep Architectures (DD2437) @ KTH, Spring 2019
- Marsland, Stephen. Machine Learning: An Algorithmic Perspective. Chapman and Hall/CRC, 2011.
- A step-by-step analytical example (beware the yellow!): http://web.cs.ucla.edu/~rosen/161/notes/hopfield.html