A few notes on unsupervised learning
The goal is to find similarities between inputs. We cant do regression since we dont know the target but maybe in some cases we can do classification through clustering.
-
Supervised learning is driven by the error. But when we dont have a target we cannot calculate an error and drive the learning directly from it. Additionally, supervised learning is task specific. Meaning it learns by specific datasets which it tries to minimize the error.
-
Unsupervised learning is more broad in the sense that it is usually about correlating the input space with the weight space.
A very simple but popular approach: k-means
Motivation for k-means
-
Noise reduction: If the data are corrupted or have a lot noise, by finding the correct clusters we can identify the noise and remove it. Although be careful, since the mean average, which is a crucial part of k-means, is very sensitive to outliers. To tackle this instead of the mean, we can use the median, which is much more effective when dealing with noise, but also a lot more computationally heavy.
-
Data compression: Find cluster centres in order to reduce the number of datapoints that we send.
The algorithm:
Given a dataset we want to divide it into k categories / clusters. The goal is to adjust the value of each k centre so it represents best each cluster. The problem is we dont know where the clusters are (there are no labels) and we certainly do not know where their middle is. This is what we want to find out. Compute the mean point (μ) of each cluster and put the centre there.
-
Distance measure: Usually we use the Euclidean distance to measure distance between points.
-
Mean average: If we have the distance between two points then its easy to compute their central point which is the mean average, when we are in a Euclidean space. When the space isnt Euclidean different measures are needed.
Minimising the euclidean distance isnt that different than trying to minimise the sum-of-squares error then.
We randomly place in the space the cluster centres.
Until convergence:
For each datapoint:
Compute the distance to each cluster centre
Assign the datapoint to the closest cluster centre
For each cluster centre:
Calculate the mean of the points that have been assigned to that cluster
Move the centre to that point
The testing is utilizing only phase 1
Some notes on k-means
-
K-means is very susceptible to local minima. It really depends where the centres are initialized. To tackle this, we randomly initialize the centres in different positions and run the algorithm multiple times. To overcome the problem of overfitting (having a centre for each datapoint) a validation set can be used.
-
K-means and Neural Nets: The cluster centres could represent the neurons of a network in the weight space, where we try to optimise how well each neuron represents the data.
Competitive learning
We can then train a neural network with k-means! Have only one layer of competitive neurons with no bias node that will try to compete on which one will be activated. Only one of the neurons will finally activate which will be the one with the highest activation. This method is called winner-takes-all activation.
On-line k-means Neural Net
First, let’s think about something. We cannot use the actual k-means algorithm to update the weights since we often feed the network with just one input (on-line) instead of the whole dataset and thus we cannot compute the mean.
An abstract approach:
- Choose k neurons and fully connect the inputs to the neurons
- Normalise the inputs so that their absolute size is the same
- Use a linear transfer function (simply the product of the weights and inputs)
Vector quantisation
When we compress data we need to accept that we might have to lose a bit of accuracy. When we have a unique vector that we want to send but want to compress it, we will send the a close representation of it, meaning not all of the datapoints will be able to be reconstructed (some information will be lost). This is called vector quantisation and is what happens with lossy compression.
In order to minimise then the loss of compression we need to find the best prototype vector which is as close as possible to all of our possible inputs.
Self Organising Feature Maps (SOM)
The self organising feature map algorithm is the most popular competitive learning algorithm. It demonstrates relative ordering preservations, also known as topology preservation. But this isn’t always possible preserve this topology since for a higher dimensionality than 2D, the order cannot be preserved.
-
Feature mapping: the relative position of the neurons in the network is crucial since nearby neurons will correlate to similar input patterns.
-
Neuron grid (not layer!): All neurons are arranged in a grid and the connections between them follow the structure of the grid (instead of layers with connections between the layers)
-
Lateral connections: Connections between neurons within a layer.
As we said, we want that when one neuron fires, its neighboring neurons to be affected. A winning neuron then needs to pull other neurons that are close it, closer to itself in the weight space (positive connections). But also, to repel neurons that are further away (negative connections) and just ignore neurons that are already further away since they represent different features. This leads to the Mexican Hat form of lateral connections.
- Neighbourhood connections: Start with a large neighbourhood size but reduce it while learning in order to fine tune the parameters. First get a rough ordering and then optimise local regions of the network. The learning rate and the size of the neighbourhood are slowly decreases over each iteration.
The algorithm:
Phase 1: Initialisation
-
Choose k neurons and d dimensions for the map
-
Initialise randomly the weights so that they all have different values
OR
- Set the weight values to increase in the direction of the first d principal components of the dataset (use PCA). Find the two largest directions of variation in the data and initialise the weights so that they increase along these two directions. This will take care of the ordering part of the algorithm before it even begins. If we use this method, then the size of the neighbourhood can be small from the beginning. This method also requires to have the whole dataset at hand (batch mode) and will not work on-line.
Phase 2: Learning
For each datapoint
-
Select the neuron which is closest (minimum euclidean distance between the weights and the input)
-
Update the weight vector of the best-matching neuron
-
Update the weight vector of all other neurons
-
Reduce the learning rate and adjust the neighbourhood function
The testing is utilizing only phase 1
3 steps of initialization:
-
Competition: mapping a continuous input space onto a discrete space of units
-
Cooperation: lateral interaction through topographic neighborhood
-
Weight adaptation
General notes
-
Topological ordering: First high learning rate and big neighborhood size to find the approximate global ordering
-
Convergence: Then small learning rate and small neighborhood size to fine tune local ordering
-
Dead units problem: undesirable units that never get updated in the learning process with a given dataset. In competitive learning if there is a case where some neurons are further away from the data, then there will be specific neurons that get update all the time and others (the ones further away) that are not getting updated as often. This will lead them to constantly have their values reduced, which will then in return make them less likely to win.
-
Fixes to the dead units problem:
-
Add noise to data
-
Initialize weight values to data samples
-
Soft Learning / Leaky Learning
-
Learning with conscience
-
Boundaries: In some cases we might want to remove the boundaries conditions. This can be done by tying the ends together (1D line -> Circle, 2D rectangle -> torus). The calculations then usually depend on modulo calculations.
-
The size of the SOM needs to be defined beforehand and it will be this (the number of neurons) which will define how precise (fine-grained) the learning is. The correct size is found with trial and error.
Too small -> underfitting / too simple / general. Too big -> overfitting
- By using local interactions only, we end up with a global ordering (think of flock of birds)
Sources:
-
Lectures 5-6 “Competitive learning, self-organising maps” Slides from Artificial Neural Networks and Deep Architectures DD2437 @ KTH Spring 2019
-
Marsland, Stephen. Machine learning: an algorithmic perspective. Chapman and Hall/CRC, 2011.