t-SNE
The t-SNE method reduces the dimensionality of data and, in the process, keeps points close together if they are close in the high-dimensional space. This is a good way of finding clusters in high-dimensional data.
0) Motivation, Objectives and Related Works:
Motivation:
Points that are close to one another in the high-dimensional dataset will tend to be close to one another in the chart.
Such a chart often leads to better insights than basic summary statistics alone.
Objectives:
t-SNE (short for t-distributed stochastic neighbor embedding) is a machine learning technique for dimensionality reduction that helps identify relevant patterns.
Main advantage: it preserves local structure.
It produces good-looking visualizations of high-dimensional datasets in simple charts.
How it works:
t-SNE models the probability distribution of neighbors around each point. The term "neighbors" refers to the set of points that are closest to each point.
In the original, high-dimensional space this is modeled as a Gaussian distribution.
In the 2-dimensional output space, this is modeled as a t-distribution.
Goal: Find a mapping onto the 2-dimensional space that minimizes the differences between these two distributions over all points. The fatter tails of a t-distribution compared to a Gaussian help to spread the points more evenly in the 2-dimensional space.
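Concretely, following van der Maaten and Hinton (2008), the conditional affinities in the original space, the affinities in the embedding, and the objective can be written as:

p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}

Here the x_i are the high-dimensional points, the y_i their low-dimensional images, and \sigma_i a per-point Gaussian bandwidth set by the perplexity (defined below).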
The main parameter controlling the fitting is called perplexity.
Perplexity is roughly equivalent to the number of nearest neighbors considered when matching the original and fitted distributions for each point.
A low perplexity means we care about local scale and focus on the closest other points.
A high perplexity takes more of a "big picture" approach.
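For reference, van der Maaten and Hinton (2008) define perplexity through the Shannon entropy of each conditional distribution P_i, and each \sigma_i is chosen (by binary search) so that every P_i attains the user-specified perplexity:

\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}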
t-SNE only works with the data it is given. It does not produce a model that you can then apply to new data.
Variants:
t-SNE (van der Maaten and Hinton 2008) is a widely-used method of visualizing high-dimensional data in low dimensions. It is motivated by minimizing the Kullback-Leibler divergence between the distributions of pairwise affinities among observations in the high-dimensional and low-dimensional spaces.
Its predecessor, SNE (Hinton and Roweis 2002), uses a Gaussian kernel to transform the low-dimensional distances into affinities, while t-SNE uses a heavier-tailed t-distribution with one degree of freedom. As noted by van der Maaten and Hinton (2008), the heavier tails of the t-distribution compared to the Gaussian distribution help to alleviate the “crowding problem” so that distinct blobs appear in the low-dimensional embedding.
Since the heavier-tailed kernel of t-SNE gives better-separated clusters than SNE, Kobak et al. (2019) pushed this idea further: they used the fast Fourier transform (FFT)-accelerated interpolation-based t-SNE (FIt-SNE) approximation from Linderman et al. (2019) to implement a fast version of t-SNE with even heavier tails, i.e. smaller, fractional degrees of freedom.
t-SNE has also been shown to cluster well-separated data reliably in any embedding dimension (Linderman and Steinerberger 2019).
Implementation:
from sklearn.manifold import TSNE
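A minimal usage sketch (the digits dataset and the parameter values below are illustrative choices, not part of the original notes):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images serve as a small high-dimensional dataset.
X, y = load_digits(return_X_y=True)

# perplexity is the main tuning parameter; values around 5-50 are common.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)

# TSNE exposes no transform() for unseen data: fit_transform() embeds
# only the data it is given, consistent with the note above.
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()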
References:
StatQuest: t-SNE, Clearly Explained [Link]
TensorBoard Embedding Projector: https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin