0) Motivation, Objective and Related Works:
Motivation:
Clustering is among the most fundamental tasks in machine learning and artificial intelligence.
Objectives: VaDE
A novel unsupervised generative clustering approach within the framework of the Variational Auto-Encoder (VAE).
Specifically, VaDE models the data generative procedure with a Gaussian Mixture Model (GMM) and a deep neural network (DNN):
The GMM picks a cluster;
From which a latent embedding is generated;
Then the DNN decodes the latent embedding into an observable.
Inference in VaDE is done in a variational way: a different DNN is used to encode observables to latent embeddings, so that the evidence lower bound (ELBO) can be optimized using the Stochastic Gradient Variational Bayes (SGVB) estimator and the reparameterization trick.
Outperforms the state-of-the-art clustering methods on 5 benchmarks from various modalities.
An Overview of the VaDE architecture
Overview:
VaDE models the data generative procedure with a Gaussian Mixture Model and a deep neural network by the following steps:
The Gaussian Mixture Model picks a cluster
From this picked cluster a latent embedding is generated
The DNN decodes the latent embedding into observables
The inference step in VaDE is done using a variational method, in which a different DNN is used to encode observables to latent embeddings, so that the evidence lower bound (ELBO) can be optimized using the Stochastic Gradient Variational Bayes (SGVB) estimator and the reparameterization trick.
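As a concrete illustration, the following is a minimal Python sketch (not the authors' implementation) of the generative procedure above; pi, mu, and sigma2 stand for the GMM parameters, and decode is a hypothetical stand-in for the decoder DNN.

import numpy as np

def generate_sample(pi, mu, sigma2, decode, rng=None):
    """Draw one observation following the VaDE generative story."""
    if rng is None:
        rng = np.random.default_rng()
    # 1. The GMM picks a cluster c ~ Cat(pi).
    c = rng.choice(len(pi), p=pi)
    # 2. A latent embedding is generated from that cluster's Gaussian,
    #    z ~ N(mu_c, diag(sigma2_c)).
    z = rng.normal(mu[c], np.sqrt(sigma2[c]))
    # 3. The decoder DNN maps the latent embedding to an observable x.
    return decode(z)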
Motivation behind this work:
Learn good representations that capture the statistical structure of the data, and be capable of generating samples.
We can leverage VaDE to generate the face of a person based on certain features that we want in the generated sample.
A sample application of VaDE in which features of faces are combined to generate a new unique sample [2]
Generative models are capable of producing novel samples once sufficiently trained; however, they typically reveal little about the statistical structure of the data. This is where VaDE shines: it combines a generative architecture with the ability to cluster data points.
Background:
Gaussian Mixture Model
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. [3]
GMMs perform soft clustering, meaning each data point is assigned a probability of belonging to each of the component distributions rather than a single hard label.
Each Gaussian k in the mixture is described by the following parameters:
A mean μ that defines its center.
A covariance Σ that defines its width; in the multivariate case this corresponds to the shape of an ellipsoid.
A mixing probability π that defines the weight of the component, i.e., how much of the data it accounts for.
We can graphically display these parameters as shown below
From the three Gaussian functions we can see that K = 3, and each Gaussian explains the data contained in one of the clusters.
To derive the Gaussian Mixture Model we need to find the probability that a data point x comes from Gaussian k:
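In standard GMM notation (consistent with the parameters listed above), the mixture density and this per-component posterior, often called the responsibility, are:

p(x) = \sum_{k=1}^{K} \pi_{k}\, \mathcal{N}(x \mid \mu_{k}, \Sigma_{k}), \qquad p(k \mid x) = \frac{\pi_{k}\, \mathcal{N}(x \mid \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K} \pi_{j}\, \mathcal{N}(x \mid \mu_{j}, \Sigma_{j})}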
Latent Variables
A latent variable model describes the probability distribution of the observed data in terms of hidden (latent) variables.
Inference is the inverse of generation and vice versa
Prior distribution p(z): models the behavior of the latent variables.
Likelihood p(x|z): defines how to map latent variables to data points.
Joint distribution p(x,z) = p(x|z)p(z): the product of the likelihood and the prior, which fully describes the model.
Marginal distribution p(x): the distribution of the original data; it tells us how likely it is to generate a given data point.
Posterior distribution p(z|x): describes the distribution over the latent variables given a specific data point.
To generate a data point, we can sample z from p(z) and then sample the data point x from p(x|z).
Conversely, to infer a latent variable, we can take a data point x drawn from p(x) and then sample the latent variable z from p(z|x).
This leads us to the question of how we can find all of these distributions over the latent variables. To answer it, we turn to Bayes' rule, which tells us that each of these distributions can be built as a combination of the others.
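Written out, Bayes' rule relates the four distributions as:

p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}, \qquad p(x) = \int p(x \mid z)\, p(z)\, dz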
Variational Autoencoder
The variational autoencoder network has the same structure as an autoencoder network but uses the concept of a latent space, i.e., a vector of latent variables, to encode information in the model. To train the latent variables we use Maximum Likelihood Estimation (MLE). MLE is a technique for estimating the parameters of a probability distribution such that the distribution fits the observed data. [6] The MLE objective can be described mathematically as:
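With a dataset of N observations, the objective can be written in standard notation as:

\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}\!\left(x^{(i)}\right)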
The MLE for the marginal distribution pθ(x) cannot be solved analytically; however, we can repose the problem and solve it with gradient descent. Once solved, we obtain model parameters θ that model the desired probability distribution. In order to apply gradient descent, we need to calculate the gradient of the marginal log-likelihood function, which we can derive using calculus and Bayes' rule.
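One standard identity makes the role of the posterior explicit by writing this gradient as an expectation over pθ(z∣x):

\nabla_{\theta} \log p_{\theta}(x) = \mathbb{E}_{p_{\theta}(z \mid x)}\!\left[\nabla_{\theta} \log p_{\theta}(x, z)\right]

Evaluating this expectation requires the posterior pθ(z∣x), which motivates the next step.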
Now we need to solve for the posterior distribution, in other words the inference portion of our model. Since the posterior pθ(z∣x) is intractable, we must use variational inference to approximate it: we introduce another distribution qϕ(z∣x), called the variational posterior, to approximate the true posterior.
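In standard notation (a sketch consistent with the definitions above), the resulting bound on the marginal log-likelihood is the ELBO:

\log p_{\theta}(x) = \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] + \mathrm{KL}\!\left(q_{\phi}(z \mid x)\,\|\,p_{\theta}(z \mid x)\right) \;\ge\; \mathbb{E}_{q_{\phi}(z \mid x)}\!\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] = \mathcal{L}_{\mathrm{ELBO}}

Maximizing the ELBO with respect to θ and ϕ therefore both fits the data and drives qϕ(z∣x) toward the true posterior.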