1) Overview:
Supervised vs Unsupervised Learning
Taxonomy of Generative Models
Fully Visible Belief Network (FVBN)
Variational Autoencoders (VAE)
2) Details:
Supervised Learning
Data: (x, y) - x is data, y is label
Goal: Learn a function to map x -> y
Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.
Unsupervised Learning
Data: x - Just data, no labels!
Goal: Learn some underlying hidden structure of the data
Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.
Generative Modeling
Given training data, generate new samples from same distribution.
Formulate as density estimation problems:
Explicit density estimation: explicitly define and solve for pmodel(x).
Implicit density estimation: learn a model that can sample from pmodel(x) without explicitly defining it.
Taxonomy of Generative Models
Fully Visible Belief Network (FVBN)
An explicit density model.
Use the chain rule to decompose the likelihood of an image x into a product of 1-d distributions.
Then, maximize the likelihood of the training data.
The distribution over pixel values is complex ==> express it using a neural network.
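Concretely, for pixels ordered x_1, ..., x_n (e.g., in raster-scan order), the chain rule gives:
p(x) = p(x_1) * p(x_2 | x_1) * ... * p(x_n | x_1, ..., x_{n-1})
Each factor is a 1-d distribution over a single pixel value, and the neural network parameterizes these complex conditionals.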
Some explanations of FVBNs: [Link]
What distinguishes FVBNs from other explicit density models is that they have a tractable density: that is, you can exactly calculate the probability of samples from your dataset (or at least, that is the assumption made by the model). This is what is meant by "fully visible".
Pixel RNN vs Pixel CNN
Two different approaches to generative modeling of images at the pixel level; both aim to generate realistic images pixel by pixel.
Pixel RNN uses recurrent connections and generates pixels sequentially.
Pixel CNN uses masked convolutional layers to model the same dependencies; training is parallelizable (all conditionals are computed from a training image in one pass), but generation is still sequential. (See the masked-convolution sketch after the pros and cons below.)
Pros:
Can explicitly compute likelihood p(x).
Easy to optimize.
Good samples.
Con:
Sequential generation => slow.
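A minimal sketch of the masked convolution that enforces this pixel ordering in a PixelCNN-style model, assuming PyTorch (the class name and sizes are illustrative, not from the original notes):

import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    # Convolution whose kernel is masked so each output position depends
    # only on pixels above it and to its left (raster-scan order).
    # Mask type 'A' (first layer) also hides the current pixel; 'B' allows it.
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ('A', 'B')
        self.register_buffer('mask', torch.ones_like(self.weight))
        _, _, h, w = self.weight.shape
        self.mask[:, :, h // 2, w // 2 + (mask_type == 'B'):] = 0
        self.mask[:, :, h // 2 + 1:, :] = 0

    def forward(self, x):
        self.weight.data *= self.mask  # zero out connections to "future" pixels
        return super().forward(x)

# Example: first layer of a tiny PixelCNN over grayscale images.
# All positions are computed in one pass at training time; at sampling
# time the network must still be run once per generated pixel.
layer = MaskedConv2d('A', 1, 64, kernel_size=7, padding=3)
out = layer(torch.rand(8, 1, 28, 28))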
Improving PixelCNN performance (Salimans et al. 2017 - PixelCNN++)
Gated convolutional layers
Short-cut connections
Discretized logistic loss (sketched after this list)
Multi-scale
Training tricks
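A simplified sketch of the discretized logistic idea (single component only; PixelCNN++ itself uses a mixture of logistics plus edge-case handling at intensities 0 and 255, and the function name here is ours):

import torch

def discretized_logistic_logprob(x, mu, log_scale):
    # Log-probability of a pixel x (rescaled to [-1, 1], so adjacent
    # intensity levels are 2/255 apart) under a logistic distribution
    # discretized to 256 bins: P(x) = CDF(x + 1/255) - CDF(x - 1/255).
    inv_s = torch.exp(-log_scale)
    cdf_plus = torch.sigmoid(inv_s * (x + 1.0 / 255 - mu))
    cdf_minus = torch.sigmoid(inv_s * (x - 1.0 / 255 - mu))
    return torch.log(torch.clamp(cdf_plus - cdf_minus, min=1e-12))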
Variational Autoencoders (VAE)
Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.
z is usually smaller than x (dimensionality reduction); we want the features to capture meaningful factors of variation in the data.
How do we learn z? Train so that the features can be used to reconstruct the original data ("autoencoding": encoding the input itself).
After training, throw away the decoder and keep the encoder as a feature extractor.
This transfers from a large, unlabeled dataset to a small, labeled dataset (e.g., as initialization for a supervised classifier). A minimal training sketch follows.
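A minimal sketch of this training setup, assuming PyTorch (layer sizes and the 784-dim flattened input are illustrative, not from the original notes):

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)            # a batch of flattened images (placeholder data)
z = encoder(x)                     # z is lower-dimensional than x
x_hat = decoder(z)
loss = ((x - x_hat) ** 2).mean()   # L2 reconstruction loss: no labels needed
opt.zero_grad()
loss.backward()
opt.step()
# After training: discard the decoder and use encoder(x) as features.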
How do we make autoencoder a generative model?
Probabilistic spin on autoencoders - will let us sample from the model to generate data!
z is a vector of latent factors used to generate x: attributes, orientation, etc.
We want to estimate the true parameters θ* of this generative model given training data x.
How should we represent the model?
Choose the prior p(z) to be simple, e.g. Gaussian. Reasonable for latent attributes, e.g. pose, how much smile.
The conditional p(x|z) is complex (it generates an image) => represent it with a neural network.
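A sketch of the generation process under these choices (PyTorch assumed; the single linear layer stands in for the real decoder network):

import torch
import torch.nn as nn

decoder = nn.Linear(32, 784)      # stand-in for the p(x|z) network
z = torch.randn(16, 32)           # sample z from the simple prior p(z) = N(0, I)
x = torch.sigmoid(decoder(z))     # decoder maps latent factors to image pixels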
How to train the model
Variational Autoencoders (VAEs) define an intractable density function with latent z:
pθ(x) = ∫ pθ(z) pθ(x|z) dz
Intractability: this integral over all z cannot be computed directly, and the posterior pθ(z|x) = pθ(x|z) pθ(z) / pθ(x) is intractable as well. The fix (Kingma and Welling, 2014) is to introduce an encoder network qφ(z|x) that approximates the posterior, and to maximize a tractable variational lower bound (the ELBO) on the data likelihood.
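A sketch of the resulting training step, using the standard reparameterized ELBO from Kingma and Welling (2014); the network names and sizes here are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 32)   # outputs mean and log-variance of q(z|x)
dec = nn.Linear(32, 784)       # stand-in for the p(x|z) decoder network

x = torch.rand(64, 784)                                  # placeholder batch
mu, logvar = enc(x).chunk(2, dim=1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
x_hat = torch.sigmoid(dec(z))

recon = F.binary_cross_entropy(x_hat, x, reduction='sum')   # E[log p(x|z)] term
kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())  # KL(q(z|x) || p(z))
loss = recon + kl    # negative ELBO: minimizing it maximizes the bound on log p(x)
loss.backward()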
References:
van den Oord et al., "Conditional Image Generation with PixelCNN Decoders", NIPS 2016.
Salimans et al., "PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications", ICLR 2017.
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014.