Related Reading
- Deep Learning Book - Chapter 14 - Autoencoders
- Autoencoders
- Tutorial on Variational Autoencoders (local copy)
Some words about latent variables
This section is copied from the Tutorial on Variational Autoencoders. It nicely explains the concept of a latent variable:
When training a generative model, the more complicated the dependencies between the dimensions, the more difficult the models are to train. Take, for example, the problem of generating images of handwritten characters. Say for simplicity that we only care about modeling the digits 0-9. If the left half of the character contains the left half of a 5, then the right half cannot contain the left half of a 0, or the character will very clearly not look like any real digit. Intuitively, it helps if the model first decides which character to generate before it assigns a value to any specific pixel. This kind of decision is formally called a latent variable. That is, before our model draws anything, it first randomly samples a digit value z from the set [0, ..., 9], and then makes sure all the strokes match that character. z is called ‘latent’ because given just a character produced by the model, we don’t necessarily know which settings of the latent variables generated the character. We would need to infer it using something like computer vision.
Autoencoder
The idea of an autoencoder is to use a neural network to learn a representation of the input data. To achieve this goal we could make the output layer the same as the input layer and train the network to reproduce its input. However, we don't want an exact copy of the input data, otherwise we would just learn an identity function. What we want is a network structure that captures the useful properties of the input data and has the following additional properties:
- sparsity of the representation
- smallness of the derivatives of the representation
- robustness to noise or to missing inputs
Some traditional uses of autoencoders are
- dimensionality reduction
- feature learning
The tradeoff is between a low-dimensional representation and a small reconstruction error. Regularization techniques are usually needed in an autoencoder network.

Fig: typical autoencoder network structure
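As a concrete illustration, here is a minimal sketch of such a network in PyTorch. The layer sizes and the flattened 784-dimensional input (e.g. MNIST) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=32):
        super().__init__()
        # Encoder f(x): compress the input into a low-dimensional code h.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, hidden_dim))
        # Decoder g(h): reconstruct the input from the code.
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

model = Autoencoder()
x = torch.rand(16, 784)                  # a toy batch of inputs
x_hat, h = model(x)
loss = F.mse_loss(x_hat, x)              # reconstruction error L(x, g(f(x)))
loss.backward()
```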
Sparse Autoencoder
A sparse autoencoder is simply an autoencoder that has a regularization term on the hidden layer. The loss function is
\[
L(x, g(f(x))) + \Omega(h),
\]
where \(x\) is the input data, \(f(x)\) represents the encoder, \(g(h)\) represents the decoder, \(h = f(x)\) is the hidden layer, and \(\Omega(h)\) is a sparsity penalty such as \(\lambda \sum_i |h_i|\).
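Continuing the autoencoder sketch above, a sparsity penalty could be added to the reconstruction loss as follows; the L1 penalty and the weight are illustrative assumptions.

```python
# Sparse autoencoder sketch: penalize the hidden code h with an L1 term Omega(h).
# Reuses `model`, `x`, and `F` from the autoencoder sketch above.
lam = 1e-3                               # assumed penalty weight (lambda)
x_hat, h = model(x)
loss = F.mse_loss(x_hat, x) + lam * h.abs().sum(dim=1).mean()
loss.backward()
```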
Denoising Autoencoder
If we add some random noise to the input data before feeding it into the autoencoder, and train the network to reconstruct the original (uncorrupted) input, then we get a denoising autoencoder. The idea is to make the network resistant to perturbations of the input by introducing noise into the training data.

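A corresponding sketch, again reusing the autoencoder above; the Gaussian corruption and its scale are assumptions.

```python
# Denoising autoencoder sketch: corrupt the input, but reconstruct the clean target.
noise_std = 0.1                          # assumed noise level
x_noisy = x + noise_std * torch.randn_like(x)
x_hat, _ = model(x_noisy)
loss = F.mse_loss(x_hat, x)              # compare against the clean input x
loss.backward()
```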
Contractive Autoencoder
Similar to the sparse autoencoder, we have a regularization term on the hidden layer. More specifically, in a contractive autoencoder we have
\[
\Omega(h) = \lambda \left\lVert \frac{\partial f(x)}{\partial x} \right\rVert_F^2,
\]
the squared Frobenius norm of the Jacobian of the encoder, which encourages small derivatives of the representation with respect to the input.
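A deliberately naive sketch of this penalty, reusing the autoencoder above; looping over the hidden units is for clarity, not efficiency.

```python
# Contractive autoencoder sketch: penalize the squared Frobenius norm of the
# encoder Jacobian dh/dx so the code changes little under small input perturbations.
x = torch.rand(16, 784, requires_grad=True)
x_hat, h = model(x)

jac_penalty = 0.0
for j in range(h.shape[1]):
    # Gradient of the j-th hidden unit w.r.t. the inputs, kept in the graph.
    grad_j = torch.autograd.grad(h[:, j].sum(), x, create_graph=True)[0]
    jac_penalty = jac_penalty + (grad_j ** 2).sum()

lam = 1e-4                               # assumed penalty weight
loss = F.mse_loss(x_hat, x) + lam * jac_penalty
loss.backward()
```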
Variational Autoencoder
A variational autoencoder (VAE) is a generative stochastic network. It learns the probability distribution of the training data, which can in turn be used to generate data similar to the input data.
First, we write the probability of an input data point \(X\) by marginalizing over the latent variable:
\[
P(X) = \int P(X|z; \theta)\, P(z)\, dz, \tag{1}
\]
where \(z\) is the latent variable and \(\theta\) denotes the model parameters. The prior distribution of the latent variable is assumed to be a standard normal distribution, \(P(z) = \mathcal{N}(0, I)\).
\(P(X|z;\theta)\) represents the modeling part. In a standard setup, this conditional probability is assumed to be a normal distribution as well. More specifically, we assume
\[
P(X|z; \theta) = \mathcal{N}\big(X \,\big|\, f(z; \theta), \sigma^2 I\big).
\]
In a VAE, \(f(z; \theta)\) is represented by a neural network. \(P(X|z; \theta)\) is the probability of the input data \(X\) conditional on the latent variable \(z\). It basically answers the question: given a latent variable \(z\), what input data are we likely to get? This plays a role similar to the decoder in an autoencoder network; the difference is that here the output is a probability distribution instead of a single value.
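A sketch of this decoder-like component; the latent and data dimensions, layer sizes, and the sigmoid output (pixel means in [0, 1]) are assumptions.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 2, 784

# f(z; theta): maps a latent z to the mean of the Gaussian P(X|z; theta).
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                        nn.Linear(128, data_dim), nn.Sigmoid())

z = torch.randn(1, latent_dim)           # z ~ N(0, I)
x_mean = decoder(z)                      # f(z; theta)
```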
As usual, Eq (1) is used to construct a likelihood function that we want to maximize. Here comes the core equation of the VAE. We can transform Eq (1) into the following equation so that the calculation becomes tractable:
\[
\log P(X) - \mathcal{D}[Q(z)||P(z|X)] = E_{z\sim Q}[\log P(X|z)] - \mathcal{D}[Q(z)||P(z)], \tag{2}
\]
where \(Q\) is an arbitrary probability distribution and \(\mathcal{D}\) is the KL divergence.
The next question is obviously how to choose/construct the distribution \(Q\). It should meet the following conditions:
- It should be easy to calculate and ideally it should have a closed formula.
- It should be easy to calculate \(\mathcal{D}[Q(z)||P(z)]\).
- It should be close enough to the true conditional probability \(P(z|X)\). It's natural to make \(Q(z)\) depend on \(X\).
In a VAE, we assume
\[
Q(z|X) = \mathcal{N}\big(z \,\big|\, \mu(X), \Sigma(X)\big).
\]
In a VAE, \(\mu(X)\) and \(\Sigma(X)\) are implemented as neural networks, and \(\Sigma(X)\) is usually constrained to be a diagonal matrix. Note that \(Q(z|X)\) behaves like an encoder because it provides a description of the latent variables based on the input data \(X\).
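A matching encoder sketch, continuing the decoder sketch above; the network outputs \(\mu(X)\) and the log of the diagonal of \(\Sigma(X)\).

```python
# Encoder sketch for Q(z|X) = N(mu(X), Sigma(X)) with diagonal Sigma.
# Reuses `latent_dim` and `data_dim` from the decoder sketch above.
encoder = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU())
fc_mu = nn.Linear(128, latent_dim)       # mu(X)
fc_logvar = nn.Linear(128, latent_dim)   # log of diag(Sigma(X))

x = torch.rand(1, data_dim)
hidden = encoder(x)
mu, logvar = fc_mu(hidden), fc_logvar(hidden)
```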
With \(Q\) defined, let's examine the terms in Eq (2).
- \(\mathcal{D}[Q(z)||P(z|X)]\) is not tractable, but we know it is non-negative. Therefore, the right-hand side of Eq (2) gives a lower bound of \(\log P(X)\), and we can ignore the term \(\mathcal{D}[Q(z)||P(z|X)]\).
- \(E_{z\sim{}Q}[\log P(X|z)]\) is tricky because it requires sampling the latent variable \(z\). However, since training already uses stochastic or mini-batch gradient descent, we can approximate the expectation with a single sample per data point, i.e. \(E_{z\sim{}Q}[\log P(X|z; \theta)] \approx \log P(X|z; \theta)\) with \(z \sim Q(z|X)\).
- \(\mathcal{D}[Q(z|X)||P(z)]\) can be calculated with a closed formula because the two distributions involved are both normal (see the formula below).
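For reference, with \(Q(z|X) = \mathcal{N}(\mu(X), \Sigma(X))\), \(P(z) = \mathcal{N}(0, I)\), and \(k\) the dimension of \(z\), the closed formula is
\[
\mathcal{D}\big[\mathcal{N}(\mu(X), \Sigma(X)) \,||\, \mathcal{N}(0, I)\big]
= \frac{1}{2}\Big(\mathrm{tr}\big(\Sigma(X)\big) + \mu(X)^T\mu(X) - k - \log\det\Sigma(X)\Big).
\]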
Putting everything together, we have the following network structure for training:

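A rough sketch of one such training step, reusing the encoder and decoder sketches above. The sampling \(z \sim Q(z|X)\) is done with the reparameterization \(z = \mu(X) + \Sigma^{1/2}(X)\,\epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\), so that gradients can flow through the sampling step; the learning rate, toy batch, and squared-error reconstruction term are assumptions.

```python
# One VAE training step (sketch). Reuses `encoder`, `fc_mu`, `fc_logvar`,
# `decoder`, and `data_dim` from the sketches above.
params = (list(encoder.parameters()) + list(fc_mu.parameters())
          + list(fc_logvar.parameters()) + list(decoder.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.rand(16, data_dim)             # a toy minibatch

hidden = encoder(x)
mu, logvar = fc_mu(hidden), fc_logvar(hidden)

# Reparameterization trick: z = mu + sigma * eps keeps the sampling differentiable.
eps = torch.randn_like(mu)
z = mu + (0.5 * logvar).exp() * eps

x_hat = decoder(z)
recon = ((x_hat - x) ** 2).sum()                                 # stands in for -E[log P(X|z)]
kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar)    # D[Q(z|X) || P(z)]
loss = recon + kl                        # negative of the lower bound on log P(X)

opt.zero_grad()
loss.backward()
opt.step()
```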
If we want to generate data similar to the input data, we only need to draw a sample from the standard normal distribution and feed it into the decoder:

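A generation sketch, reusing the decoder and `latent_dim` from above:

```python
# Generate a new sample: draw z from the prior N(0, I) and run only the decoder.
with torch.no_grad():
    z = torch.randn(1, latent_dim)
    x_generated = decoder(z)
```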
Conditional VAE
The math is almost the same; please find more details in the Tutorial on Variational Autoencoders. Here is the network structure:

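As a sketch of the generation side of a conditional VAE (the condition \(c\), here a one-hot digit label, and all dimensions are assumptions): the condition is concatenated to the encoder input during training and to \(z\) before decoding.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, cond_dim = 2, 784, 10

# Conditional decoder: takes [z, c] instead of z alone. (During training, the
# encoder likewise takes [X, c] as input.)
cond_decoder = nn.Sequential(nn.Linear(latent_dim + cond_dim, 128), nn.ReLU(),
                             nn.Linear(128, data_dim), nn.Sigmoid())

c = torch.zeros(1, cond_dim)
c[0, 3] = 1.0                            # condition on the digit "3"
z = torch.randn(1, latent_dim)           # z ~ N(0, I)
x_generated = cond_decoder(torch.cat([z, c], dim=1))
```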