\n",
" \n",
"\n",
"\n",
"\n",
"### Derivation of ELBO\n",
"\n",
"Posterior Probability $p(z|x)$, which can be expressed as:\n",
"\n",
"\\begin{eqnarray}\n",
"p(z|x)&=&\\frac{p(x|z)p(z)}{p(x)}\\nonumber\\\\\n",
"&=& \\frac{p(x|z)p(z)}{\\int p(x|z)p(z)}\n",
"\\end{eqnarray}\n",
"\n",
"where $\\int p(x|z)p(z)$, which is the marginal, can be intractable and cannot be computed directly. One way to compute the overall solution $p(z|x)$ is using Monte Carlo methods (such as sampling). The method used in this notebook (and the underlying VAE paper) is ***variational inference***. \n",
"\n",
"The idea is to identify another proxy distribution $q(z|x)$ that reasonably approximates $p(z|x)=p(x|z)p(z)$. i.e. if the KL-divergence between two pdfs, $q(x)$ and $p(z|x)$ is denoted by\n",
"\n",
"$$\\mathrm{KL}(q(x)||p(z|x))$$\n",
"\n",
"\n",
"it can be minimized by selecting an alternative pdf $q(z|x)$, which is a good proxy for $p(z|x)$. But \n",
"\n",
"\\begin{eqnarray}\n",
"\\mathrm{KL}(q(z|x)||p(z|x)) &=& -\\int q(z|x)\\log\\frac{p(z|x)}{q(z|x)} dz\\nonumber\\\\\n",
" &=& -\\int q(z|x)\\log\\frac{p(x|z)p(z)}{p(x)q(z|x)} dz\\nonumber\\\\\n",
" &=& -\\int q(z|x)\\log\\frac{p(x|z)p(z)}{q(z|x)}dz + \\int_{z} q(z|x)\\log p(x)dz \\nonumber\\\\\n",
" &=& -\\int q(z|x)\\log\\frac{p(x|z)p(z)}{q(z|x)} + \\log p(x)\\int_{z} q(z|x)dz\\nonumber\\\\ \n",
" &=& -\\int q(z|x)\\log\\frac{p(x|z)p(z)}{q(z|x)}dz + \\log p(x)\\nonumber\\\\\n",
" &=& -\\int q(z|x)\\log\\frac{p(z)}{q(z|x)}dz -\\int q(z|x)\\log{p(x|z)}dz + \\log p(x) \n",
"\\end{eqnarray}\n",
"\n",
"\n",
"Given that $\\mathrm{KL}\\left(q(z|x)||p(z|x)\\right)\\geq 0$, \n",
"\n",
"\n",
"\n",
"\n",
"\\begin{eqnarray}\n",
"-\\int q(z|x)\\log\\frac{p(z)}{q(z|x)}dz -\\int q(z|x)\\log{p(x|z)}dz + \\log p(x) &\\geq& 0 \\\\\n",
"\\log p(x) &\\geq& \\int q(z|x)\\log\\frac{p(z)}{q(z|x)}dz + \\int q(z|x)\\log{p(x|z)}dz\\\\\n",
"\\log p(x) &\\geq& - \\mathrm{KL}(q(z|x)||p(z)) + \\int q(z|x)\\log p(x|z)dz \\nonumber\\\\\n",
"\\log p(x) &\\geq& - \\mathrm{KL}(q(z|x)||p(z)) + \\mathbb{E}_{q(z|x)}\\left[\\log p(x|z)\\right] \\nonumber\\\\\n",
"\\end{eqnarray}\n",
"\n",
"This is the *variational lower-bound*, or the evidence of lower bound (ELBO). This remains as the objective function for the VAE. However, frameworks like TensorFlow or PyTorch need a loss function to be minimized. Maximising the log likelihood of the model evidence $p(x)$ is same as minimizing the $-\\log p(x)$. The first term of the ELBO, namely, $\\mathrm{KL}(q(z|x)||p(z))$ is the *regularising* term and constrains the posterior distribution. The second term of the ELBO models the reconstruction loss. \n",
"\n",
"Now, this leaves fair bit of freedom on the choice of the prior $p(z)$. Let's assume:\n",
"\n",
"\n",
"$$\n",
"p(z)={\\cal N}(\\mu_p, \\sigma_p^2)\n",
"$$\n",
"\n",
"and \n",
"\n",
"$$\n",
"q(z|x)={\\cal N}(\\mu_q, \\sigma_q^2)\n",
"$$\n",
"\n",
"Thus, \n",
"\n",
"$$\n",
"p(z)=\\frac{1}{\\sqrt{2\\pi\\sigma_p^2}}\\exp\\left(\\frac{(x-\\mu_p)^2}{2\\sigma_p^2}\\right)\n",
"$$\n",
"\n",
"and \n",
"\n",
"$$\n",
"q(z|x)=\\frac{1}{\\sqrt{2\\pi\\sigma_q^2}}\\exp\\left(\\frac{(x-\\mu_q)^2}{2\\sigma_q^2}\\right)\n",
"$$\n",
"\n",
"The direct derivation of $\\mathrm{KL}(q(z|x)||p(z))$ will give (with some simplifications)\n",
"\n",
"\n",
"$$\n",
"-\\mathrm{KL}(q(z|x)||p(z)) = \\log\\frac{\\sigma_q}{\\sigma_p} - \\frac{\\left(\\log\\sigma_q^2-(\\mu_p-\\mu_q)^2\\right)}{2\\sigma_p^2} +\\frac{1}{2} \n",
"$$\n",
"\n",
"By fixing the prior distribution $p(z)={\\cal N}(0,1^2)$, \n",
"\n",
"$$\n",
"-\\mathrm{KL}(q(z|x)||p(z)) = \\frac{1}{2}\\left[ 1 + \\log\\sigma_q^2 - \\sigma_q^2 -\\mu_q^2\\right]\n",
"$$\n",
"\n",
"Hence, the new ELBO is\n",
"\n",
"\n",
"$$\n",
"\\frac{1}{2}\\left[ 1 + \\log\\sigma_q^2 - \\sigma_q^2 -\\mu_q^2\\right] + \\mathbb{E}_{q(z|x)}\\left[\\log p(x|z)\\right] \n",
"$$\n",
"\n",
"\n",
"Let $J, B$ and $\\cal{L}$ be the dimension of the latent space, and the batch size over which the sampling is done. The loss function we need to minimise (from the point of implementation) is\n",
"\n",
"$$\n",
"{\\cal L} = - \\sum_{j=1}^J \\frac{1}{2}\\Bigl[ 1 + \\log\\sigma_j^2 - \\sigma_j^2 -\\mu_j^2\\Bigr] - \\frac{1}{B}\\sum_{l}\\mathbb{E}_{q(z|x_i)}\\left[\\log p(x_i|z^{(i,l)})\\right] \n",
"$$\n",
"\n",
"\n",
"\n",
"This can be observed in the code implementation below (see function implementation ``loss_function`` below)\n",
"\n",
"### Reparameterisation\n",
"\n",
"A valid reparameterization would be \n",
"\n",
"$$\n",
"z = \\mu+\\sigma\\epsilon\n",
"$$\n",
"\n",
"\n",
"where $\\epsilon$ is an auxiliary noise variable $\\epsilon\\sim{\\cal{N}}(0, 1)$, which actually enables the reparameterization technique. Although it is possible to use $\\sigma$ or more specifically $\\sigma^2$, working on log scales improves the stability. i.e. \n",
"\n",
"\\begin{eqnarray}\n",
"p &=& \\log(\\sigma^2)\\\\\n",
"&=& 2 \\log(\\sigma)\n",
"\\end{eqnarray}\n",
"\n",
"To get the log standard deviation, $\\log(\\sigma)$, \n",
"\\begin{eqnarray}\n",
"\\log(\\sigma) &=& p/2 \\\\\n",
"\\label{eqn:log_sigma}\n",
"\\end{eqnarray} \n",
"\n",
"and hence\n",
"\n",
"$$\n",
"\\sigma = \\exp^{p/2}\n",
"$$\n",
"\n",
"The resulting estimator (or the loss function) becomes (see Page 5 of [Auto-Encoding Variational Bayes Paper](https://arxiv.org/abs/1312.6114)),\n",
"\n",
"$$\n",
"-\\text{KLD} = \\frac{1}{2}\\sum_{j=1}^{J}(1+\\log(\\sigma_j^i)^2 - (\\mu_j^i)^2 -(\\sigma_j^i)^2)\n",
"$$\n",
"\n",
"\n",
"It is important to see that the KL divergence can be computed and differentiated without estimation. This is a very remarkable thing (no esimtation!).\n",
"\n",
"The $\\boldsymbol{\\epsilon}$ must be sampled from a zero-mean, unit-variance Gaussian distribution, and should be of the same size as $\\boldsymbol{\\sigma}$. \n",
"\n",
"\n",
"
\n",
"\n",
"\n",
" \n",
"