{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stochastic Variational Bayes\n",
"=======================\n",
"\n",
"This notebook implements Example 1 from the FMRIB tutorial on Variational Bayes II: Stochastic Variational Bayes (\"fitting a Gaussian distribution).\n",
"\n",
"We assume we have data drawn from a Gaussian distribution with true mean $\\mu$ and true precision $\\beta$:\n",
"\n",
"$$\n",
"P(y_n | \\mu, \\beta) = \\frac{\\sqrt{\\beta}}{\\sqrt{2\\pi}} \\exp{-\\frac{\\beta}{2} (y_n - \\mu)^2}\n",
"$$\n",
"\n",
"One interpretation of this is that our data consists of repeated measurements of a fixed value ($\\mu$) combined with Gaussian noise with standard deviation $\\frac{1}{\\sqrt{\\beta}}$.\n",
"\n",
"Here's how we can generate some sample data from this model in Python:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Ground truth parameters\n",
"# We infer the precision, BETA, but it is useful to\n",
"# derive the variance and standard deviation from it\n",
"MU_TRUTH = 42\n",
"BETA_TRUTH = 1.0\n",
"VAR_TRUTH = 1/BETA_TRUTH\n",
"STD_TRUTH = np.sqrt(VAR_TRUTH)\n",
"\n",
"# Observed data samples are generated by Numpy from the ground truth\n",
"# Gaussian distribution. Reducing the number of samples should make\n",
"# the inference less 'confident' - i.e. the output variances for\n",
"# MU and BETA will increase\n",
"N = 100\n",
"DATA = np.random.normal(MU_TRUTH, STD_TRUTH, [N])\n",
"print(\"Data samples are:\")\n",
"print(DATA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the 'signal + noise' interpretation we can view this as noisy measurements (red crosses) of a constant signal (green line):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from matplotlib import pyplot as plt\n",
"plt.figure()\n",
"plt.plot(DATA, \"rx\")\n",
"plt.plot([MU_TRUTH] * N, \"g\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with analytic Variational Bayes, we need to choose an approximate form for our priors and posteriors. However we have more freedom in the stochastic method since we are not limited by the requirement that the distributions be conjugate with respect to the likelihood. \n",
"\n",
"We will choose a multivariate normal distribution (MVN) over the two parameters $\\mu$ and $log(\\frac{1}{\\beta})$ for prior and posterior. Inferring the log of the noise variance is useful as it avoid the possibility of negative values during the optimization which would make the likelihood ill-defined.\n",
"\n",
"The choice of an MVN means that we can allow for covariance (correlation) between the noise and signal parameters. This is unlike the analytic case where the posterior had to be factorised over these two parameters.\n",
"\n",
"An MVN distribution for $N$ parameters is defined by a vector of $N$ mean values and an $N \\times N$ covariance matrix. For the prior we will use the following values:\n",
"\n",
"$$\\textbf{m}_0 = \\begin{bmatrix} \\mu_0 \\\\ b_0 \\end{bmatrix}$$\n",
"\n",
"$$\\textbf{C}_0 = \\begin{bmatrix} v_0 & 0 \\\\ 0 & w_0 \\end{bmatrix}$$\n",
"\n",
"Note that we are not assuming any prior covariance. \n",
"\n",
"We define some suitable relatively uninformative prior values here:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m0 = 0.0\n",
"v0 = 100000.0\n",
"b0 = 0.0\n",
"w0 = 100000.0\n",
"print(\"Priors: P(mu, log(1/beta)) = MVN([%f, %f], [[%f, 0], [0, %f]])\" % (m0, v0, b0, w0))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stochastic VB is based around minimising the free energy so we will need to implement the calculation of the free energy. We will be using the TensorFlow framework to perform the minimisation so the calculation must be in terms of tensors (multidimensional arrays). In our case the following constant tensors must be defined (where $N$ is the number of data values we have:\n",
"\n",
" - Data samples: $[N]$\n",
" - Prior mean: $[2]$\n",
" - Prior covariance: $[2 \\times 2]$\n",
"\n",
"We must also define *variable* tensors for the posterior - TensorFlow will allow these to change during the optimization in order to minimise the cost (free energy):\n",
"\n",
" - Posterior mean: $[2]$\n",
" - Posterior covariance: $[2 \\times 2]$\n",
"\n",
"The posterior covariance must be a positive-definite matrix - since the optimizer does not know this, it is possible that invalid values may arise and the optimization will fail. To get around this restriction we build the covariance matrix from its Chlolesky decomposition.\n",
"\n",
"$$\\textbf{C} = (\\textbf{C}_{chol})^T\\textbf{C}_{chol}$$\n",
"\n",
"$\\textbf{C}_{chol}$ must have positive diagonal elements, so we define the underlying variables for these elements in terms of the log and then form the full $\\textbf{C}_{chol}$ matrix as a sum of the exponentials of the log-diagonal elements, and independent variables for the off-diagonal components.\n",
"\n",
"The code for this is below. Note that we need to define initial values for the posterior variables. It turns out that in the stochastic method it is important that the initial poster variance is not too large so although the prior is not informative, the initial posterior is. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" import tensorflow.compat.v1 as tf\n",
"except ImportError:\n",
" import tensorflow as tf\n",
"import numpy as np\n",
"data = tf.constant(DATA, dtype=tf.float32)\n",
"prior_means = tf.constant([m0, v0], dtype=tf.float32)\n",
"prior_covariance = tf.constant([[v0, 0.0], [0.0, w0]], dtype=tf.float32)\n",
"\n",
"post_means_init = [0.0, 0.0]\n",
"post_covariance_init = [[1.0, 0.0], [0.0, 1.0]]\n",
"\n",
"chol_off_diag = tf.Variable([[0, 0], [0, 0]], dtype=tf.float32)\n",
"# Comment in this line if you do NOT want to infer parameter covariances\n",
"#chol_off_diag = tf.constant([[0, 0], [0, 0]], dtype=tf.float32)\n",
"\n",
"chol_log_diag = tf.Variable(tf.log(tf.diag_part(post_covariance_init)), dtype=tf.float32)\n",
"chol_diag = tf.diag(tf.sqrt(tf.exp(chol_log_diag)))\n",
"post_covariance_chol = tf.add(chol_diag, tf.matrix_band_part(chol_off_diag, -1, 0))\n",
"\n",
"post_covariance = tf.matmul(tf.transpose(post_covariance_chol), post_covariance_chol)\n",
"post_means = tf.Variable(post_means_init, dtype=tf.float32)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point we will show how we actually evaluate calculations in TensorFlow. TensorFlow works by defining a graph of operations, but no calculation actually takes place until you ask it to evaluate a tensor (which might be defined in terms of other tensors or Variables. For example we can evaluate the inital values of our posterior as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sess = tf.Session()\n",
"sess.run(tf.initialize_all_variables())\n",
"print(\"Initial posterior mean: %s\" % sess.run(post_means))\n",
"print(\"Initial posterior covariance:\\n%s\" % sess.run(post_covariance))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next requirement is to be able to obtain a sample of *predicted* data values from the posterior. In stochastic VB this is used to approximate the integrals in the calculation of the free energy. It is *not* related to the number of data samples we have. Smaller values give quicker calculation, but may result in a noisy, non-convergent optimization. We'll start off with a sample size of 5, but you can change this later if you want.\n",
"\n",
"Note the use of the 'reparameterization trick' to express the samples as the scaling of a fixed MVN distribution - this improves the ability of the optimizer to choose better values for the next iteration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of samples from the posterior\n",
"S=5\n",
"\n",
"# eps is a sample from a Gaussian with mean 0 and variance 1\n",
"eps = tf.random_normal((2, S), 0, 1, dtype=tf.float32)\n",
"\n",
"# Start off each sample with the current posterior mean\n",
"# post_samples is now a tensor of shape [2, n_samples]\n",
"samples = tf.tile(tf.reshape(post_means, [2, 1]), [1, S])\n",
"\n",
"# Now add the random sample scaled by the covariance\n",
"post_samples = tf.add(samples, tf.matmul(post_covariance_chol, eps))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how this works by evaluating a sample on our initial posterior"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(sess.run(post_samples))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we have 5 samples for each of the two parameters - the top row are samples values for $\\mu$ and the bottom row are samples of the noise log variance $-\\log{\\beta}$.\n",
"\n",
"Next we need to implement the free energy calculation. This is the sum of the reconstruction loss (the extent to which the posterior matches the data) and the latent loss (the extent to which the posterior matches the prior). Let's do the reconstruction loss first which is the expected value of the log-likelihood across the posterior distribution. Remember that the expectation integral is being approximated using the samples we have from the posterior. So we evaluate the log-likelihood for *each* set of samples and then take the mean across all the samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# These are out sample of values for the mu parameter\n",
"mu_samples = post_samples[0]\n",
"\n",
"# Get the current estimate of the noise variance remembering that\n",
"# we are inferring the log of the noise precision, beta\n",
"log_noise_var = -post_samples[1]\n",
"noise_var = tf.exp(log_noise_var)\n",
"\n",
"# Each sample value predicts the full set of values in the data sample.\n",
"# For our constant-signal model, the prediction is simply a set of \n",
"# constant values. The prediction tensor will have shape [S, N]\n",
"# where S is the sample size and N is the number of data values\n",
"prediction = tf.tile(tf.reshape(mu_samples, [S, 1]), [1, N])\n",
"\n",
"# To calculate the likelihood we need the sum of the squared difference between the data \n",
"# and the prediction. This gives a value for each posterior sample so has shape [S]\n",
"sum_square_diff = tf.reduce_sum(tf.square(data - prediction), axis=-1)\n",
"\n",
"# Now we calculate the likelihood for each posterior sample (shape [S])\n",
"# Note that we are ignoring constant factors such as 2*PI here as they \n",
"# are just an fixed offset and do not affect the optimization \n",
"log_likelihood = 0.5 * (-log_noise_var * tf.to_float(N) - sum_square_diff / noise_var)\n",
"\n",
"# Finally to evaluate the expectation value we take the mean across all the posterior\n",
"# samples\n",
"reconstr_loss = -tf.reduce_mean(log_likelihood)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Reconstruction loss is: %f\" % sess.run(reconstr_loss))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On to the latent loss, this is the log-KL divergence between the posterior and prior. Since\n",
"both our prior and posterior are MVN distributions, we can use a known analytic result as\n",
"given in the tutorial:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"C = post_covariance\n",
"C0 = prior_covariance\n",
"C0_inv = tf.matrix_inverse(C0)\n",
"\n",
"# m - m0 as row and column vectors\n",
"m_minus_m0 = tf.reshape(tf.subtract(post_means, prior_means), [-1, 1])\n",
"m_minus_m0_T = tf.reshape(tf.subtract(post_means, prior_means), [1, -1])\n",
"\n",
"term1 = tf.trace(tf.matmul(C0_inv, C))\n",
"term2 = -tf.log(tf.matrix_determinant(C) / tf.matrix_determinant(C0))\n",
"\n",
"# Size of the MVN distribution\n",
"term3 = -2\n",
"term4 = tf.matmul(tf.matmul(m_minus_m0_T, C0_inv), m_minus_m0)\n",
" \n",
"latent_loss = 0.5 * (term1 + term2 + term3 + term4)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Latent loss is: %f\" % sess.run(latent_loss))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if you re-evaluate the reconstruction loss you get a different answer each time - because it depends on the sample from the posterior. But the latent loss is being calculated analytically and is independent of the sample, so we get the same answer each time.\n",
"\n",
"(If you increase the posterior sample size you will get a more consistent value of the reconstruction loss - try it!)\n",
"\n",
"Now we define the total cost (free energy) as the sum of the latent and reconstruction costs:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cost = reconstr_loss + latent_loss\n",
"print(\"Total cost is: %f\" % sess.run(cost))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To optimize the posterior we need to iteratively minimise the cost using TensorFlow's built in gradient optimizer. We use the Adam optimizer for this, which is a refinement of the standard Gradient Descent optimizer. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"optimizer = tf.train.AdamOptimizer(learning_rate=0.4)\n",
"minimizer = optimizer.minimize(cost)\n",
"sess.run(tf.global_variables_initializer())\n",
"\n",
"cost_history = []\n",
"for epoch in range(5000):\n",
" sess.run(minimizer)\n",
" cost_history.append(float(sess.run(cost)))\n",
" print(\"Epoch %i: cost=%f, posterior means=%s\" % (epoch+1, sess.run(cost), sess.run(post_means)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should find our estimates for mu and beta are as expected, remembering that we chose\n",
"to estimate $-\\log{\\beta}$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"final_means = sess.run(post_means)\n",
"final_covariance = sess.run(post_covariance)\n",
"print(\"Estimate for mu: %f (variance: %f)\" % (final_means[0], final_covariance[0, 0]))\n",
"print(\"Estimate for beta: %f\" % np.exp(final_means[1]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The convergence of the minimisation in this case is rather slow - considering the simplicity of the problem. This is partly due to the large difference between the initial posterior and the true value for $\\mu$. However, frameworks like TensorFlow are not really optimized for inferring such a small number of parameters and the method. If we plot the cost history we can see that it gets stuck in a local minimum for some time:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from matplotlib import pyplot as plt\n",
"plt.plot(cost_history)\n",
"plt.ylim(50000, 51000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also plot our inferred value of $\\mu$ with error bars $\\pm 2\\sigma$ derived from the inferred variance on this parameter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure()\n",
"plt.plot(DATA, \"rx\")\n",
"plt.plot([MU_TRUTH] * N, \"g\")\n",
"plt.plot([final_means[0]] * N, \"b\")\n",
"plt.plot([final_means[0]+2*np.sqrt(final_covariance[0, 0])] * N, \"b--\")\n",
"plt.plot([final_means[0]-2*np.sqrt(final_covariance[0, 0])] * N, \"b--\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main advantage of the SVB approach is the flexibility - as well as losing the restriction to conjugate priors, it is also much easier to implement more advanced types of parameters and priors, for example global parameters or spatial regularization priors. While these can (and have been) incorporated into the analytic VB framework, they require update equations to be re-derived whereas the SVB method simply needs an expression for the cost which is generally more straightforward.\n",
"\n",
"Some things you might like to try with this example:\n",
"\n",
" - Do not infer the covariance between the parameters (see commented out code in the definition of the posterior). This generally makes the convergence less noisy.\n",
" - Modify the number of samples and the learning rate and see how they affect the convergence\n",
" - Try implementing the stochastic form of the latent loss rather than the analytic result for an MVN that we have used here (see Box 1 in the tutorial). This is necessary in cases where we do not assume an MVN structure for the prior and posterior. Essentially you need to calculate the log PDF for the prior and posterior for each sample and take the mean over all the samples.\n",
" \n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}