# Using Gaussian Distribution for Online Learning/Sequential Learning in Bayesian Inference

We already discussed how online learning works using conjugate distributions, with the Binomial distribution as the likelihood and the Beta distribution as its conjugate prior. In this post, we will use the Gaussian distribution for online learning in Bayesian inference. The conjugate prior of a Gaussian likelihood is Gaussian itself; that is why we call the Gaussian distribution self-conjugate. Let's try to derive it.

Given trial results $D=\{ x_1,x_2,x_3,...,x_n\}$, from Bayes' formula we get: $P(\theta_{new}|D)=\frac{P(D|\theta_{old})P(\theta_{old})}{P(D)}$

We will derive the posterior $P(\theta_{new}|D)$, given the likelihood $P(D|\theta_{old})$ and the prior distribution $P(\theta_{old})$. The parameters $\theta$ of a Gaussian are $\mu$ and $\sigma^2$. In this post, we will demonstrate how to calculate the posterior $P(\theta_{new}|D)$ under the assumption that the likelihood variance $\sigma^2$ and the prior parameters $\mu_0, \sigma_0^2$ are known. Thus, the only parameter we will learn is $\mu$. We can ignore the marginal probability $P(D)$ for now, since it is only a constant for normalization. Proceeding from the formula above, we can do as follows. $P(\theta_{new}|D)=\frac{P(D|\theta_{old})P(\theta_{old})}{P(D)}\\\\ P(\mu_{new}|D)=\frac{P(D|\mu_{old})P(\mu_{old})}{P(D)}\\\\ P(\mu_{new}|D) \propto P(D|\mu_{old})P(\mu_{old})\\\\ P(\mu_{new}|D) \propto \prod_{i=1}^{n} [P(x_i|\mu_{old})]P(\mu_{old})\\\\ P(\mu_{new}|D) \propto [\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_2-\mu)^2}{2\sigma^2}}...\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x_n-\mu)^2}{2\sigma^2}}]\frac{1}{\sqrt{2\pi}\sigma_0}e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}\\\\ P(\mu_{new}|D) \propto (\frac{1}{\sqrt{2\pi}\sigma})^n[e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}]\frac{1}{\sqrt{2\pi}\sigma_0}e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}$

For the prior $\frac{1}{\sqrt{2\pi}\sigma_0}e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}$, we can think of it as the distribution of the current $\mu$ given the Gaussian parameters $\mu_0, \sigma_0^2$, which are manually specified/initialized by us. As for the term $(\frac{1}{\sqrt{2\pi}\sigma})^n$ in the likelihood and the term $\frac{1}{\sqrt{2\pi}\sigma_0}$ in the prior, they are just constants, so we can drop them, since we care about the shape of the distribution, not its magnitude. Then, we can proceed as follows. $P(\mu_{new}|D) \propto e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}}\\\\ P(\mu_{new}|D) \propto e^{-[\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}+\frac{(\mu-\mu_0)^2}{2\sigma_0^2}]}\\\\ P(\mu_{new}|D) \propto e^{-[\sum_{i=1}^{n}\frac{1}{2\sigma^2}(x_i^2+\mu^2-2x_i\mu)+\frac{1}{2\sigma_0^2}(\mu^2+\mu_0^2-2\mu_0\mu)]}$

We know that the product of two Gaussians is another (unnormalized) Gaussian. Thus, we will rearrange the equation above by gathering the terms involving $\mu$: all terms in $\mu^2, \mu^1, \mu^0$. Here we go. $P(\mu_{new}|D) \propto e^{[-\frac{\mu^2}{2}(\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2})+\mu(\frac{\mu_0}{\sigma_0^2}+\frac{\sum_{i=1}^{n}x_i}{\sigma^2})-(\frac{\mu_0^2}{2\sigma_0^2}+\frac{\sum_{i=1}^{n}x_i^2}{2\sigma^2})]}$
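As a quick sanity check, the exponent after gathering terms must equal the original exponent for any value of $\mu$. A minimal sketch in plain Python, with toy values chosen arbitrarily for illustration:

```python
import math
import random

# Toy values, assumed purely for illustration.
random.seed(0)
n = 5
x = [random.gauss(2.0, 1.0) for _ in range(n)]
mu, mu0 = 1.3, 0.0           # a candidate mu and the prior mean
sigma2, sigma0_2 = 1.0, 4.0  # known likelihood variance and prior variance

# Exponent before collecting terms: -[sum_i (x_i - mu)^2 / (2 sigma^2) + (mu - mu0)^2 / (2 sigma0^2)]
lhs = -sum((xi - mu) ** 2 for xi in x) / (2 * sigma2) \
      - (mu - mu0) ** 2 / (2 * sigma0_2)

# Exponent after gathering the mu^2, mu^1, mu^0 terms.
rhs = (-mu ** 2 / 2 * (1 / sigma0_2 + n / sigma2)
       + mu * (mu0 / sigma0_2 + sum(x) / sigma2)
       - (mu0 ** 2 / (2 * sigma0_2) + sum(xi ** 2 for xi in x) / (2 * sigma2)))

assert math.isclose(lhs, rhs)
```

Changing `mu` to any other value keeps the two sides equal, which is exactly what the regrouping claims.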

Then, we will compare this with a new Gaussian distribution over $\mu$ with new parameters $\mu_{new}, \sigma_{new}^2$. By doing this, we can derive $\mu_{new}$. This method is called "completing the square". Let's write this new Gaussian distribution first. $P(\mu|\mu_{new}, \sigma_{new}^2) \propto e^{-\frac{(\mu-\mu_{new})^2}{2\sigma_{new}^2}}\\\\ P(\mu|\mu_{new}, \sigma_{new}^2) \propto e^{-\frac{1}{2\sigma_{new}^2}(\mu^2-2\mu\mu_{new}+\mu_{new}^2)}$

To calculate $\mu_{new}$, we compare our two Gaussian expressions and match the terms by the coefficients of $\mu^2,\mu^1,\mu^0$. Here we go. $e^{[-\frac{\mu^2}{2}(\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2})+\mu(\frac{\mu_0}{\sigma_0^2}+\frac{\sum_{i=1}^{n}x_i}{\sigma^2})-(\frac{\mu_0^2}{2\sigma_0^2}+\frac{\sum_{i=1}^{n}x_i^2}{2\sigma^2})]}=e^{-\frac{1}{2\sigma_{new}^2}(\mu^2-2\mu\mu_{new}+\mu_{new}^2)}$

Matching the $\mu^2$ terms, we get $\sigma_{new}^2$. $\frac{-\mu^2}{2\sigma_{new}^2}=\frac{-\mu^2}{2}(\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2})\\\\ \frac{1}{\sigma_{new}^2}=\frac{1}{\sigma_0^2}+\frac{n}{\sigma^2}\\\\ \sigma_{new}^2=\frac{\sigma^2\sigma^2_{_0}}{n\sigma_0^2+\sigma^2}=\frac{1}{\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}}\\\\ \boxed{ \sigma_{new}^2=(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2})^{-1}}$
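The two forms of $\sigma_{new}^2$ above are algebraically identical. A small sketch, with toy values assumed for illustration, confirms this and also shows that observing data always shrinks the posterior variance below the prior variance:

```python
# Toy values, assumed for illustration: known variance, prior variance, sample size.
sigma2, sigma0_2, n = 1.0, 4.0, 10

form_a = (n / sigma2 + 1 / sigma0_2) ** -1            # precision (inverse) form
form_b = sigma2 * sigma0_2 / (n * sigma0_2 + sigma2)  # ratio form

assert abs(form_a - form_b) < 1e-12
# The posterior variance is smaller than the prior variance sigma0_2.
assert form_a < sigma0_2
```

The precision form makes the intuition visible: precisions (inverse variances) add, one $\frac{1}{\sigma^2}$ per observation plus the prior precision $\frac{1}{\sigma_0^2}$.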

Then, matching the $\mu$ terms (and writing $\mu_{_{MLE}}=\frac{1}{n}\sum_{i=1}^{n}x_i$ for the sample mean, so that $\sum_{i=1}^{n}x_i=n\mu_{_{MLE}}$), we get: $\frac{-2\mu\mu_{new}}{-2\sigma_{new}^2}=\mu(\frac{\sum_{i=1}^{n}x_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2})\\\\ \frac{\mu_{new}}{\sigma_{new}^2}=\frac{\sum_{i=1}^{n}x_i}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\\\\ \frac{\mu_{new}}{\sigma_{new}^2}=\frac{n\mu_{_{MLE}}}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}\\\\ \frac{\mu_{new}}{\sigma_{new}^2}=\frac{\sigma_0^2n\mu_{_{MLE}}+\sigma^2\mu_0}{\sigma^2\sigma_0^2}\\\\ \sigma_{new}^2=\frac{\sigma^2\sigma_0^2}{\sigma_{0}^2n\mu_{_{MLE}}+\sigma^2\mu_0}\mu_{new}$

By equating the two expressions for $\sigma_{new}^2$ obtained from matching $\mu^2$ and $\mu$, we can solve for $\mu_{new}$. Here we go. $\frac{\sigma^2\sigma_0^2}{n\sigma_0^2+\sigma^2}=\frac{\sigma^2\sigma_0^2}{\sigma_0^2 n \mu_{_{MLE}}+\sigma^2\mu_0}\mu_{new}\\\\ \mu_{new}=\frac{\sigma_0^2n\mu_{_{MLE}}+\sigma^2\mu_0}{n\sigma_0^2+\sigma^2}=\frac{\sigma^2\sigma_0^2}{n\sigma_0^2+\sigma^2}[\frac{\mu_0}{\sigma_0^2}+\frac{n\mu_{_{MLE}}}{\sigma^2}]\\\\ \boxed {\mu_{new}=\sigma_{new}^2[\frac{\mu_0}{\sigma_0^2}+\frac{n\mu_{_{MLE}}}{\sigma^2}]}$
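We can check the boxed results end to end by comparing the closed-form posterior against a brute-force posterior computed on a grid as likelihood × prior. A sketch using NumPy, with all numbers (true mean, prior, sample size) assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: known likelihood variance and a manually chosen prior.
sigma2, mu0, sigma0_2 = 1.0, 0.0, 4.0
x = rng.normal(2.0, np.sqrt(sigma2), size=20)
n, mu_mle = len(x), x.mean()

# Closed-form posterior parameters from the boxed results.
sigma_new2 = 1.0 / (n / sigma2 + 1.0 / sigma0_2)
mu_new = sigma_new2 * (mu0 / sigma0_2 + n * mu_mle / sigma2)

# Brute-force posterior on a grid of mu values: likelihood * prior, normalized.
mu_grid = np.linspace(-5.0, 5.0, 20001)
dmu = mu_grid[1] - mu_grid[0]
log_post = (-((x[:, None] - mu_grid[None, :]) ** 2).sum(axis=0) / (2 * sigma2)
            - (mu_grid - mu0) ** 2 / (2 * sigma0_2))
post = np.exp(log_post - log_post.max())
post /= post.sum() * dmu

# Closed-form Gaussian with the derived parameters, normalized the same way.
closed = np.exp(-(mu_grid - mu_new) ** 2 / (2 * sigma_new2))
closed /= closed.sum() * dmu

assert np.allclose(post, closed, atol=1e-6)
```

The grid posterior and the derived Gaussian coincide, so the completing-the-square algebra checks out numerically.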

Hooray! We just successfully derived the posterior, which is a new Gaussian distribution with $\mu_{new}=\sigma_{new}^2[\frac{\mu_0}{\sigma_0^2}+\frac{n\mu_{_{MLE}}}{\sigma^2}]$ and $\sigma_{new}^2=\frac{\sigma^2\sigma^2_{_0}}{n\sigma_0^2+\sigma^2}$. When new trial data arrive, we can simply update the parameters iteratively.
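The iterative update can be sketched as follows: process one observation at a time with $n=1$, feeding the current posterior back in as the next prior. The setup (true mean, known variance, prior) is assumed for illustration; the loop reproduces the batch formula exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: a stream of data from N(true_mu, sigma2), with sigma2 known.
true_mu, sigma2 = 3.0, 1.0
mu0, sigma0_2 = 0.0, 10.0   # our prior belief about mu
xs = rng.normal(true_mu, np.sqrt(sigma2), size=500)

# Sequential updates: each posterior becomes the prior for the next point.
mu_post, var_post = mu0, sigma0_2
for x in xs:
    var_new = 1.0 / (1.0 / sigma2 + 1.0 / var_post)        # variance update, n = 1
    mu_post = var_new * (mu_post / var_post + x / sigma2)  # mean update, n = 1
    var_post = var_new

# The batch update with all n points at once must give the same answer.
n = len(xs)
var_batch = 1.0 / (n / sigma2 + 1.0 / sigma0_2)
mu_batch = var_batch * (mu0 / sigma0_2 + n * xs.mean() / sigma2)

assert np.isclose(mu_post, mu_batch) and np.isclose(var_post, var_batch)
```

This is exactly what makes the conjugate pair useful for online learning: no stored history is needed, only the current $(\mu_{new}, \sigma_{new}^2)$.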

For other cases, such as when the known parameter is $\mu$ instead of $\sigma^2$ as derived here, or when no parameter is known, we can derive the posterior with a similar process. The Wikipedia page on conjugate priors has a nice summary table of conjugate distributions (likelihood – prior/posterior pairs for online learning). Try to look it up.