We already discussed how online learning works here, using conjugate distributions with the Binomial distribution as the likelihood and the Beta distribution as the conjugate prior. In this post, we will use the Gaussian distribution for online learning in Bayesian inference. For a Gaussian likelihood with known variance, the conjugate prior on the mean is itself Gaussian. That’s why we call the Gaussian distribution self-conjugate. Let’s derive it.
Given a trial result $x$, Bayes' formula gives:

$$P(\mu \mid x) = \frac{P(x \mid \mu)\,P(\mu)}{P(x)}$$
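We will derive the posterior $P(\mu \mid x)$, given the likelihood $P(x \mid \mu)$ and the prior distribution $P(\mu)$. The parameters of the Gaussian in this case are $\mu$ and $\sigma^2$. In this post, we will demonstrate how to calculate the posterior under the assumption that $\sigma^2$ is known. Thus, we will only learn the parameter $\mu$. We will ignore the marginal probability $P(x)$ for now, since it is only a normalizing constant. Continuing from Bayes' formula, we can write:

$$P(\mu \mid x) \propto P(x \mid \mu)\,P(\mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\!\left(-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right)$$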
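For the prior $P(\mu)$, we can think of it as a distribution over the current $\mu$, given Gaussian parameters $\mu_0$ and $\sigma_0^2$ that we specify/initialize manually. The term $\frac{1}{\sqrt{2\pi\sigma^2}}$ in the likelihood and the term $\frac{1}{\sqrt{2\pi\sigma_0^2}}$ in the prior are just constants. We can ignore them, since we care about the shape of the distribution over $\mu$, not its overall scale. Then we can proceed as follows.

$$P(\mu \mid x) \propto \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2} - \frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right)$$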
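We know that the product of two Gaussian densities is, up to a normalizing constant, another Gaussian. Thus, we will rewrite the exponent above by expanding the squares and gathering the terms in $\mu^2$ and the terms in $\mu$. Everything that does not depend on $\mu$ is absorbed into a constant. Here we go.

$$P(\mu \mid x) \propto \exp\!\left(-\frac{1}{2}\left[\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{x}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right)\mu\right] + \text{const}\right)$$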
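Then, we compare this with a new Gaussian distribution over $\mu$ with new parameters $\mu_{\text{new}}$ and $\sigma_{\text{new}}^2$. By doing this, we can derive $\mu_{\text{new}}$ and $\sigma_{\text{new}}^2$. This method is called “completing the square”. Let’s write this new Gaussian distribution first.

$$P(\mu \mid x) \propto \exp\!\left(-\frac{(\mu-\mu_{\text{new}})^2}{2\sigma_{\text{new}}^2}\right) = \exp\!\left(-\frac{1}{2}\left[\frac{1}{\sigma_{\text{new}}^2}\mu^2 - 2\,\frac{\mu_{\text{new}}}{\sigma_{\text{new}}^2}\mu + \frac{\mu_{\text{new}}^2}{\sigma_{\text{new}}^2}\right]\right)$$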
To calculate $\mu_{\text{new}}$ and $\sigma_{\text{new}}^2$, we compare our two Gaussian expressions and match the coefficients of $\mu^2$ and $\mu$. Here we go.
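Matching the coefficients of $\mu^2$, we get $\frac{1}{\sigma_{\text{new}}^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}$, that is, $\sigma_{\text{new}}^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}$.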
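Then, matching the coefficients of $\mu$, we get:

$$\frac{\mu_{\text{new}}}{\sigma_{\text{new}}^2} = \frac{x}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}$$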
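By combining this with what we got from matching $\mu^2$, namely $\sigma_{\text{new}}^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}$, we can obtain $\mu_{\text{new}}$. Here we go.

$$\mu_{\text{new}} = \sigma_{\text{new}}^2 \left(\frac{x}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\right) = \frac{\sigma_0^2\, x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2}$$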
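As a sanity check on the derivation, here is a minimal Python sketch that compares the closed-form posterior mean and variance with a brute-force computation: it multiplies likelihood and prior on a grid of $\mu$ values and normalizes numerically. The function names, the grid range, and the numbers used (`x = 2`, `sigma2 = 1`, `mu0 = 0`, `sigma0_2 = 4`) are illustrative assumptions, not part of the derivation.

```python
import math

def gaussian_pdf(value, mean, var):
    """Density of a Gaussian with the given mean and variance, evaluated at value."""
    return math.exp(-(value - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def closed_form_posterior(x, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of mu after one observation x (sigma2 is known)."""
    var_new = (sigma2 * sigma0_2) / (sigma2 + sigma0_2)
    mu_new = (sigma0_2 * x + sigma2 * mu0) / (sigma2 + sigma0_2)
    return mu_new, var_new

def grid_posterior(x, sigma2, mu0, sigma0_2, lo=-20.0, hi=20.0, n=40001):
    """Posterior mean and variance computed from likelihood * prior on a grid of mu."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Unnormalized posterior: likelihood of x given mu, times prior density of mu.
    weights = [gaussian_pdf(x, m, sigma2) * gaussian_pdf(m, mu0, sigma0_2) for m in grid]
    total = sum(weights)  # plays the role of the ignored marginal P(x)
    mean = sum(w * m for w, m in zip(weights, grid)) / total
    var = sum(w * (m - mean) ** 2 for w, m in zip(weights, grid)) / total
    return mean, var

# Illustrative values only (assumptions, chosen for this check).
x, sigma2, mu0, sigma0_2 = 2.0, 1.0, 0.0, 4.0
mu_cf, var_cf = closed_form_posterior(x, sigma2, mu0, sigma0_2)
mu_grid, var_grid = grid_posterior(x, sigma2, mu0, sigma0_2)
print("closed form:", mu_cf, var_cf)
print("grid       :", mu_grid, var_grid)
```

With these numbers the formulas give a posterior mean of 1.6 and a posterior variance of 0.8, and the grid-based values should agree closely. Note that the posterior mean lies between the prior mean and the observation, pulled toward whichever has the smaller variance.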
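Hooray! We have successfully derived the posterior $P(\mu \mid x)$, which is a new Gaussian distribution with mean $\mu_{\text{new}} = \frac{\sigma_0^2\, x + \sigma^2 \mu_0}{\sigma^2 + \sigma_0^2}$ and variance $\sigma_{\text{new}}^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}$. When new trial data arrives, we use this posterior as the next prior (that is, set $\mu_0 \leftarrow \mu_{\text{new}}$ and $\sigma_0^2 \leftarrow \sigma_{\text{new}}^2$) and update the parameters iteratively.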
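The iterative update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names `update` and `online_learning` are hypothetical, the data are simulated with a made-up true mean of 3.0, and the known variance and initial prior are chosen arbitrarily.

```python
import math
import random

def update(mu0, sigma0_2, x, sigma2):
    """One Bayesian update of the belief about mu, given one observation x.

    mu0, sigma0_2: mean and variance of the current prior over mu.
    sigma2: known variance of the Gaussian likelihood.
    """
    sigma_new_2 = 1.0 / (1.0 / sigma2 + 1.0 / sigma0_2)
    mu_new = sigma_new_2 * (x / sigma2 + mu0 / sigma0_2)
    return mu_new, sigma_new_2

def online_learning(data, mu0, sigma0_2, sigma2):
    """Feed observations one at a time; each posterior becomes the next prior."""
    mu, var = mu0, sigma0_2
    history = []
    for x in data:
        mu, var = update(mu, var, x, sigma2)
        history.append((mu, var))
    return history

# Simulated data (assumption): true mean 3.0, known variance 1.0.
random.seed(0)
true_mu, sigma2 = 3.0, 1.0
data = [random.gauss(true_mu, math.sqrt(sigma2)) for _ in range(200)]

# Deliberately vague initial prior (assumption): mean 0.0, variance 10.0.
history = online_learning(data, mu0=0.0, sigma0_2=10.0, sigma2=sigma2)
final_mu, final_var = history[-1]
print("posterior mean:", final_mu)
print("posterior variance:", final_var)
```

Each observation adds $1/\sigma^2$ to the posterior precision, so the posterior variance shrinks with every trial and the posterior mean moves toward the sample mean. Processing the data one at a time gives the same result as processing it all at once, which is what makes this update convenient for online learning.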
For other cases, such as when the known parameter is $\mu$ instead of $\sigma^2$ as in our derivation, or when neither parameter is known, the posterior can be derived with a similar process. This Wikipedia page has a nice summary table of conjugate distributions (pairs of likelihood and prior/posterior for online learning). Take a look.