# Understanding Online/Sequential Learning in Bayesian Inference

After understanding the Bernoulli, Binomial, and Beta distributions we discussed here, we are ready to understand online learning as used in Bayesian inference. From Bayes' theorem, which we discussed here, we have the equation below.

$P(B|A)=\frac{P(A|B)P(B)}{P(A)}$

And for multiple classes $c_1, \,c_2, ..., \,c_n$ with multiple attributes $\theta_1, \,\theta_2,...,\,\theta_n$, we can write it as follows.

$P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)P(c_i)}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}$

Using the rule of sum, with $m$ the number of classes, the denominator becomes:

$P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)P(c_i)}{\sum_{j=1}^{m}P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_j)P(c_j)}$,  for a discrete system

$P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)P(c_i)}{\int P(\theta_1,\,\theta_2,\,...,\,\theta_n|c)P(c)\,dc}$,  for a continuous system

Here, we say $P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)$ is the posterior probability, $P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)$ is the likelihood, $P(c_i)$ is the prior probability, and the denominator $\sum_{j=1}^{m}P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_j)P(c_j)$ is the evidence or marginal probability.
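As a quick numeric illustration of the discrete form above, here is a minimal Python sketch for two classes. The priors and likelihoods are made-up numbers, chosen only to show the mechanics of the calculation:

```python
# Hypothetical numbers: two classes with priors P(c_1), P(c_2) and
# likelihoods P(theta | c_i) for one observed attribute vector theta.
priors = [0.6, 0.4]          # P(c_1), P(c_2)
likelihoods = [0.2, 0.7]     # P(theta | c_1), P(theta | c_2)

# Evidence (marginal probability): sum_j P(theta | c_j) P(c_j)
evidence = sum(l * p for l, p in zip(likelihoods, priors))

# Posterior for each class via Bayes' theorem
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]
print(posteriors)            # the posteriors sum to 1
```

Note how the evidence is the same for every class: it only normalizes the numerators so the posteriors sum to one.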

In online learning, we update our prior probability each time we run new trials. For example, in the first stage we do some coin tossing and model the prior probability as $P_1(X)$. At this point, we use the prior $P_1(X)$ to estimate our posterior probability; call it $P_2(X)$. In the next stage of trials, we use $P_2(X)$ as our prior to estimate the next posterior, and we keep doing this every time we run more trials. That is why we call this online learning. Some references also call it sequential learning.

To follow the scenario above, we have to choose a “suitable” probability model for our prior. Why? Because if we choose a “wrong” model, we will have difficulty calculating $P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)P(c_i)}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}$: we may not get an analytic function, a.k.a. a closed-form expression, and would have to resort to numerical methods instead. A “suitable” prior model of this kind is called a conjugate prior.

In Bayesian inference, if the posterior distribution $P(c_i|\theta)$ is in the same family as the prior distribution $P(\theta)$, the prior and posterior are called conjugate distributions. The Binomial and Beta distributions that we already discussed here are conjugate distributions, where the Beta distribution is the conjugate prior. Furthermore, the conjugate prior of the Gaussian distribution (for its mean, with known variance) is Gaussian itself, so we call the Gaussian distribution self-conjugate. Now, let's try to prove that the conjugate prior of the Binomial distribution is the Beta distribution. We will keep using our coin-tossing example, with outcomes H (head) and T (tail), to make it easier to understand.

We have Binomial distribution written below.

$P(k; n,p) = \begin{bmatrix} n\\k\end{bmatrix}p^k(1-p)^{n-k}$,

where $k$ is the number of heads (H) in $n$ trials, and $P(H)=p$,

and the Beta distribution is written below.

$P(p; \alpha, \beta)=\frac{1}{B(\alpha, \beta)}p^{\alpha-1}(1-p)^{\beta-1}$, where $B(\alpha, \beta)$ is the Beta function; this describes the distribution of the probability $p$ itself.
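Both densities can be evaluated with the Python standard library alone, computing $B(\alpha,\beta)$ from Gamma functions via $B(\alpha,\beta)=\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$. A minimal sketch (the sample arguments are arbitrary):

```python
from math import comb, gamma

def binom_pmf(k, n, p):
    """Binomial pmf: P(k; n, p) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def beta_pdf(x, a, b):
    """Beta density, with B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    B = gamma(a) * gamma(b) / gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

print(binom_pmf(3, 10, 0.5))   # chance of 3 heads in 10 fair tosses
print(beta_pdf(0.5, 2, 2))     # Beta(2, 2) density at x = 0.5
```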

Next, we will solve the continuous form of Bayes' theorem, $P(p=x|h,t)=\frac{P(h,t|p=x)P(x)}{\int_{x=0}^{1}P(h,t|p=x)P(x)\,dx}$, where our prior probability is $P(p; \alpha, \beta)$ and our likelihood is $P(k; n,p)$. Let's say we run some new trials resulting in $h$ heads and $t$ tails. Our likelihood and prior probability then become:

Likelihood $\rightarrow \,P(h,t|p=x) = \begin{bmatrix} h+t\\h\end{bmatrix}x^h(1-x)^{t}$

Prior $\rightarrow \,P(x)=\frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta-1}$

By plugging these two equations into Bayes' formula, we can prove that the Beta distribution is the conjugate prior of the Binomial distribution. Here we go.

$P(p=x|h,t)=\frac{P(h,t|x)P(x)}{\int_{x=0}^{1}P(h,t|x)P(x)\,dx}$

$P(p=x|h,t)=\frac{\left(\begin{bmatrix} h+t\\h\end{bmatrix}x^h(1-x)^{t}\right)\left(x^{\alpha-1}(1-x)^{\beta-1}/B(\alpha, \beta)\right)}{\int_{x=0}^{1}\left(\begin{bmatrix} h+t\\h\end{bmatrix}x^h(1-x)^{t}\right)\left(x^{\alpha-1}(1-x)^{\beta-1}/B(\alpha, \beta)\right)dx}$

$P(p=x|h,t)=\frac{\begin{bmatrix} h+t\\h\end{bmatrix}x^{\alpha+h-1}(1-x)^{\beta+t-1}/B(\alpha, \beta)}{\int_{x=0}^{1}\begin{bmatrix} h+t\\h\end{bmatrix}x^{\alpha+h-1}(1-x)^{\beta+t-1}/B(\alpha, \beta)\,dx}$

$P(p=x|h,t)=\frac{x^{\alpha+h-1}(1-x)^{\beta+t-1}}{\int_{x=0}^{1}x^{\alpha+h-1}(1-x)^{\beta+t-1}\,dx}$

$\boxed{P(p=x|h,t)=\frac{x^{\alpha+h-1}(1-x)^{\beta+t-1}}{B(\alpha+h,\,\beta+t)}}\,,$

which is another Beta distribution, with parameters $(\alpha+h, \,\beta+t)$. This posterior distribution can then be used as the prior for more samples: updating it simply means adding the new head ($h$) and tail ($t$) counts to the parameters. If we do this again and again, our estimate becomes more and more accurate. Awesome, right? 😀 And since plugging the Binomial likelihood and Beta prior into Bayes' formula yields an analytic function, we can say that the Binomial and Beta distributions are conjugate distributions.
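The whole online-learning loop then reduces to bookkeeping on $(\alpha, \beta)$. A minimal sketch, with made-up batches of coin tosses:

```python
# Online learning with the Beta-Binomial conjugate pair:
# each batch of tosses updates (alpha, beta) by its head/tail counts.
def update(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing new tosses."""
    return alpha + heads, beta + tails

alpha, beta = 1, 1                   # uniform prior Beta(1, 1)
batches = [(7, 3), (6, 4), (8, 2)]   # (heads, tails) per stage, made-up data

for h, t in batches:
    alpha, beta = update(alpha, beta, h, t)
    mean = alpha / (alpha + beta)    # posterior mean estimate of P(H)
    print(f"Beta({alpha}, {beta}), posterior mean = {mean:.3f}")
```

Processing all batches at once or one at a time gives the same final posterior, which is exactly why sequential updating works here.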

## Multinomial distribution

We already know how online learning works for the binomial distribution. Let's expand to the multinomial distribution. The multinomial distribution describes the probability distribution of a trial having more than two outcomes, such as rolling a die, which gives us six possible results $\begin{Bmatrix}1,2,3,4,5,6\end{Bmatrix}$. The multinomial distribution is defined as follows.

$\boxed{{Mult(m_1, m_2,\,...,\,m_K|p,n)=\begin{bmatrix}n \\ m_1,m_2,...,m_K\end{bmatrix}\prod_{k=1}^{K}p_k^{m_k}}}$,

where $\begin{bmatrix}n \\ m_1,m_2,...,m_K\end{bmatrix}=\frac{n!}{m_1!m_2!...m_K!}$ and $\sum_{k=1}^{K}m_k=n$.
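A small sketch evaluating this pmf with only the standard library (the fair-die example is illustrative):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Mult(m_1..m_K | p, n) = n!/(m_1!...m_K!) * prod_k p_k^m_k."""
    n = sum(counts)
    coef = factorial(n)
    for m in counts:
        coef //= factorial(m)          # multinomial coefficient
    prob = 1.0
    for m, p in zip(counts, probs):
        prob *= p ** m                 # prod_k p_k^m_k
    return coef * prob

# Probability of seeing each face exactly once in 6 rolls of a fair die:
print(multinomial_pmf([1, 1, 1, 1, 1, 1], [1/6] * 6))
```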

The conjugate prior of the multinomial distribution is the Dirichlet distribution, defined as follows.

$Dir(p|\alpha)=\frac{\Gamma (\alpha_0)}{\Gamma (\alpha_1)\cdots\Gamma (\alpha_K)}\prod_{k=1}^{K}p_k^{\alpha_k-1}$, where $\alpha_0=\sum_{k=1}^{K}\alpha_k$.

We can calculate the posterior probability with a procedure similar to the Beta posterior calculation we did before. We will skip the details and just state the result.

$P(p|m,\alpha)=\frac{P(m|p)P(p|\alpha)}{P(m)}$

$P(p|m,\alpha)=\frac{\Gamma (\alpha_0+n)}{\Gamma (\alpha_1+m_1)\cdots\Gamma (\alpha_K+m_K)}\prod_{k=1}^{K}p_k^{\alpha_k+m_k-1}$

$\boxed{P(p|m,\alpha)=Dir(p|\alpha+m)}$

We get a posterior probability that is another Dirichlet distribution. To do online learning, we simply iterate, updating the parameters each time a new experiment gives us fresh results.
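As in the Beta-Binomial case, the update is pure bookkeeping: add the observed counts $m_k$ to the parameters $\alpha_k$. A minimal sketch with made-up die-roll counts:

```python
# Dirichlet-multinomial online update: the posterior parameters are
# alpha_k + m_k, so each batch of die rolls adds its face counts to
# the running parameter vector.
alpha = [1, 1, 1, 1, 1, 1]            # uniform Dirichlet prior over 6 faces

batches = [                           # made-up face counts per experiment
    [2, 1, 0, 3, 1, 3],
    [1, 2, 2, 1, 2, 2],
]

for m in batches:
    alpha = [a + c for a, c in zip(alpha, m)]

# Posterior mean estimate of each face probability: alpha_k / alpha_0
alpha0 = sum(alpha)
means = [a / alpha0 for a in alpha]
print(alpha)
print([round(x, 3) for x in means])
```

With six categories this reduces to the coin case when $K=2$, which is a good sanity check that the two conjugate updates really are the same idea.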