# Some Important Probability Distributions to Understand Online Learning in Bayesian Inference

Probability is an important tool in machine learning. In this article, we will discuss some important probability distributions for so-called online (sequential) learning, a powerful learning method in machine learning based on Bayesian inference. We will start with a very simple distribution: the Bernoulli distribution.

## (1) Bernoulli distribution

This is a very simple distribution that describes a single-trial experiment with a binary outcome, for example tossing a coin, which results in either head or tail. We can write the probability as follows: $P(X=1)=p, \quad P(X=0)=1-p=q$

where $X$ is a random variable mapping $head$ to $0$ and $tail$ to $1$.

The two equations above can be combined into one: $P(X=X_n)=p^{X_n} (1-p)^{1-X_n}$, where $X_n\in\{0,\,1\}$.
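As a quick sanity check, the combined formula can be evaluated directly. This is a minimal sketch; the value $p=0.6$ matches the coin example discussed below:

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

# With p = 0.6: P(X=1) = 0.6 and P(X=0) = 0.4
print(bernoulli_pmf(1, 0.6))
print(bernoulli_pmf(0, 0.6))
```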

An example of a Bernoulli distribution is shown below. Source: http://mathworld.wolfram.com

The distribution above means the chance of a coin toss resulting in head ($X=0$) is $0.4$ and the chance of tail ($X=1$) is $0.6$. Very simple, right?

Let us continue and derive the expected value (mean) and variance: $mean=E[X]=\sum_{i=1}^{2}X_iP(X_i)\\ E[X]=1\times p+0\times (1-p)=p\\ \boxed{E[X]=p}\\ variance=Var[X]=E[X^2]-E[X]^2=\sum_{i=1}^{2}X_i^2P(X_i)-p^2\\ Var[X]=1^2\times p+0^2 \times (1-p)-p^2=p-p^2=p(1-p)=pq\\ \boxed{Var[X]=pq}$
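We can check $E[X]=p$ and $Var[X]=pq$ empirically with a quick Monte Carlo simulation. This is a sketch using only Python's standard library; the sample size and seed are arbitrary choices:

```python
import random

random.seed(42)
p = 0.6
N = 200_000

# Draw N Bernoulli(p) samples
samples = [1 if random.random() < p else 0 for _ in range(N)]

mean = sum(samples) / N                          # should be close to p = 0.6
var = sum((x - mean) ** 2 for x in samples) / N  # should be close to p*(1-p) = 0.24
```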

Now, let's derive the MLE (Maximum Likelihood Estimate) for the Bernoulli distribution. First, some context to build intuition for what MLE does here. Say we run $N$ trials and observe the outcomes $D=\{1,\,1,\,0,\,\dots,\,1 \}$. MLE finds the parameter $p$ that maximizes the likelihood of observing exactly this $D$. Since the Bernoulli trials are independent and identically distributed (i.i.d.), the probability of $D$ is a joint probability that factorizes into the product of the per-trial probabilities. Thus, $P(D)$ becomes: $P(D)=\prod_{n=1}^{N}p^{X_n}(1-p)^{1-X_n}$

We will work in $\log$ form, because the logarithm turns the product into a sum, which makes taking the derivative simpler. In $\log$ form, the equation above becomes: $\log P(D)=\log\left(\prod_{n=1}^{N}p^{X_n}(1-p)^{1-X_n}\right)\\ \log P(D)=\sum_{n=1}^{N}\left(X_n\log(p)+(1-X_n)\log(1-p)\right)\\ \log P(D)=\sum_{n=1}^{N}X_n\log(p)+\sum_{n=1}^{N}(1-X_n)\log(1-p)$

We want to maximize $\log P(D)$ with respect to $p$, so we take the derivative w.r.t. $p$ and set it to zero to find the $p$ that maximizes $P(D)$. Here we go: $\underset{p}{argmax} \,[\log P(D)]\\ \frac{d[\log P(D)]}{dp}=\frac{1}{p}\sum_{n=1}^{N}X_n -\frac{1}{1-p}\sum_{n=1}^{N}(1-X_n)=0\\ \frac{1}{p}\sum_{n=1}^{N}X_n =\frac{1}{1-p}\sum_{n=1}^{N}(1-X_n)\\ \frac{\sum_{n=1}^{N}X_n}{p}=\frac{N-\sum_{n=1}^{N}X_n}{1-p}\\ (1-p)\sum_{n=1}^{N}X_n=p\left(N-\sum_{n=1}^{N}X_n\right)\\ \sum_{n=1}^{N}X_n-p\sum_{n=1}^{N}X_n=pN-p\sum_{n=1}^{N}X_n\\ \sum_{n=1}^{N}X_n=pN\\ \boxed{p=\frac{\sum_{n=1}^{N}X_n}{N}}$
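The closed-form result $\hat{p}=\frac{1}{N}\sum_{n}X_n$ can be verified numerically: the sample mean of the outcomes should coincide with a brute-force grid search over the log-likelihood. The data below is made up for illustration:

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical outcomes of N = 10 trials
p_mle = sum(data) / len(data)            # closed-form MLE: 7/10 = 0.7

def log_likelihood(p, data):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

# The log-likelihood is concave in p, so a fine grid recovers the same maximizer
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, data))
```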

## (2) Binomial distribution

The Binomial distribution is the general form of the Bernoulli distribution: each trial still has a binary outcome, but instead of doing 1 trial, here we do $n$ trials. If we have $P(X=1)=p, \,P(X=0)=1-p=q$, then the probability that $X=1$ appears $k$ times in $n$ trials is as follows. $P(k; n,p) = \binom{n}{k}p^k(1-p)^{n-k}$

where $\binom{n}{k}=\frac{n!}{(n-k)!k!}$.

Let's plot the discrete probability distribution (known as the pmf, probability mass function) using the equation above for several values of $p$ and $n$. The x-axis is the count $k$, and the y-axis is the probability $P(k; n,p)$. Source: en.wikipedia.org
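A direct implementation of the pmf is a one-liner with `math.comb` from the standard library; as a sanity check, it should sum to 1 over $k=0,\dots,n$ and reduce to the Bernoulli pmf when $n=1$ (the values $n=20$, $p=0.3$ are arbitrary):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k; n, p) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # should be ~1.0
single = binomial_pmf(1, 1, 0.6)                          # Bernoulli case: 0.6
```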

We can see that for large $n$, the shape of the distribution approaches a Gaussian. And for $n=1$, it reduces to the Bernoulli distribution we discussed before.

Let's derive the mean, variance, and MLE (Maximum Likelihood Estimate). The mean of the Binomial distribution can be derived as follows (the $k=0$ term vanishes, so the sum effectively starts at $k=1$, and we factor out $np$): $E[k]=\sum_{k=0}^{n}k\binom{n}{k}p^k(1-p)^{n-k}\\ E[k]=\sum_{k=1}^{n}k\,\frac{n!}{(n-k)!\,k!}\,p^k(1-p)^{(n-1)-(k-1)}\\ E[k]=\sum_{k=1}^{n}\frac{n(n-1)!}{((n-1)-(k-1))!\,(k-1)!}\,p\,p^{k-1}(1-p)^{(n-1)-(k-1)}\\ E[k]=np\sum_{k=1}^{n}\binom{n-1}{k-1}p^{k-1}(1-p)^{(n-1)-(k-1)}$

Using the binomial theorem $(x+y)^m=\sum_{j=0}^{m}\binom{m}{j}x^j y^{m-j}$ with $m=n-1$ and $j=k-1$, the remaining sum equals $(p+(1-p))^{n-1}=1$, so the equation above becomes: $E[k]=np(p+(1-p))^{n-1}=np(1)^{n-1}=np\\ \boxed{E[k]=np}$

Again, in this context, $k$ is the number of times head appears in $n$ coin tosses.

Ok, let’s continue to derive the variance.
From the Bernoulli derivation, we know that $Var(X_i)=pq$. Viewing the Binomial process as a sequence of Bernoulli trials, our $X=X_1+X_2+\dots+X_n$ with $X_1, X_2, \dots, X_n$ independently Bernoulli distributed. Since the variance of a sum of independent variables is the sum of their variances: $Var(X)=Var(X_1)+Var(X_2)+\dots+Var(X_n)=n\,Var(X_i)=npq = np(1-p)\\ \boxed{Var[X]=npq=np(1-p)}$
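Both results ($E[k]=np$ and $Var[k]=np(1-p)$) can be checked by summing directly over the exact pmf. A sketch, with $n=20$, $p=0.3$ chosen arbitrarily (so $np=6.0$ and $np(1-p)=4.2$):

```python
from math import comb

n, p = 20, 0.3
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

# E[k] = sum over k of k * P(k), Var[k] = sum over k of (k - E[k])^2 * P(k)
mean = sum(k * pk for k, pk in enumerate(pmf))               # np = 6.0
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # np(1-p) = 4.2
```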

For the MLE in the Binomial case, we expect to get $k$ occurrences of $X=1$ in $n$ trials. To maximize the likelihood of this outcome, we can proceed in the same way as we did for the Bernoulli distribution. Here we go. $\underset{p}{argmax}\, P(k; n, p)=\underset{p}{argmax}\, \binom{n}{k}p^k(1-p)^{n-k}$

Since $\binom{n}{k}$ does not depend on $p$, it contributes only a constant (an additive constant after taking the log), which drops out when we take the derivative. Thus: $\underset{p}{argmax}\, P(k; n, p)=\underset{p}{argmax}\, p^k(1-p)^{n-k}$

The objective above has exactly the same form as the Bernoulli MLE we derived before, with $k$ successes out of $n$ trials playing the role of $\sum_{n=1}^{N}X_n$ out of $N$. The $p$ that maximizes $P(k; n, p)$ is therefore $\boxed{p=\frac{k}{n}}$.
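As with the Bernoulli case, a grid search over the binomial log-likelihood confirms the closed form $\hat{p}=k/n$. The counts below are hypothetical:

```python
from math import comb, log

n, k = 50, 32                # hypothetical: 32 successes in 50 trials
p_hat = k / n                # closed-form MLE: 0.64

def log_lik(p):
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# Concave log-likelihood: a grid over (0, 1) recovers the same maximizer
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=log_lik)
```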

## (3) Beta distribution

The Beta distribution is a probability distribution parameterized by two positive shape parameters, denoted $\alpha$ and $\beta$. In machine learning, this is an important distribution because the Beta distribution is the conjugate prior for the Bernoulli and Binomial distributions, a concept used heavily in Bayesian inference. We will discuss those details later; first, we have to get to know the Beta distribution itself.

The mathematical formula for the Beta distribution, for $0\leq x \leq 1$ and shape parameters $\alpha, \, \beta > 0$, is written below. $P(x; \alpha, \beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\\ P(x; \alpha, \beta)=\frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta-1}$

where $\Gamma(x)$ is the gamma function, and $B(\alpha, \beta)=\int_{0}^{1}u^{\alpha-1}(1-u)^{\beta-1}du=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is a normalization constant that ensures the total probability integrates to 1.
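The density can be implemented directly with `math.gamma` from the standard library. A sketch, with $\alpha=2$, $\beta=5$ chosen arbitrarily; a simple Riemann sum over $(0, 1)$ confirms it integrates to approximately 1:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta density: Gamma(a+b) / (Gamma(a)*Gamma(b)) * x**(a-1) * (1-x)**(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

a, b = 2.0, 5.0
dx = 1e-4
# Riemann sum over the open interval (0, 1); should be close to 1.0
total = sum(beta_pdf(i * dx, a, b) * dx for i in range(1, 10_000))
```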

Why does the input only range over $0\leq x \leq 1$? Because the Beta distribution can be understood as a distribution over probabilities: it represents all the possible values of a probability when we don't know what that probability is. That's why its support is $[0, 1]$, just like probability values. Here is a plot of the Beta distribution for several values of $\alpha$ and $\beta$. Source: en.wikipedia.org