Probability is an important tool in machine learning. In this article, we discuss some probability distributions that matter for so-called online (sequential) learning, a powerful learning method in machine learning based on Bayesian inference. We will start with a very simple one, the Bernoulli distribution.

## (1) Bernoulli distribution

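This is a very simple distribution that describes a single-trial experiment with a binary outcome, for example tossing a coin that lands either heads or tails. We can write the probabilities as follows.

$$P(X = 1) = \mu$$

$$P(X = 0) = 1 - \mu$$

where $X$ is a random variable mapping the outcome to $\{0, 1\}$ (heads to $1$, tails to $0$) and $\mu \in [0, 1]$ is the probability of heads.

The two equations above can be written as one equation, $P(X = x) = \mu^{x}(1 - \mu)^{1 - x}$, where $x \in \{0, 1\}$.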

An example of a Bernoulli distribution is shown below.

The distribution above means that the chance of a coin toss resulting in heads ($x = 1$) is $\mu$ and the chance of it resulting in tails ($x = 0$) is $1 - \mu$. Very simple, right?

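Let’s continue by deriving the expected value (mean) and the variance.

$$E[X] = \sum_{x \in \{0, 1\}} x \, P(X = x) = 0 \cdot (1 - \mu) + 1 \cdot \mu = \mu$$

$$E[X^2] = 0^2 \cdot (1 - \mu) + 1^2 \cdot \mu = \mu$$

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \mu - \mu^2 = \mu(1 - \mu)$$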
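The mean and variance can be checked numerically with a minimal sketch. The value `mu = 0.6` and the helper names below are illustrative assumptions, not taken from the article.

```python
import random


def bernoulli_pmf(x, mu):
    """P(X = x) for x in {0, 1}, where mu is the probability of x = 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return mu ** x * (1 - mu) ** (1 - x)


def bernoulli_mean_var(mu):
    """Mean and variance computed directly from the pmf."""
    mean = sum(x * bernoulli_pmf(x, mu) for x in (0, 1))
    second_moment = sum(x ** 2 * bernoulli_pmf(x, mu) for x in (0, 1))
    return mean, second_moment - mean ** 2


def sample_bernoulli(mu, n, seed=0):
    """Draw n simulated coin tosses (1 = heads, 0 = tails)."""
    rng = random.Random(seed)
    return [1 if rng.random() < mu else 0 for _ in range(n)]


mu = 0.6  # hypothetical probability of heads
mean, var = bernoulli_mean_var(mu)
samples = sample_bernoulli(mu, 10_000)
sample_mean = sum(samples) / len(samples)

print("mean from pmf:", mean, "variance from pmf:", var)
print("mean of simulated tosses:", sample_mean)
```

The values computed from the pmf should match $\mu$ and $\mu(1 - \mu)$ up to floating-point error, while the simulated mean only approaches $\mu$ as the number of tosses grows.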

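Now, let’s derive the MLE *(Maximum Likelihood Estimation)* of the Bernoulli distribution. First, some context to build intuition for what MLE does here. Let’s say we run $N$ trials and observe the outcomes $D = \{x_1, \dots, x_N\}$. MLE finds the value of $\mu$ that maximizes the likelihood $P(D \mid \mu)$ of observing exactly those outcomes. Since the trials are independent and identically distributed (i.i.d.), $P(D \mid \mu)$ is a joint probability that can be calculated as the product of the probabilities of the individual trials. Thus, $P(D \mid \mu)$ becomes:

$$P(D \mid \mu) = \prod_{n=1}^{N} P(X = x_n) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n}$$

We transform it into $\ln$ (log) form, because there the multiplication becomes addition, which makes the equation simpler when we take the first derivative. In log form, the equation above becomes:

$$\ln P(D \mid \mu) = \sum_{n=1}^{N} \left[ x_n \ln \mu + (1 - x_n) \ln(1 - \mu) \right]$$

We want to maximize $\ln P(D \mid \mu)$ with respect to $\mu$, so we take the first derivative w.r.t. $\mu$ and set it to zero to find the $\mu$ that maximizes the likelihood. Maximizing the log is enough because $\ln$ is monotonically increasing. Here we go:

$$\frac{\partial}{\partial \mu} \ln P(D \mid \mu) = \sum_{n=1}^{N} \left[ \frac{x_n}{\mu} - \frac{1 - x_n}{1 - \mu} \right] = 0$$

$$(1 - \mu) \sum_{n=1}^{N} x_n = \mu \left( N - \sum_{n=1}^{N} x_n \right) \quad \Rightarrow \quad \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n$$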
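As a sanity check on this derivation, the sketch below compares the closed-form estimate (the fraction of trials with outcome 1) against a brute-force search over the log-likelihood. The simulated data, the "true" value `0.3`, and the grid resolution are assumptions chosen only for illustration.

```python
import math
import random


def bernoulli_log_likelihood(data, mu):
    """Log-likelihood of i.i.d. 0/1 observations for a given mu in (0, 1)."""
    ones = sum(data)
    zeros = len(data) - ones
    return ones * math.log(mu) + zeros * math.log(1 - mu)


def bernoulli_mle(data):
    """Closed-form maximizer: the fraction of trials with outcome 1."""
    return sum(data) / len(data)


rng = random.Random(42)
true_mu = 0.3  # hypothetical value used only to simulate data
data = [1 if rng.random() < true_mu else 0 for _ in range(1000)]

mu_ml = bernoulli_mle(data)

# Brute-force check: evaluate the log-likelihood on a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
mu_grid = max(grid, key=lambda mu: bernoulli_log_likelihood(data, mu))

print("closed-form estimate:", mu_ml)
print("grid-search estimate:", mu_grid)
```

With 1000 trials the closed-form estimate lies on the grid, so both approaches should agree.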

## (2) Binomial distribution

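The Binomial distribution generalizes the Bernoulli distribution: each trial still has a binary outcome, but instead of 1 trial we run $N$ trials. If we have $P(X = 1) = \mu$, then the probability that $x = 1$ appears $m$ times in a total of $N$ trials is as follows.

$$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^{m} (1 - \mu)^{N - m}$$

where $\binom{N}{m} = \frac{N!}{m! \, (N - m)!}$ is the number of ways to choose which $m$ of the $N$ trials give $x = 1$.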

Let’s plot this discrete probability distribution (known as the *pmf*, probability mass function) using the equation above for several values of $\mu$ and several numbers of trials $N$. The x-axis is the number of successes $m$, and the y-axis is the probability.

We can see that for many trials, the shape of the distribution resembles a Gaussian distribution. And for $N = 1$ trial, it reduces to the Bernoulli distribution we discussed before.
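The pmf values behind such a plot can be computed with a short sketch. The $(N, \mu)$ pairs below are arbitrary illustrative choices, not the ones used in the original plot, and `math.comb` requires Python 3.8 or newer.

```python
import math


def binomial_pmf(m, n, mu):
    """Probability of exactly m outcomes x = 1 in n trials."""
    return math.comb(n, m) * mu ** m * (1 - mu) ** (n - m)


def binomial_distribution(n, mu):
    """Full pmf as a list indexed by m = 0..n."""
    return [binomial_pmf(m, n, mu) for m in range(n + 1)]


# Hypothetical settings: one trial, a few trials, many trials.
for n, mu in [(1, 0.3), (10, 0.5), (50, 0.3)]:
    pmf = binomial_distribution(n, mu)
    mode = max(range(n + 1), key=lambda m: pmf[m])
    print(f"N={n}, mu={mu}: total={sum(pmf):.6f}, most likely m={mode}")
```

For $N = 1$ the list has only two entries, $1 - \mu$ and $\mu$, which is the Bernoulli distribution again. Passing each list to a bar-plot function of your choice would reproduce the kind of figure described above.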

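Let’s derive the mean, variance and MLE (Maximum Likelihood Estimation). The mean of the Binomial distribution can be derived as follows.

$$E[m] = \sum_{m=0}^{N} m \binom{N}{m} \mu^{m} (1 - \mu)^{N - m} = \sum_{m=1}^{N} \frac{N!}{(m - 1)! \, (N - m)!} \mu^{m} (1 - \mu)^{N - m} = N\mu \sum_{m=1}^{N} \binom{N - 1}{m - 1} \mu^{m - 1} (1 - \mu)^{(N - 1) - (m - 1)}$$

Using the Binomial theorem $(p + q)^{n} = \sum_{k=0}^{n} \binom{n}{k} p^{k} q^{n - k}$ with $k = m - 1$, $n = N - 1$, $p = \mu$ and $q = 1 - \mu$, we can rewrite the equation above as:

$$E[m] = N\mu \, \bigl(\mu + (1 - \mu)\bigr)^{N - 1} = N\mu$$

Again, in this case, $m$ is the number of times heads appears in $N$ coin tosses.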

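OK, let’s continue and derive the variance.

From the Bernoulli distribution, we know that $\mathrm{Var}[X] = \mu(1 - \mu)$. When we view a Binomial process as a sequence of Bernoulli trials, $m = \sum_{n=1}^{N} X_n$, where the $X_n$ with $n = 1, \dots, N$ are independently Bernoulli distributed. Because the variance of a sum of independent variables is the sum of their variances:

$$\mathrm{Var}[m] = \mathrm{Var}\left[ \sum_{n=1}^{N} X_n \right] = \sum_{n=1}^{N} \mathrm{Var}[X_n] = N\mu(1 - \mu)$$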
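Both results can be verified numerically by summing over the pmf, as in this small sketch. The values `n = 20` and `mu = 0.25` are arbitrary assumptions for illustration.

```python
import math


def binomial_pmf(m, n, mu):
    """Probability of exactly m outcomes x = 1 in n trials."""
    return math.comb(n, m) * mu ** m * (1 - mu) ** (n - m)


def binomial_moments(n, mu):
    """Mean and variance obtained by summing over the pmf."""
    mean = sum(m * binomial_pmf(m, n, mu) for m in range(n + 1))
    second_moment = sum(m * m * binomial_pmf(m, n, mu) for m in range(n + 1))
    return mean, second_moment - mean ** 2


n, mu = 20, 0.25  # hypothetical number of trials and success probability
mean, var = binomial_moments(n, mu)

print("mean from pmf:", mean, "vs N*mu:", n * mu)
print("variance from pmf:", var, "vs N*mu*(1-mu):", n * mu * (1 - mu))
```

The two columns should agree up to floating-point error.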

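For the MLE in the Binomial case, suppose that in $N$ trials we observe $x = 1$ a total of $m$ times. To maximize the likelihood of that observation, we proceed the same way as for the Bernoulli distribution. Here we go.

$$\ln \mathrm{Bin}(m \mid N, \mu) = \ln \binom{N}{m} + m \ln \mu + (N - m) \ln(1 - \mu)$$

Since $\ln \binom{N}{m}$ is constant with respect to $\mu$, it drops out when we take the first derivative. Thus:

$$\mu_{ML} = \arg\max_{\mu} \left[ m \ln \mu + (N - m) \ln(1 - \mu) \right]$$

This objective is exactly the same as the one in the Bernoulli MLE, with $m = \sum_{n=1}^{N} x_n$. So the $\mu$ that maximizes $\ln \mathrm{Bin}(m \mid N, \mu)$ is $\mu_{ML} = \frac{m}{N}$.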
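To illustrate why the constant term does not matter, the sketch below maximizes the log-likelihood with and without the binomial coefficient and compares both results with $m / N$. The counts `n = 40` and `m = 14` and the grid search are assumptions made for the example.

```python
import math


def binomial_log_likelihood(m, n, mu):
    """Log of the Binomial pmf, including the constant ln C(n, m)."""
    return math.log(math.comb(n, m)) + m * math.log(mu) + (n - m) * math.log(1 - mu)


def bernoulli_log_likelihood(m, n, mu):
    """Same objective without the constant term (m ones in n i.i.d. trials)."""
    return m * math.log(mu) + (n - m) * math.log(1 - mu)


n, m = 40, 14  # hypothetical: 14 heads observed in 40 tosses
grid = [i / 1000 for i in range(1, 1000)]

mu_binomial = max(grid, key=lambda mu: binomial_log_likelihood(m, n, mu))
mu_bernoulli = max(grid, key=lambda mu: bernoulli_log_likelihood(m, n, mu))

print("argmax with constant:   ", mu_binomial)
print("argmax without constant:", mu_bernoulli)
print("closed form m / N:      ", m / n)
```

The two objectives differ by the same amount at every grid point, so they peak at the same $\mu$.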

## (3) Beta distribution

The Beta distribution is a probability distribution parameterized by two positive shape parameters, denoted by $a$ and $b$. In machine learning it is important because the Beta distribution is the conjugate prior for the Bernoulli and Binomial distributions, a concept used in Bayesian inference. What are the details? We discuss them here, but first we have to get to know the Beta distribution itself.

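The mathematical formula for the Beta distribution, for $\mu \in [0, 1]$ and shape parameters $a > 0$ and $b > 0$, is written below.

$$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a) \, \Gamma(b)} \mu^{a - 1} (1 - \mu)^{b - 1}$$

where $\Gamma(\cdot)$ is the gamma function, and $\frac{\Gamma(a + b)}{\Gamma(a) \, \Gamma(b)}$ is a normalization constant that ensures the total probability integrates to 1.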

Why does the input range only over $[0, 1]$? Because the Beta distribution can be understood as a distribution *of probabilities*. It represents all the possible values of a probability when we don’t know what that probability is. That’s why its input ranges over $[0, 1]$, just like probability values. And here is a plot of the Beta distribution for some values of $a$ and $b$.
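The density values for such a plot, and the role of the normalization constant, can be sketched as follows. The shape parameters below are arbitrary examples rather than the ones from the original plot. As a simplification, the sketch evaluates the density only on the open interval $(0, 1)$ and integrates with a simple midpoint rule.

```python
import math


def beta_pdf(mu, a, b):
    """Beta density at mu for shape parameters a, b > 0.

    This sketch only evaluates the open interval (0, 1) and returns 0
    elsewhere, which sidesteps division by zero at the endpoints.
    """
    if not 0.0 < mu < 1.0:
        return 0.0
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)


def integrate_unit_interval(f, steps=20_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / steps
    return sum(f((i + 0.5) * h) for i in range(steps)) * h


# Hypothetical shape parameters: flat, skewed left, skewed right.
for a, b in [(1, 1), (2, 3), (8, 4)]:
    total = integrate_unit_interval(lambda mu: beta_pdf(mu, a, b))
    print(f"a={a}, b={b}: pdf(0.5)={beta_pdf(0.5, a, b):.4f}, integral={total:.6f}")
```

Each integral should come out very close to 1, which is what the gamma-function prefactor is there to guarantee.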