# Some Important Probability Distributions to Understand Online Learning in Bayesian Inference

Probability is an important tool in machine learning. In this article, we will discuss some important probability distributions for so-called online (sequential) learning, a powerful learning method in machine learning based on Bayesian inference. We will start with a very simple distribution: the Bernoulli distribution.

## (1) Bernoulli distribution

This is a very simple distribution that describes a single-trial experiment with a binary outcome, for example tossing a coin, which results in either head or tail. We can write the probability as follows: $P(X=1)=p, \quad P(X=0)=1-p=q$

where $X$ is a random variable mapping $head$ to $0$ and $tail$ to $1$.

The two equations above can be combined into one: $P(X=X_n)=p^{X_n} (1-p)^{1-X_n}$, where $X_n\in\{0,\,1\}$.
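As a quick sanity check, the combined formula can be evaluated directly. This is a minimal sketch; the value $p=0.6$ matches the coin example discussed below:

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

# With p = 0.6: P(X=1) = 0.6 and P(X=0) = 0.4
print(bernoulli_pmf(1, 0.6))
print(bernoulli_pmf(0, 0.6))
```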

An example of a Bernoulli distribution is shown below. Source: http://mathworld.wolfram.com

The distribution above means the chance of a coin toss resulting in head ($X=0$) is $0.4$ and the chance of tail ($X=1$) is $0.6$. Very simple, right?

Let us continue and derive the expected value (mean) and variance: $mean=E[X]=\sum_{i=1}^{2}X_iP(X_i)\\ E[X]=1\times p+0\times (1-p)=p\\ \boxed{E[X]=p}\\ variance=Var[X]=E[X^2]-E[X]^2=\sum_{i=1}^{2}X_i^2P(X_i)-p^2\\ Var[X]=1^2\times p+0^2 \times (1-p)-p^2=p-p^2=p(1-p)=pq\\ \boxed{Var[X]=pq}$
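We can check $E[X]=p$ and $Var[X]=pq$ empirically with a quick Monte Carlo simulation. This is a sketch using only Python's standard library; the sample size and seed are arbitrary choices:

```python
import random

random.seed(42)
p = 0.6
N = 200_000

# Draw N Bernoulli(p) samples
samples = [1 if random.random() < p else 0 for _ in range(N)]

mean = sum(samples) / N                          # should be close to p = 0.6
var = sum((x - mean) ** 2 for x in samples) / N  # should be close to p*(1-p) = 0.24
```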

Now, let's derive the MLE (Maximum Likelihood Estimate) for the Bernoulli distribution. First, some context to build intuition for what MLE does here. Say we run $N$ trials and observe the outcomes $D=\{1,\,1,\,0,\,\dots,\,1 \}$. MLE finds the parameter $p$ that maximizes the likelihood of observing exactly this $D$. Since the Bernoulli trials are independent and identically distributed (i.i.d.), the probability of $D$ is a joint probability that factorizes into the product of the per-trial probabilities. Thus, $P(D)$ becomes: $P(D)=\prod_{n=1}^{N}p^{X_n}(1-p)^{1-X_n}$

We will work in $\log$ form, because the logarithm turns the product into a sum, which makes taking the derivative simpler. In $\log$ form, the equation above becomes: $\log P(D)=\log\left(\prod_{n=1}^{N}p^{X_n}(1-p)^{1-X_n}\right)\\ \log P(D)=\sum_{n=1}^{N}\left(X_n\log(p)+(1-X_n)\log(1-p)\right)\\ \log P(D)=\sum_{n=1}^{N}X_n\log(p)+\sum_{n=1}^{N}(1-X_n)\log(1-p)$

We want to maximize $\log P(D)$ with respect to $p$, so we take the derivative w.r.t. $p$ and set it to zero to find the $p$ that maximizes $P(D)$. Here we go: $\underset{p}{argmax} \,[\log P(D)]\\ \frac{d[\log P(D)]}{dp}=\frac{1}{p}\sum_{n=1}^{N}X_n -\frac{1}{1-p}\sum_{n=1}^{N}(1-X_n)=0\\ \frac{1}{p}\sum_{n=1}^{N}X_n =\frac{1}{1-p}\sum_{n=1}^{N}(1-X_n)\\ \frac{\sum_{n=1}^{N}X_n}{p}=\frac{N-\sum_{n=1}^{N}X_n}{1-p}\\ (1-p)\sum_{n=1}^{N}X_n=p\left(N-\sum_{n=1}^{N}X_n\right)\\ \sum_{n=1}^{N}X_n-p\sum_{n=1}^{N}X_n=pN-p\sum_{n=1}^{N}X_n\\ \sum_{n=1}^{N}X_n=pN\\ \boxed{p=\frac{\sum_{n=1}^{N}X_n}{N}}$
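The closed-form result $\hat{p}=\frac{1}{N}\sum_{n}X_n$ can be verified numerically: the sample mean of the outcomes should coincide with a brute-force grid search over the log-likelihood. The data below is made up for illustration:

```python
import math

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical outcomes of N = 10 trials
p_mle = sum(data) / len(data)            # closed-form MLE: 7/10 = 0.7

def log_likelihood(p, data):
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

# The log-likelihood is concave in p, so a fine grid recovers the same maximizer
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: log_likelihood(p, data))
```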

## (2) Binomial distribution

The Binomial distribution is the general form of the Bernoulli distribution: each trial still has a binary outcome, but instead of doing 1 trial, here we do $n$ trials. If we have $P(X=1)=p, \,P(X=0)=1-p=q$, then the probability that $X=1$ appears $k$ times in $n$ trials is as follows. $P(k; n,p) = \binom{n}{k}p^k(1-p)^{n-k}$

where $\binom{n}{k}=\frac{n!}{(n-k)!k!}$.

Let's plot the discrete probability distribution (known as the pmf, probability mass function) using the equation above for several values of $p$ and $n$. The x-axis is the count $k$, and the y-axis is the probability $P(k; n,p)$. Source: en.wikipedia.org
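A direct implementation of the pmf is a one-liner with `math.comb` from the standard library; as a sanity check, it should sum to 1 over $k=0,\dots,n$ and reduce to the Bernoulli pmf when $n=1$ (the values $n=20$, $p=0.3$ are arbitrary):

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k; n, p) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 20, 0.3
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))  # should be ~1.0
single = binomial_pmf(1, 1, 0.6)                          # Bernoulli case: 0.6
```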

We can see that for large $n$, the shape of the distribution approaches a Gaussian. And for $n=1$, it reduces to the Bernoulli distribution we discussed before.

Let's derive the mean, variance, and MLE (Maximum Likelihood Estimate). The mean of the Binomial distribution can be derived as follows (the $k=0$ term vanishes, so the sum effectively starts at $k=1$, and we factor out $np$): $E[k]=\sum_{k=0}^{n}k\binom{n}{k}p^k(1-p)^{n-k}\\ E[k]=\sum_{k=1}^{n}k\,\frac{n!}{(n-k)!\,k!}\,p^k(1-p)^{(n-1)-(k-1)}\\ E[k]=\sum_{k=1}^{n}\frac{n(n-1)!}{((n-1)-(k-1))!\,(k-1)!}\,p\,p^{k-1}(1-p)^{(n-1)-(k-1)}\\ E[k]=np\sum_{k=1}^{n}\binom{n-1}{k-1}p^{k-1}(1-p)^{(n-1)-(k-1)}$

Using the binomial theorem $(x+y)^m=\sum_{j=0}^{m}\binom{m}{j}x^j y^{m-j}$ with $m=n-1$ and $j=k-1$, the remaining sum equals $(p+(1-p))^{n-1}=1$, so the equation above becomes: $E[k]=np(p+(1-p))^{n-1}=np(1)^{n-1}=np\\ \boxed{E[k]=np}$

Again, in this context, $k$ is the number of times head appears in $n$ coin tosses.

Ok, let’s continue to derive the variance.
From the Bernoulli derivation, we know that $Var(X_i)=pq$. Viewing the Binomial process as a sequence of Bernoulli trials, our $X=X_1+X_2+\dots+X_n$ with $X_1, X_2, \dots, X_n$ independently Bernoulli distributed. Since the variance of a sum of independent variables is the sum of their variances: $Var(X)=Var(X_1)+Var(X_2)+\dots+Var(X_n)=n\,Var(X_i)=npq = np(1-p)\\ \boxed{Var[X]=npq=np(1-p)}$
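Both results ($E[k]=np$ and $Var[k]=np(1-p)$) can be checked by summing directly over the exact pmf. A sketch, with $n=20$, $p=0.3$ chosen arbitrarily (so $np=6.0$ and $np(1-p)=4.2$):

```python
from math import comb

n, p = 20, 0.3
pmf = [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]

# E[k] = sum over k of k * P(k), Var[k] = sum over k of (k - E[k])^2 * P(k)
mean = sum(k * pk for k, pk in enumerate(pmf))               # np = 6.0
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # np(1-p) = 4.2
```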

For the MLE in the Binomial case, we expect to get $k$ occurrences of $X=1$ in $n$ trials. To maximize the likelihood of this outcome, we can proceed in the same way as we did for the Bernoulli distribution. Here we go. $\underset{p}{argmax}\, P(k; n, p)=\underset{p}{argmax}\, \binom{n}{k}p^k(1-p)^{n-k}$

Since $\binom{n}{k}$ does not depend on $p$, it contributes only a constant (an additive constant after taking the log), which drops out when we take the derivative. Thus: $\underset{p}{argmax}\, P(k; n, p)=\underset{p}{argmax}\, p^k(1-p)^{n-k}$

The objective above has exactly the same form as the Bernoulli MLE we derived before, with $k$ successes out of $n$ trials playing the role of $\sum_{n=1}^{N}X_n$ out of $N$. The $p$ that maximizes $P(k; n, p)$ is therefore $\boxed{p=\frac{k}{n}}$.
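As with the Bernoulli case, a grid search over the binomial log-likelihood confirms the closed form $\hat{p}=k/n$. The counts below are hypothetical:

```python
from math import comb, log

n, k = 50, 32                # hypothetical: 32 successes in 50 trials
p_hat = k / n                # closed-form MLE: 0.64

def log_lik(p):
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# Concave log-likelihood: a grid over (0, 1) recovers the same maximizer
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=log_lik)
```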

## (3) Beta distribution

The Beta distribution is a probability distribution parameterized by two positive shape parameters, denoted $\alpha$ and $\beta$. In machine learning, this is an important distribution because the Beta distribution is the conjugate prior for the Bernoulli and Binomial distributions, a concept used heavily in Bayesian inference. We will discuss those details later; first, we have to get to know the Beta distribution itself.

The mathematical formula for the Beta distribution, for $0\leq x \leq 1$ and shape parameters $\alpha, \, \beta > 0$, is written below. $P(x; \alpha, \beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\\ P(x; \alpha, \beta)=\frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta-1}$

where $\Gamma(x)$ is the gamma function, and $B(\alpha, \beta)=\int_{0}^{1}u^{\alpha-1}(1-u)^{\beta-1}du=\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is a normalization constant that ensures the total probability integrates to 1.
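The density can be implemented directly with `math.gamma` from the standard library. A sketch, with $\alpha=2$, $\beta=5$ chosen arbitrarily; a simple Riemann sum over $(0, 1)$ confirms it integrates to approximately 1:

```python
from math import gamma

def beta_pdf(x, a, b):
    """Beta density: Gamma(a+b) / (Gamma(a)*Gamma(b)) * x**(a-1) * (1-x)**(b-1)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

a, b = 2.0, 5.0
dx = 1e-4
# Riemann sum over the open interval (0, 1); should be close to 1.0
total = sum(beta_pdf(i * dx, a, b) * dx for i in range(1, 10_000))
```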

Why does the input only range over $0\leq x \leq 1$? Because the Beta distribution can be understood as a distribution over probabilities: it represents all the possible values of a probability when we don't know what that probability is. That's why its support is $[0, 1]$, just like probability values. Here is a plot of the Beta distribution for several values of $\alpha$ and $\beta$. Source: en.wikipedia.org