# Deriving Naive Bayes Classifier

Naive Bayes classifier is constructed under Bayesian framework. To understand Bayesian framework, we will start to discuss conditional probability first. See picture below.

Picture above is Venn Diagram for probability of event A and B which intersect each other. Probability A can be calculate by dividing $area \, A$ with $area \, S$ (universe, all rectangle area). Likewise, Probability B is $\, area \, B$ divided by $area \, A$. The intersection area is when both event A and event B occur together. The probability of it, event A and event B occur together, is called joint probability, denoted by $P(A\cap B)$$P(A\cap B)$ can be calculated by dividing $area \,(A\cap B)$ by $area \, S$.

Conditional probability is the probability of certain event when another event already occur, for example probability of event A when event B already occur. We can write probability of event A when event B already occur by $P(A|B)$. We can calculate $P(A|B)$ as follow.

$P(A|B)=\frac{P(A\cap B)}{P(B)}$

Formula above makes sense and is easy to remember. Probability of event A which event B already occur means given area B (green circle), thus how large area A in area B compared to area B? Yes, by dividing the intersection area with area B, exactly same with equation above. This formula is then known as Bayes’ theorem. We can also write in another form, that is probability of event B that event A already occur as follows.

$P(B|A)=\frac{P(B\cap A)}{P(A)}$

Since $(B\cap A)$ is the intersection between B and A, thus $(B\cap A)=(A\cap B)$. We can write combine two equations above as follows.

$P(B\cap A)=P(A\cap B)\\\\ P(B|A)P(A)=P(A|B)P(B)\\\\ P(B|A)=\frac{P(A|B)P(B)}{P(A)}$

Equation above is powerful equation is machine learning. Because, by knowing the probability of event A that event B already occurs, probability of event A and B, we can predict probability of event B given attribute A. Let put an example for clearer understanding. Let’s say in a village, Ronald is getting fever. We can predict the probability of Ronald is getting typhus disease. To do that, we have to collect some data in that village as follows.

• $P(fever| typhus) =$ for example for all people that are getting typhus disease, 98% of the are getting fever
• $P(fever) =$ for example in that village, 10% people are getting fever
• $P(typhus) =$ for example in that village, 5% people are getting fever

Thus, we can estimate the probability of Ronald getting typhus disease as follows.

$P(Ronald=typhus|fever)=\frac{P(fever|typhus)P(typhus)}{P(fever)}\\\\ P(Ronald=typhus|fever)=\frac{0.99 \times 0.05}{0.1}\\\\ P(Ronald=typhus|fever)=0.49$

That’s cool, right? For this case, this is still simple to make our prediction accurate. Because in this case, we only use one attribute to predict whether Ronald getting typhus or not, which is fever. For more accurate prediction, we need more attributes, such as fever or not, body temperature, cough or not, how long he has been getting fever, does someone in their family also ever get typhus, and so on. More related attribute we can collect, more accurate our prediction. Here is further Bayes’ formula for more attributes. Let attribute (such as fever, body temperature, and so on) denoted by $\theta_1, \, \theta_2, \,\theta_3, \,...\, ,\theta_n$ and class prediction (such as typhus, not typhus and so on) denoted by $c_1, \, c_2, \, c_3, \, ... \, ,c_n$. Here is our Bayes formula.

$P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)P(c_i)}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}$ ,

where $P(\theta_1,\,\theta_2,\,...,\,\theta_n)=P(\theta_1 \cap \theta_2 \cap ,\,...\,,\cap \theta_n)$

It will be difficult to collect data $P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)$ and $P(\theta_1,\,\theta_2,\,...,\,\theta_n)$  in real life. For example, probability when people is getting fever, also getting cough, also has body temp 38C, also has family member that ever gets typhus and so on. Thus, we will make an assumption to make equation above implementable in real life. We will use  conditional independent as our assumption. Before that, we will discuss what independent and conditional independent are.

Independent events means an event that is not effected by previous event, for example when we toss a coin, we get head in the first toss, and for the second toss, the result whether head or tail is not affected by the first toss. In mathematical formula, we can write independent probability as follows.

$P(A,B)=P(A)P(B)$

Thus, for conditional probability, the formula will be.

$P(A,B|C)=P(A|C)P(B|C)$

Under conditional independent assumption, our Bayes formula becomes.

$P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,...,\,\theta_n|c_i)P(c_i)}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}\\\\ P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{[P(\theta_1|c_i)P(\theta_2|c_i)P(\theta_3|c_i)...P(\theta_n|c_i)]P(c_i)}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}\\\\ P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)=\frac{P(c_i)[\prod_{m=1}^{n}P(\theta_m|c_i)]}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}$

Voila! We just successfully derived for Bayes formula for classifier with many attributes. We can just calculate $P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)$ for all $c_i$, and the class prediction is the $c_i$ with maximal value of $P(c_i|\theta_1,\,\theta_2,\,...,\,\theta_n)$ . This classifier then is called Naive Bayes. In formal from, we can write as follows.

$class=\underset{c_i}{argmax} \frac{P(c_i)[\prod_{m=1}^{n}P(\theta_m|c_i)]}{P(\theta_1,\,\theta_2,\,...,\,\theta_n)}$

And since the denominator for all $c_i$ are same, we can simplify as follows. And this is our final Naive Bayes classifier.

$\boxed{class=\underset{c_i}{argmax} P(c_i)[\prod_{m=1}^{n}P(\theta_m|c_i)]}$

*Notes :

If you are interested, for the denominator, we can still expand using chain rule as follows. But again, we will not use it in this case. Just if you are curious :p

$P(\theta_1,\,\theta_2,\,...,\,\theta_n)=P(\theta_1|\theta_2,\,...,\,\theta_n) P(\theta_2,\,...,\,\theta_n)\\\\ P(\theta_1,\,\theta_2,\,...,\,\theta_n)=P(\theta_1|\theta_2,\,...,\,\theta_n) P(\theta_2|\theta_3\,...,\,\theta_n)P(\theta_3\,...,\,\theta_n)\\\\ P(\theta_1,\,\theta_2,\,...,\,\theta_n)=P(\theta_1|\theta_2,\,...,\,\theta_n) P(\theta_2|\theta_3\,...,\,\theta_n)...P(\theta_{n-2}|\theta_{n-1}\,...,\,\theta_n)P(\theta_n)$

Using conditional property here, we can simplify become:

$P(\theta_1,\,\theta_2,\,...,\,\theta_n)=P(\theta_1|\theta_n) P(\theta_2|\theta_n)...P(\theta_{n-2}|\theta_n)P(\theta_n)$