# Some Important Information Theory Concepts in Machine Learning

We need to talk about some information theory concepts that are important in machine learning. For illustration, in polynomial regression using least squares estimation, which we already discussed here, we defined our error as the “squared error” $\sum_{i=1}^{m}(h(x_i)-y_i)^2$. From an information theory point of view, we can instead define objectives such as entropy, which is used in the learning process of decision trees. OK, let’s go through these important concepts one by one.

## (1) Unit of information

How do we measure information? Shannon proposed a well-known concept that is still used today, namely Shannon entropy. In short, entropy is how many information units (such as bits) are needed to represent a certain piece of information or event. In a more formal way, it can be defined as “the average amount of information produced by a stochastic source of data”.

For example, suppose we have a coin-toss event with $P(X=head)=1/2$ and $P(X=tail)=1/2$: how many units of information do we need to represent that event (random variable $X$)? We can measure it using Shannon entropy as follows. $\boxed{H(X)=-\sum_{i=1}^{n}P(x_i)\log_b(P(x_i))}$

Using base $b = 2$ for the logarithm, we get: $H(X)=-\frac{1}{2}\log_{2}(\frac{1}{2})-\frac{1}{2}\log_{2}(\frac{1}{2})=1$

Because we use $b = 2$, the same base as the binary system, the result above is in bits. So, to represent that coin-toss event, we need 1 bit. Next, what else does this number tell us? The table below gives some samples of events 1, 2, and 3, each with a ternary outcome and probabilities $P_1, P_2, P_3$ (the values are illustrative, with entropies computed from the formula above).

| Event | $P_1$ | $P_2$ | $P_3$ | $H(X)$ (bits) |
|-------|-------|-------|-------|----------------|
| 1 | $1/3$ | $1/3$ | $1/3$ | $1.585$ |
| 2 | $1/2$ | $1/4$ | $1/4$ | $1.5$ |
| 3 | $0.8$ | $0.1$ | $0.1$ | $0.922$ |

The table above shows that the distribution maximizing entropy is the uniform distribution, which is event 1. And the more skewed the probability values are, the lower the entropy, i.e., the fewer units (e.g., bits) we need on average to represent that event. This makes sense: if you know Huffman coding, event 3 will also have the smallest average code length.
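The entropies above can be computed with a short sketch in plain Python (base 2 gives the answer in bits):

```python
import math

def shannon_entropy(probs, base=2):
    """H(X) = -sum_i p_i * log_b(p_i), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([1/2, 1/2]))        # coin toss: 1.0 bit
print(shannon_entropy([1/3, 1/3, 1/3]))   # uniform ternary: ~1.585 bits
print(shannon_entropy([1/2, 1/4, 1/4]))   # 1.5 bits
print(shannon_entropy([0.8, 0.1, 0.1]))   # ~0.922 bits
```

The skewed distribution needs fewer bits on average, exactly as the Huffman-coding intuition suggests.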

## (2) Proving that the uniform distribution maximizes entropy

Let’s try to prove mathematically that the uniform distribution is the probability distribution that maximizes entropy. $H(X)=-\int_{a}^{b}P(x)\log(P(x))dx$ Is $H(X)$ maximal when $P(x)$ is the uniform distribution?

We could prove it using the calculus of variations, but that is relatively difficult. Another way is to use a Lagrange multiplier: in short, an optimization method for finding maxima/minima subject to equality constraints. In this case, our constraint is $\int_{a}^{b}P(x)dx=1$. Adding this constraint to the expression to be maximized gives the following, which we denote $J(P(x))$ to make it easier to refer to. $J(P(x)) = -\left[\int_{a}^{b}P(x)\ln(P(x))dx\right]+ \alpha\left(\int_{a}^{b}P(x)dx - 1\right)$

The second term is our equality constraint, and $\alpha$ is our Lagrange multiplier. Here we change $\log$ to $\ln$ to make the proof easier, since the base $b$ in the entropy is ours to choose. We take the first derivative w.r.t. $P(x)$ and set it equal to zero to find the extremum. Here we go. $-\left[\int_{a}^{b}P(x)\ln(P(x))dx\right]+\alpha\left(\int_{a}^{b}P(x)dx-1\right)\\\\ =-\left[\sum_{i=a}^{b}P(x_i)\ln(P(x_i))\right]+\alpha\left(\sum_{i=a}^{b}P(x_i)-1\right)$

We first change the integral to a sum (continuous to discrete), so that we can more easily keep the intuition during the proof. Proceeding, we get: $\Leftrightarrow\frac{d\left(-\left[\sum_{i=a}^{b}P(x_i)\ln(P(x_i))\right]+\alpha\left(\sum_{i=a}^{b}P(x_i)-1\right)\right)}{dP(x_j)}=0\,\,(i)\\\\ \Leftrightarrow-\sum_{i=a}^{b}\left[\frac{\partial P(x_i)}{\partial P(x_j)}\ln(P(x_i))+P(x_i)\frac{\partial \ln(P(x_i))}{\partial P(x_j)}\right]+\left[\alpha \sum_{i=a}^{b} \frac{\partial P(x_i)}{\partial P(x_j)}-\frac{\partial (1)}{\partial P(x_j)}\right]=0\,\,(ii)\\\\ \Leftrightarrow-\sum_{i=a}^{b}\left[1\cdot \ln(P(x_i))\delta_{ij}+P(x_i)\frac{1}{P(x_i)}\delta_{ij}\right]+\left[\alpha \sum_{i=a}^{b}1\cdot\delta_{ij}-0\right]=0\,\,(iii)\\\\ \Leftrightarrow-\left[\ln(P(x_j))+1\right]+\alpha=0\,\,(iv)\\\\ \Leftrightarrow \ln(P(x_j))=\alpha-1\\\\ \Leftrightarrow P(x_j)=e^{\alpha-1}$

From $(i)$ to $(ii)$, the ordinary derivative ($d$) becomes a partial derivative ($\partial$), since differentiating under the sum is a special case of the Leibniz rule. Also from $(i)$ to $(ii)$, in the first term we use the product rule $(uv)' = u'v+uv'$. In $(iii)$, the Kronecker delta $\delta_{ij}$ appears: thinking of the sigma as an expanded summation, we differentiate each term of that expansion, and the derivative is nonzero only when the index $i=j$. The Kronecker delta therefore collapses the summation, and we obtain $(iv)$ from $(iii)$.

And for the last step, to find what $P(x)$ is, we simply plug $P(x)$ back into our constraint. $\int_{a}^{b}P(x)dx = 1\\\\ \int_{a}^{b}e^{\alpha-1}dx = 1 \\\\ e^{\alpha-1}x\big|_a^b = 1\\\\ e^{\alpha-1}(b-a)=1\\\\ e^{\alpha-1}=\frac{1}{b-a} \implies \boxed{ P(x)=\frac{1}{b-a} }$

Voila! We have just proved that the $P(x)$ maximizing our $J(P(x))$ is the uniform distribution $\frac{1}{b-a}$.
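As a quick numerical sanity check (a sketch, not a substitute for the proof), we can sample many random discrete distributions and confirm that none exceeds the entropy of the uniform one, which equals $\ln(n)$:

```python
import math
import random

def entropy(probs):
    """Shannon entropy in nats: -sum_i p_i * ln(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n = 4
uniform_entropy = entropy([1 / n] * n)   # ln(4) ~ 1.386 nats

random.seed(0)
for _ in range(100_000):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    p = [wi / s for wi in w]             # a random distribution on n outcomes
    assert entropy(p) <= uniform_entropy + 1e-12

print(f"max entropy = {uniform_entropy:.4f} = ln({n})")
```

Every randomly drawn distribution comes in at or below the uniform one, matching the result of the Lagrange-multiplier derivation.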

## (3) Given an additional constraint, the mean, what is the distribution maximizing entropy?

Let’s add another constraint to our $J(P(x))$: the mean $\mu$, as follows. $\int_{0}^{\infty}xP(x)dx = \mu$

Our $J(P(x))$ becomes: $J(P(x))=-\int_{0}^{\infty}P(x)\ln(P(x))dx+\alpha_0 \left(\int_{0}^{\infty}P(x)dx-1\right)+\alpha_1\left(\int_{0}^{\infty}xP(x)dx-\mu\right)$

Again, we take the first derivative w.r.t. $P(x)$ and set it equal to zero to find the extremum. We are already familiar with functional differentiation (differentiating w.r.t. a function) from the previous proof, so we skip the details from now on. Here we go. $\frac{d[J(P(x))]}{dP(x)} = -(1+\ln(P(x)))+\alpha_0+\alpha_1x=0\\\\ \ln(P(x))=\alpha_0+\alpha_1x-1 \\\\ P(x)=e^{\alpha_0+\alpha_1 x-1}$

Next, we find $\alpha_0$ and $\alpha_1$ by plugging $P(x)$ back into our two constraints. Here we go. $\int_{0}^{\infty}P(x)dx = 1\\\\ \int_{0}^{\infty}e^{\alpha_0+\alpha_1 x-1} dx=1 \\\\ \int_{0}^{\infty}e^{\alpha_0-1}\cdot e^{\alpha_1x}dx = 1\\\\ e^{\alpha_0-1}\int_{0}^{\infty}e^{\alpha_1 x}dx = 1 \\\\ e^{\alpha_0-1}\left(\frac{1}{\alpha_1}e^{\alpha_1x}\Big|_0^\infty\right)=1\\\\ e^{\alpha_0-1}\left(\frac{1}{\alpha_1}e^{\alpha_1\cdot\infty}-\frac{1}{\alpha_1}e^{\alpha_1\cdot 0}\right)=1$

For the integral to converge, $\alpha_1$ must be negative. Thus: $e^{\alpha_0-1}\left(0-\frac{1}{\alpha_1}\right)=1\\\\ \boxed {\alpha_1=-e^{\alpha_0-1}}$

We proceed by plugging $P(x)$ into our second constraint. $\int_{0}^{\infty}xP(x)dx=\mu \\\\ \int_{0}^{\infty}xe^{\alpha_0+\alpha_1x-1}dx=\mu \\\\ e^{\alpha_0-1}\int_{0}^{\infty} xe^{\alpha_1x}dx = \mu$

We use integration by parts ($\int_{a}^{b} u\,dv = uv\big|_a^b - \int_{a}^{b} v\,du$) to solve the equation above. Let $u=x$ and $dv=e^{\alpha_1 x}dx$. Thus: $u=x \Leftrightarrow \frac{du}{dx}=1\\\\ v=\int e^{\alpha_1x}dx = \frac{1}{\alpha_1}e^{\alpha_1x}$

Plugging these back into our last equation, we get: $e^{\alpha_0-1}\left[x\frac{1}{\alpha_1}e^{\alpha_1x}\Big|_0^{\infty}-\int_{0}^{\infty} \frac{1}{\alpha_1}e^{\alpha_1x}dx\right]=\mu$

Again, the equation above converges only when $\alpha_1$ is negative, and as a consequence the first term vanishes. Let’s continue rewriting. $e^{\alpha_0-1}\left[0-\int_{0}^{\infty} \frac{1}{\alpha_1}e^{\alpha_1x}dx\right]=\mu\\\\ -e^{\alpha_0-1}\int_{0}^{\infty} \frac{1}{\alpha_1}e^{\alpha_1x}dx=\mu\\\\ -\frac{1}{\alpha_1}\int_{0}^{\infty}e^{\alpha_0+\alpha_1x-1}dx=\mu\\\\ -\frac{1}{\alpha_1}\int_{0}^{\infty}P(x)dx=\mu\\\\ -\frac{1}{\alpha_1}\cdot 1=\mu\\\\ \boxed {\alpha_1 = -\frac{1}{\mu}}$

Yes, we almost have it! Our $P(x) = e^{\alpha_0+\alpha_1x-1}=e^{\alpha_0-1}e^{\alpha_1 x}$, where $e^{\alpha_0-1}=-\alpha_1$ according to what we obtained before when solving constraint 1. Thus, our $P(x)$ is: $P(x)=e^{\alpha_0-1}e^{\alpha_1 x}=-\alpha_1 e^{\alpha_1 x}\\\\ \boxed{P(x)=\frac{1}{\mu}e^{-\frac{1}{\mu}x}}$

According to the final $P(x)$, we can say that given the additional mean constraint $\mu$, the distribution on $[0,\infty)$ that maximizes entropy is the exponential distribution.
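We can sanity-check this numerically with a small sketch (plain Python; the $60\mu$ cutoff approximates the infinite upper limit). The differential entropy of the exponential with mean $\mu$ is $1+\ln\mu$, and a competitor on $[0,\infty)$ with the same mean, here the uniform on $[0, 2\mu]$, has lower entropy:

```python
import math

def diff_entropy(pdf, a, b, n=200_000):
    """Numerical differential entropy -integral of p(x) ln p(x) dx (midpoint rule)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        p = pdf(x)
        if p > 0:
            total -= p * math.log(p) * h
    return total

mu = 2.0
exp_pdf = lambda x: (1 / mu) * math.exp(-x / mu)                  # mean mu
uni_pdf = lambda x: 1 / (2 * mu) if x <= 2 * mu else 0.0          # also mean mu

print(diff_entropy(exp_pdf, 0, 60 * mu))  # ~ 1 + ln(mu) ~ 1.693
print(diff_entropy(uni_pdf, 0, 2 * mu))   # ~ ln(2*mu)  ~ 1.386 (lower)
```

The uniform choice is arbitrary; any same-mean competitor on $[0,\infty)$ will land at or below the exponential's entropy.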

## (4) Given yet another constraint, the variance, what is the distribution maximizing entropy?

In the previous discussion, we proved that given the additional mean constraint, the exponential distribution maximizes entropy. Now, what if we are given yet another constraint, the variance? Here is our additional equality constraint. $\int x^2P(x)dx=\sigma^2$

We have already derived the form with two constraints (total probability $=1$ and the mean constraint); adding a third term for our variance constraint, we get the following. $P(x)=e^{-1+\alpha_0+\alpha_1x+\alpha_2x^2}$

To find $\alpha_0, \alpha_1, \alpha_2$, we can use the Jacobian and the polar coordinate system (as in the classic Gaussian integral). By doing so, we find that the distribution maximizing entropy in this case is the Gaussian distribution. That’s why the Gaussian is such a beautiful and important distribution. In a later discussion, about online learning, we will see that the conjugate of a Gaussian is another Gaussian (it is self-conjugate). Interesting, right?
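Although we skip the full derivation, the claim can be checked numerically (a sketch; the Laplace distribution is just one illustrative competitor with the same mean 0 and variance 1). The Gaussian attains differential entropy $\frac{1}{2}\ln(2\pi e\sigma^2)$, higher than other distributions with the same variance:

```python
import math

def diff_entropy(pdf, a, b, n=200_000):
    """Numerical differential entropy -integral of p(x) ln p(x) dx (midpoint rule)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * h
        p = pdf(x)
        if p > 0:
            total -= p * math.log(p) * h
    return total

gauss = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)   # variance 1
b = 1 / math.sqrt(2)                                              # Laplace scale for variance 1
laplace = lambda x: math.exp(-abs(x) / b) / (2 * b)

print(diff_entropy(gauss, -20, 20))    # ~ 0.5 * ln(2*pi*e) ~ 1.419
print(diff_entropy(laplace, -20, 20))  # ~ 1 + ln(2b)       ~ 1.347 (lower)
```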

## (5) Conditional entropy

We are already familiar with the term “conditional” from deriving Naive Bayes here. It is similar for entropy: conditional entropy is the amount of information needed to describe the outcome of a random variable given that the value of another random variable is known.

Given random variable $X$, the conditional entropy of $Y$ given $X$ is defined as the sum of $H(Y|X=x)$ over each possible $x$, using $P(x)$ as the weights. Let’s write it down and derive it. $H(Y|X) \equiv \sum_{x}P(x)H(Y|X=x)\,\,\,(i)\\\\ H(Y|X) = -\sum_{x}P(x)\sum_{y}P(y|x)\log P(y|x)\,\,\,(ii)\\\\ H(Y|X) = -\sum_{x}\sum_{y}P(x \cap y) \log P(y|x)\,\,\,(iii)\\\\ H(Y|X) = -\sum_{x, y} P(x\cap y)\log P(y|x)\,\,\,(iv)\\\\ H(Y|X) = -\sum_{x, y}P(x \cap y)\log\frac{P(x \cap y)}{P(x)}\,\,\,(v)\\\\ \boxed{H(Y|X) = \sum_{x, y}P(x \cap y)\log\frac{P(x)}{P(x \cap y)}}\,\,\,(vi)$

From $(i)$ to $(ii)$, we substitute the definition of entropy. From $(ii)$ to $(iii)$, we use the definition of conditional probability, $P(y|x)P(x)=P(x \cap y)$, and from $(iii)$ to $(iv)$ we simply merge the two sums. From $(iv)$ to $(v)$, we use the definition of conditional probability again, and $(vi)$ inverts the fraction to absorb the minus sign.
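Form $(iv)$ is the most convenient for computation: iterate over the joint distribution and weight $\log P(y|x)$ by $P(x \cap y)$. A minimal sketch, using a hypothetical joint distribution over weather and umbrella-carrying:

```python
import math

# Hypothetical joint distribution P(X, Y); values are illustrative only.
joint = {
    ("rain", "umbrella"): 0.3, ("rain", "no_umbrella"): 0.1,
    ("sun",  "umbrella"): 0.1, ("sun",  "no_umbrella"): 0.5,
}

def conditional_entropy(joint, base=2):
    """H(Y|X) = -sum_{x,y} P(x,y) * log P(y|x), with P(y|x) = P(x,y)/P(x)."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p                     # marginal P(x)
    return -sum(p * math.log(p / px[x], base)
                for (x, _), p in joint.items() if p > 0)

print(conditional_entropy(joint))  # ~ 0.715 bits
```

Knowing the weather leaves about 0.715 bits of uncertainty about the umbrella, less than the ~0.971 bits of the unconditioned marginal, as expected.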

## (6) Relative entropy

Relative entropy, also called the Kullback–Leibler divergence, is a measure of how one probability distribution diverges from another. The relative entropy of $P$ with respect to $Q$ is defined as follows. $KL(P\|Q)=\left[-\sum P(x)\log Q(x)\right]-\left[-\sum P(x)\log P(x)\right] = -\sum P(x)\log \frac{Q(x)}{P(x)}\\\\ \boxed{KL(P\|Q)=\sum P(x) \log \frac{P(x)}{Q(x)}}$

Note that $KL(P\|Q) \neq KL(Q\|P)$ in general, i.e., KL divergence is not symmetric. Thus, for implementations that need a symmetric measure, we can use: $D(P, Q) = \frac{KL(P\|Q) + KL(Q\|P)}{2}$
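A minimal sketch of both directions of the KL divergence and the symmetrized version above (the distributions `p` and `q` are arbitrary illustrative examples; note $Q$ must be nonzero wherever $P$ is):

```python
import math

def kl_divergence(p, q, base=2):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1/3, 1/3, 1/3]

print(kl_divergence(p, q))  # KL(P||Q)
print(kl_divergence(q, p))  # KL(Q||P): a different value, KL is not symmetric
sym = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
print(sym)                  # the symmetrized measure from the text
```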

## (7) Mutual information

Mutual information of two random variables measures the mutual dependence between those two variables. More specifically, it quantifies the “amount of information” (e.g., in bits) obtained about one random variable by observing the other random variable. Mathematically, it is defined as follows. $\boxed{I(X;Y) = H(X)-H(X|Y)}$
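Equivalently, $I(X;Y)=\sum_{x,y}P(x\cap y)\log\frac{P(x\cap y)}{P(x)P(y)}$, which is the KL divergence between the joint distribution and the product of its marginals. A small sketch (the joint distribution is an arbitrary illustrative example) that also checks the symmetry $I(X;Y)=I(Y;X)$:

```python
import math

def marginal(joint, axis):
    """Marginalize a joint distribution {(x, y): p} onto axis 0 (X) or 1 (Y)."""
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

def mutual_information(joint, base=2):
    """I(X;Y) = sum_{x,y} P(x,y) * log( P(x,y) / (P(x) * P(y)) )."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return sum(p * math.log(p / (px[x] * py[y]), base)
               for (x, y), p in joint.items() if p > 0)

# Hypothetical joint distribution of two binary variables.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5}
swapped = {(y, x): p for (x, y), p in joint.items()}

print(mutual_information(joint))    # ~ 0.256 bits
print(mutual_information(swapped))  # same value: I(X;Y) = I(Y;X)
```

For independent variables the joint factorizes as $P(x)P(y)$, the log term vanishes, and the mutual information is zero.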
