# How Logistic Regression Works for Classification (with Maximum Likelihood Estimation Derivation)

Logistic regression is an extension of regression method for classification. In the beginning of this machine learning series post, we already talked about regression using LSE here. To use regression approach for classification,we will feed the output regression $Y$ into so-called activation function, usually using sigmoid acivation function. See piture below.

Sigmoid function will have output with s-shape like picture above whose output range is from zero to one. For classification, logistic regression is originally intended for binary classification. Regarding picture above, our output regression $Y$ is fed sigmoid activation function. We will classify input to $class_1$ when the output is closed to 1 (formally when $output >0.5$) and classify to $class_2$ when the output is closed to 0 (formally when $output \leq 0.5$) To do that, we can achieve by maximizing out likelihood using MLE (Maximum Likelihood Estiamtion).

We know that the output of our activation function $f(\textbf{W}^T\textbf{x})$ is ranging from 0 to 1, and we also know that linear regression is originally intended for binary classification (two outcomes). Therefore, we can model our likelihood using product of Bernoulli distribution. If you are not really familiar with Bernoulli distribution, we already talked here and you can take a look of it. Given m-pair data training $D=\begin{Bmatrix} (\boldsymbol{x_1}, y_1),(\boldsymbol{x_2}, y_2),...,(\boldsymbol{x_m}, y_m)\end{Bmatrix}$, $\boldsymbol{x_i}$ here can be a vector such as $\boldsymbol{x_i}=\begin{Bmatrix}x_{i1},x_{i2},...,x_{in}\end{Bmatrix}$. We can think that our each data $\boldsymbol{x_i}$, we have $n-features$. And for $y_i$, it is only $0$ or $1$ (binary class). Thus, our likelihood is product of Bernoulli distribution defined as follows.

$Likelihood = \prod_{i=1}^{m}P^{Y_i}(1-P)^{(1-Y_i)}\\\\ Likelihood = \prod_{i=1}^{m}f(\boldsymbol{W^Tx_i})^{Y_i}(1-f(\boldsymbol{W^Tx_i}))^{(1-Y_i)}\\\\ Likelihood =\prod_{i=1}^{m}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})^{Y_i}(1-(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}))^{(1-Y_i)}\\\\ Likelihood =\prod_{i=1}^{m}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})^{Y_i}(\frac{1+e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}}-(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}))^{(1-Y_i)}\\\\ Likelihood =\prod_{i=1}^{m}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})^{Y_i}(\frac{e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}})^{(1-Y_i)}\\\\ Likelihood\,in\,log\,form = \sum_{i=1}^{m}Y_i \, ln(\frac{1}{1+e^{\boldsymbol{W^T}x_i}})+(1-Y_i)ln(\frac{e^{\boldsymbol{W^Tx_i}}}{1+e^{\boldsymbol{W^Tx_i}}})$

To find $\textbf{W}$ that maximizes likelihood, we can take the first differential of our likelihood w.r.t $\textbf{W}$, and make it equals to zero. As usual, to make it easier, we bring our likelihood to log form shown in the last line equation above.

$MLE = \underset{\textbf{W}}{argmax}\, [log\, likelihood]\\\\ \frac{\delta [log\,likelihood]}{\delta\textbf{W}}=\frac{\delta[\sum_{i=1}^{m}Y_i \, ln(\frac{1}{1+e^{-\boldsymbol{W^T}x_i}})+(1-Y_i)ln(\frac{e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}})]}{\delta\textbf{W}}=0\\\\ \sum_{i=1}^{m}Y_i\frac{\delta[\, ln(\frac{1}{1+e^{-\boldsymbol{W^T}x_i}})]}{\delta\textbf{W}}+(1-Y_i)\frac{\delta\, [ln(\frac{e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}})]}{\delta\textbf{W}}=0\\\\ \sum_{i=1}^{m}Y_i\frac{\delta[\, ln(1+e^{-\boldsymbol{W^T}x_i})^{-1}]}{\delta\textbf{W}}+(1-Y_i)\frac{\delta\, [ln(e^{-\boldsymbol{W^Tx_i}})-ln(1+e^{-\boldsymbol{W^Tx_i}})]}{\delta\textbf{W}}=0\\\\ \sum_{i=1}^{m}-Y_i\frac{\delta[\, ln(1+e^{-\boldsymbol{W^T}x_i})]}{\delta\textbf{W}}+(1-Y_i)\frac{\delta\, [ln(e^{-\boldsymbol{W^Tx_i}})-ln(1+e^{-\boldsymbol{W^Tx_i}})}{\delta\textbf{W}}=0\\\\ \sum_{i=1}^{m}-Y_i(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}e^{-\boldsymbol{W^Tx_i}}.(-\boldsymbol{x_i}))\\+(1-Y_i)(-\boldsymbol{x_i}-(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}e^{-\boldsymbol{W^Tx_i}}.(-\boldsymbol{x_i})))=0$

$\sum_{i=1}^{m}Y_i\boldsymbol{x_i}(\frac{e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}})+(1-Y_i)(-\boldsymbol{x_i}(1-(\frac{e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}})))=0\\\\ \sum_{i=1}^{m}Y_i\boldsymbol{x_i}(1-\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})-(1-Y_i)(\boldsymbol{x_i}(\frac{1+e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}}-(\frac{e^{-\boldsymbol{W^Tx_i}}}{1+e^{-\boldsymbol{W^Tx_i}}})))=0\\\\ \sum_{i=1}^{m}Y_i\boldsymbol{x_i}-Y_i\boldsymbol{x_i}\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}-(1-Y_i)(\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}))=0\\\\ \sum_{i=1}^{m}Y_i\boldsymbol{x_i}-Y_i\boldsymbol{x_i}\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}-\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})+Y_i\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})=0\\\\ \sum_{i=1}^{m}Y_i\boldsymbol{x_i}-\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})=0\\\\ \sum_{i=1}^{m}\boldsymbol{x_i}(Y_i-\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}})=0, \, or\,we\,can\,flip \\\\ \sum_{i=1}^{m}\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W^Tx_i}}}-Y_i)=0$

Whoops! We are stuck here. We cannot find analytical function of $\textbf{W}$ that maximizes our likelihood, unlike what we did in LSE regression here that we could derive analytical form of $\textbf{W}$ that maximizes our loss function.

But, we have other methods to achieve this, maximizing our likelihood. We can use (1) gradient descent we already talk here, or (2) newton method for optimization we already talk here. Let’s break down for clearer understanding.

## Using gradient descent to maximize likelihood in logistic regression

From our discussion here, we know that gradient descent formula is $x_{j+1}=x_j-\lambda \frac{df(x)}{dx}$. In our logistic regression above, it will be.

$x_{j+1}=x_j-\lambda \frac{df(x)}{dx}\\\\ \textbf{W}_{j+1}=\textbf{W}_j-\lambda \frac{d[likelihood]}{d\textbf{W}}\\\\ \boxed{\textbf{W}_{j+1}=\textbf{W}_j-\lambda \sum_{i=1}^{m}\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W_n^Tx_i}}}-Y_i)}$

We get the last line above since we already derive the first differential of our likelihood w.r.t $\textbf{W}$. Given m-pair data training $(x,y)$ and n-order polynomial basis, we can write in matrix notation as follows.

$\boxed{\textbf{W}_{j+1}=\textbf{W}_j-\lambda \textbf{X}^T(\boldsymbol{f(W^Tx)}-\textbf{Y})}$

where $\textbf{X}=\begin{bmatrix}1 &x_{11} & x_{12} &...&x_{1n}\\1 &x_{21} & x_{22} &...&x_{2n}\\\vdots &\vdots& \vdots &\ddots&\vdots\\1 &x_{m1} & x_{m2} &...&x_{mn}\end{bmatrix},\,\, \boldsymbol{f(W^Tx)} =\begin{bmatrix}\frac{1}{1+e^{-\boldsymbol{W_j^Tx_1}}}\\\\\frac{1}{1+e^{-\boldsymbol{W_j^Tx_2}}}\\\vdots\\\frac{1}{1+e^{-\boldsymbol{W_j^Tx_m}}}\end{bmatrix}$

$\boldsymbol{x_i} =\begin{bmatrix}1\\\\x_{i1}\\\vdots\\x_{in}\end{bmatrix}\,\,\, , \boldsymbol{Y} =\begin{bmatrix}Y_1\\\\Y_2\\\vdots\\Y_m\end{bmatrix},\,\,and\,\, \boldsymbol{W} =\begin{bmatrix}W_1\\\\W_2\\\vdots\\W_n\end{bmatrix}$.

And finally we can just this formula to train our logistic regression iteratively. In every iteration, we will try to maximize the likelihood until we get converged for our $\boldsymbol{W}$.

## Using newton method to maximize likelihood in logistic regression

From our discussion about newton method for optimization here, we know that the formula is $x_{n+1}=x_n-\frac{f'(x_n)}{f''(x_n)}$. It is for scalar form of $x_n$. When $\boldsymbol{x}_n$ is a matrix, we can write $\boldsymbol{x}_{n+1}=\boldsymbol{x}_n-(H)^{-1}f'(x_n)$, where $H$ is hessian matrix, that basically is second derivative of $f(x)$.

Using our newton method for formula above for optimization, we will try to derive the final formula for iteration update in our logistic regression case. Here we go.

$\boldsymbol{W}_{n+1}=\boldsymbol{W}_n-(H)^{-1}f'(x_n)\\\\ \boldsymbol{W}_{n+1}=\boldsymbol{W}_n-(\frac{\delta[\sum_{i=1}^{m}\boldsymbol{x_i}(\frac{1}{1+e^{-\boldsymbol{W_n^Tx_i}}}-Y_i)]}{\delta \textbf{W}})^{-1}f'(x_n)\\\\ \boldsymbol{W}_{n+1}=\boldsymbol{W}_n-((\frac{\delta[\sum_{i=1}^{m}(\frac{\boldsymbol{x_i}}{1+e^{-\boldsymbol{W_n^Tx_i}}})]}{\delta \textbf{W}}-\frac{\delta \boldsymbol{x_i}Y_i}{\delta \textbf{W}}))^{-1} f'(x_n)\\\\ \boldsymbol{W}_{n+1}=\boldsymbol{W}_n-((\frac{\delta[\sum_{i=1}^{m}(\frac{\boldsymbol{x_i}}{1+e^{-\boldsymbol{W_n^Tx_i}}})]}{\delta \textbf{W}}-0))^{-1} f'(x_n)\\\\ \boldsymbol{W}_{n+1}=\boldsymbol{W}_n-((\frac{\delta[\sum_{i=1}^{m}\boldsymbol{x_i}(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}))^{-1} f'(x_n)$

To solve the differential term $\frac{\delta[(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}$, we can use chain rule in differentiation. Let $u=1+e^{-\boldsymbol{W^Tx_i}}$, and $v=\boldsymbol{W^Tx_i}$.

$\frac{\delta[(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}=\frac{\delta u^{-1}}{\delta u}\frac{\delta u}{\delta v}\frac{\delta v}{\delta \textbf{W}}\\\\ \frac{\delta[(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}=-u^{-2}\frac{\delta [1+e^{-\boldsymbol{W^Tx_i}}]}{\delta [\boldsymbol{W^Tx_i}]}\frac{\delta [\boldsymbol{W^Tx_i}]}{\delta \textbf{W}}\\\\ \frac{\delta[(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}=-u^{-2}(-1)(e^{-\boldsymbol{W_n^Tx_i}})\boldsymbol{x}_i\\\\ \frac{\delta[(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}=\frac{e^{-\boldsymbol{W_n^Tx_i}}}{(1+e^{-\boldsymbol{W^Tx_i}})^2}\boldsymbol{x}_i$

Plugging back our $\frac{\delta[(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}$ result, we get.

$\boldsymbol{W}_{n+1}=\boldsymbol{W}_n-((\frac{\delta[\sum_{i=1}^{m}\boldsymbol{x_i}(1+e^{-\boldsymbol{W_n^Tx_i}})^{-1}]}{\delta \textbf{W}}))^{-1} f'(x_n)\\\\ \boldsymbol{W}_{n+1}=\boldsymbol{W}_n-(\sum_{i=1}^{m}[\boldsymbol{x_i}\frac{e^{-\boldsymbol{W_n^Tx_i}}}{(1+e^{-\boldsymbol{W^Tx_i}})^2}\boldsymbol{x}_i])^{-1} f'(x_n)\\\\ \boldsymbol{W}_{n+1}=\boldsymbol{W}_n-(\sum_{i=1}^{m}[\boldsymbol{x_i}\frac{e^{-\boldsymbol{W_n^Tx_i}}}{(1+e^{-\boldsymbol{W^Tx_i}})^2}\boldsymbol{x}_i^T])^{-1} f'(x_n)$

We need to take transpose of $\boldsymbol{x}_i$ in the last line of equation above since it’s square of vector. And we take the later of $\boldsymbol{x}_i$ to be transposed because we know that this term should produce $n\times n$ matrix.  Just like what we did in the gradient descent, we can write the matrix form of it and substitute $f'(x_n)=\textbf{X}^T(\boldsymbol{f(W^Tx)}-\textbf{Y})$, we get.

$\boldsymbol{W}_{n+1}=\boldsymbol{W}_n-(\sum_{i=1}^{m}[\boldsymbol{x_i}\frac{e^{-\boldsymbol{W_n^Tx_i}}}{(1+e^{-\boldsymbol{W^Tx_i}})^2}\boldsymbol{x}_i^T])^{-1} f'(x_n)\\\\ \boxed{\boldsymbol{W}_{n+1}=\boldsymbol{W}_n-(\boldsymbol{X}^T\boldsymbol{R}\boldsymbol{X})^{-1}\boldsymbol{X}^T(\boldsymbol{f(W_n^Tx)}-\textbf{Y})}$

where $\boldsymbol{R}$ is diagonal matrix $m$x$m$ of $\frac{e^{-\boldsymbol{W_n^Tx_i}}}{(1+e^{-\boldsymbol{W^Tx_i}})^2}$. We prefer use $\frac{1}{(1+e^{-\boldsymbol{W^Tx_i}})}(1-\frac{1}{(1+e^{-\boldsymbol{W^Tx_i}})})$ because it is away more numerically stable when implementing into code in our computer. Then, we can just use this formula to train our logistic regression iteratively. Note that since this method needs inversing of our Hessian matrix, sometimes we find that our Hessian matrix is singular/not invertible.