Bayesian Linear / Polynomial Regression #Part1: Prove LSE vs Bayesian Regression and Derive Posterior Update

In the beginning of our article series, we already talk about how to derive polynomial regression using LSE (Linear Square Estimation) here. During this post, we will try to discuss linear regression from Bayesian point of view. Note that linear and polynomial regression here are similar in derivation, the difference is only in design matrix. You may check again our couple previous articles here and here.

I let you know in the beginning that the final result of deriving regression using LSE is equal to the result of deriving linear regression using MLE (Maximal Likelihood Estimation) in Bayesian method. Furthermore, the result of deriving regression using LSE with regularization is equal to the result of deriving using MAP (Maximum A Posteriori) in Bayesian method. During this post, we will try to prove it. And we will proceed to derive the posterior update formula for online learning using Conjugate prior.


(1) Regression using LSE = MLE Bayesian?

See picture below.

Given data (green dots), we model Y_i in pink line using Gaussian distribution, and our regression prediction is the mean of it (\textbf{W}^T\boldsymbol{x_i}), depicted using red line. To do regression toward given data, we can make formula below.


In this case, we can think that \varepsilon is noise/bias error of our given data, so that we should consider it in our model. Thus, we will form a distribution (using Gaussian) for Y vertically like shown with the pink line in the picture above. It will have higher value when closer to the red line or given data, and smaller value when further to the read line or given data. And our goal is to maximize the probability of Y meaning that we encourage the prediction to fit the given data.

We will model distribution of Y_i with Gaussian distribution \mathcal{N}(\boldsymbol{W}^T\boldsymbol{x_i}, \sigma^2) \Leftrightarrow \mathcal{N}(\boldsymbol{W}^T\boldsymbol{x_i}, a^{-1}) (note that precision a=\frac{1}{variance\, \sigma^2}). We can think that the distribution mean is \boldsymbol{W}^T\boldsymbol{x_i}, which is located in the point whose Gaussian probability is maximal (remember: Gaussian distribution has max value in the mean axis). Thus, by maximizing the probability, we encourage our prediction to fit the given data. Note that the mean \boldsymbol{W}^T\boldsymbol{x_i} is dependent variable of \boldsymbol x_i, meaning that it will has its own value for different input x_i. Later, when we talk about “predictive distribution”, we will see that \boldsymbol{W}^T\boldsymbol{x_i} actually is our output prediction.  As for variance \sigma^2, we can think it is the variance of our bias in our data. To sum up, probability Y_i \rightarrow \mathcal{N}(\boldsymbol{W}^T\boldsymbol{x_i}, \sigma^2) in univariate Gaussian distribution that has different distribution in each certain point of axis x, where \boldsymbol{W}^T\boldsymbol{x_i} is the prediction of Y_i, and is a function of x.

Given m-pair data training D =\begin{Bmatrix} (x_1,y_1),(x_2,y_2),(x_3,y_3),...,(x_m,y_m)\end{Bmatrix}, we will do MLE to maximize our likelihood Y_i \rightarrow \mathcal{N}(\boldsymbol{W}^T\boldsymbol{x_i}, \sigma^2). Our likelihood will be:

likelihood \,P(D_i|W)=\prod_{i=1}^{m}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(Y_i-W^Tx_i)^2}\\\\  likelihood \,P(D_i|W)=(\frac{1}{\sqrt{2\pi}\sigma})^m \,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(Y_i-W^Tx_i)^2}

As usual, we will bring to log form.

ln\,P(D_i|W)=ln((\frac{1}{\sqrt{2\pi}\sigma})^m \,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(Y_i-W^Tx_i)^2})\\\\  ln\,P(D_i|W)=m\,ln(\frac{1}{\sqrt{2\pi}\sigma})+-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(Y_i-W^Tx_i)^2ln(e)\\\\  ln\,P(D_i|W)=m\,ln(\frac{1}{\sqrt{2\pi}\sigma})+-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(Y_i-W^Tx_i)^2

To maximize the likelihood, we will take first differential w.r.t W and make it equals to zero.

\frac{d[ln\, P(D_i|W)]}{dW}=0+\frac{d[-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(Y_i-W^Tx_i)^2]}{dW}=0\\\\  0=\frac{d[-\frac{1}{2\sigma^2}\sum_{i=1}^{m}(Y_i-W^Tx_i)^2]}{dW}

Do you feel similar the last equation form? Yes, it is similar with what we do in linear regression using LSE here. In the linear regression with LSE, we maximize our cost function J(a)=\frac{1}{2m}\sum_{i=1}^{m}(h(x_i)-Y_i)^2=\frac{1}{2m}\sum_{i=1}^{m}(a^Tx_i-Y_i)^2. When maximizing the function, we will make the  first differential equals to zero. Thus, the constant will disappear in this case, whereas our W corresponds to a. Finally, we can prove these two equations to be maximized are same.

\sum_{i=1}^{m}(Y_i-W^Tx_i)^2 = \sum_{i=1}^{m}(a^Tx_i-Y_i)^2

Up to this point, we successfully prove that linear regression using LSE equals to MLE in Bayesian method.


(2) Regression using LSE with Regularization= MAP Bayesian?

Let’s try to do MAP Bayesian maximizing the posterior probability in linear regression. Given m-pair data training D =\begin{Bmatrix} (x_1,y_1),(x_2,y_2),(x_3,y_3),...,(x_m,y_m)\end{Bmatrix}, we will try to estimate our regression parameter W using Bayesian formula.

Posterior = \frac{Likelihood\times Prior}{Marginal} \rightarrow P(W|D)=\frac{P(D|W)P(W)}{P(D)}\\\\  Likelihood \rightarrow P(D_i|W) \rightarrow \mathcal{N}(\boldsymbol{W}^T\boldsymbol{x_i}, a^{-1})\\\\  Prior \rightarrow \mathcal{N}(\boldsymbol{0},\boldsymbol{b^{-1}})

For likelihood, again, it is univariate Gaussian just like when we did in MLE above.  As for prior, it is multivariate Gaussian, and we will use isotropic form (covariance matrix = scalar x identity matrix). We use precision scalar a and precision matrix \boldsymbol{b} (precision is equal to \frac{1}{variance}) instead of variance or covariance matrix to make the derivation easier. Our prior in this case is our prior knowledge of how \boldsymbol{W} is distributed. Ok, let’s proceed to the derivation. We will ignore marginal distribution since it’s only constant normalization.

P(W|D)=\frac{P(D|W)P(W)}{P(D)}\\\\  P(W|D) \propto (\frac{1}{\sqrt{2\pi}a^{-1}})^m e^{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2}\frac{1}{(\sqrt{2\pi})^m|b^{-1}|^{\frac{1}{2}}}e^{(W_i-0)^T(-\frac{\boldsymbol{b}}{2})(W_i-0)}

Again, we will bring to log form.

ln [P(W|D)] \propto ln[(\frac{1}{\sqrt{2\pi}a^{-1}})^m e^{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2}\frac{1}{(\sqrt{2\pi})^m|b^{-1}|^{\frac{1}{2}}}e^{(W_i-0)^T(-\frac{\boldsymbol{b}}{2})(W_i-0)}]\\\\  ln [P(W|D)] \propto m\,ln[(\frac{1}{\sqrt{2\pi}a^{-1}})+{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2}\,ln( e)\\+ln[\frac{1}{(\sqrt{2\pi})^m|b^{-1}|^{\frac{1}{2}}}]+{W_i^T(-\frac{\boldsymbol{b}}{2})W_i}\,ln(e)\\\\  ln [P(W|D)] \propto m\,ln[(\frac{1}{\sqrt{2\pi}a^{-1}})+{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2}\\+ln[\frac{1}{(\sqrt{2\pi})^m|b^{-1}|^{\frac{1}{2}}}]+{W_i^T(-\frac{\boldsymbol{b}}{2})W_i}

To get the regression parameter W that maximizes posterior probability, we will take first differential w.r.t W and make it equals to zero. Here we go.

\frac{d(ln [P(W|D)])}{dW} \propto 0+\frac{d[{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2}]}{dW}+0+\frac{d[{{W_i^T(-\frac{\boldsymbol{b}}{2})W_i}}]}{dW}=0\\\\  0 = \frac{d[{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2}]}{dW}+\frac{d[{{W_i^T(-\frac{\boldsymbol{b}}{2})W_i}}]}{dW}

The last line of equation above is similar with our cost function in regression using LSE with regularization here, which is J(\mathbf{a})=\frac{1}{2m}[\sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2] . Ok, let’s break down for more clear comparison.

{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2} + {{W_i^T(-\frac{\boldsymbol{b}}{2})W_i}} = \frac{1}{2m}[\sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2]

Divide both side by -\frac{a}{2}, we get:

{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2} + {{W_i^T(-\frac{\boldsymbol{b}}{2})W_i}} =?\, \frac{1}{2m}[\sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2]\\\\  {\sum_{i=1}^{m}(W^Tx_i-Y_i)^2} + {{W_i^T(\frac{\boldsymbol{-b}/2}{-a/2})W_i}} =?\, \frac{1/2m}{-a/2}[\sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2]\\\\  {\sum_{i=1}^{m}(W^Tx_i-Y_i)^2} +{W_i^T(\frac{\boldsymbol{b}}{a})W_i} =?\, \frac{1}{-am}[\sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2]

Since when maximizing we take first differential and make it equals to zero, thus, the constant value \frac{1}{-am} will disappear. And finally, we have:

{\sum_{i=1}^{m}(W^Tx_i-Y_i)^2} +{W_i^T(\frac{\boldsymbol{b}}{a})W_i} = \sum_{i=1}^{m}(h(x_i)-y_i)^2)+\lambda\sum_{j=1}^{n}a_j^2

where h(x_i)=a^Tx, and W in this case corresponds to a, which is the regression parameter.  W_i^TW_i is just the matrix form of  \sum_{i=1}^{n}W_i^2, and \frac{\textbf{b}}{a} corresponds to \lambda, which our regularization weighting value. Thus, we prove that regression using LSE with regularization is equal to MAP in Bayesian regression.


(3) Deriving posterior probability for Bayesian regression

After we can prove two things above, let’s proceed to derive the posterior P(W|D) for online learning. The last result, we have:

P(W|D) \propto e^{-\frac{a}{2}\sum_{i=1}^{m}(W^Tx_i-Y_i)^2 + {W_i^T(-\frac{\boldsymbol{b}}{2})W_i}}

For one pair input data in every iteration in online learning, we will have single likelihood probability. Or in another word, the summation in the likelihood above will disappear. In this case, our posterior becomes as follows.

P(W|D) \propto e^{-\frac{a}{2}(W^Tx_i-Y_i)^2 + {W_i^T(-\frac{\boldsymbol{b}}{2})W_i}}

To get update parameters mean and variance, we can do “completing the square” technique like what we already did in deriving Gaussian online learning here, since product of two Gaussian will produce another new Gaussian. Let our new Gaussian \mathcal{N}(\mu_{new}, \Lambda{new}^{-1}).

P(W|D)_{new} \propto e^{{(W_i-\mu_{new})^T(-\frac{\boldsymbol{\Lambda_{new}}}{2})(W_i-\mu_{new})}}

Then, let’s do “completing the square” for the power term of both equations above by collecting those terms regarding coefficient of W. Here we go.

-\frac{a}{2}(W^Tx-Y)^2 + {W_i^T(-\frac{\boldsymbol{b}}{2})W_i} = {(W-\mu_{new})^T(-\frac{\boldsymbol{\Lambda_{new}}}{2})(W-\mu_{new})}\\\\  a(W^Tx-Y)^2 + {W^T(\boldsymbol{b})W} = {(W-\mu_{new})^T(\boldsymbol{\Lambda_{new}})(W-\mu_{new})}\\\\  a(x^TWW^Tx-x^TWY-Y^TW^Tx+Y^TY)+W^T\textbf{b}W=(W^T\Lambda_{new}-\mu_{new}^T\Lambda_{new})(W-\mu_{new})\\\\  aW^Txx^TW-a2W^TxY+aY^TY+W^T\textbf{b}W=W^T\Lambda_{new} W-W^T\Lambda_{new} \mu - \mu_{new}^T\Lambda_{new} W + \mu^T \Lambda_{new} \mu_{new}\\\\  aW^Txx^TW+W^T\textbf{b}W-a2W^TxY+aY^TY=W^T\Lambda_{new} W-2W^T\Lambda_{new} \mu_{new} + \mu_{new}^T \Lambda_{new} \mu_{new}\\\\  W^T(axx^T+\textbf{b})W-a2W^TxY+aY^TY=W^T\Lambda_{new} W-2W^T\Lambda_{new} \mu_{new} + \mu_{new}^T \Lambda_{new} \mu_{new}

For equation above, in W^TW coefficient, we can find our new precision matrix \Lambda_{new}.

\boxed{ \Lambda_{new} = axx^T+\textbf{b}}

From our coefficient W, we can derive \mu_{new}.

a2xY=2\Lambda_{new} \mu_{new}\\\\  axY=\Lambda_{new}\mu_{new}\\\\  \mu_{new}=\Lambda_{new}^{-1}axY\\\\  \boxed{\mu_{new}=a\Lambda_{new}^{-1}xY}

For the clarity, let’s put the matrix form each of them. Just for example, when the polynomial order = 3. Thus:

a = scalar,\, which\,is\,our\,noise\,variance\\\\  x=design\, matrix=\begin{bmatrix}x_i^0\\x_i^1\\x_i^2\end{bmatrix}=\begin{bmatrix}1\\x_i\\x_i^2\end{bmatrix}\\\\  \textbf{b}=prior\,precision\,matrix=b\textbf{I}=\begin{bmatrix}b&0&0\\0&b&0\\0&0&b\end{bmatrix}\\\\  Y=scalar\,data\,Y_i\\\\  \Lambda_{new}=matrix\,3\times 3=\begin{bmatrix}\checkmark &\checkmark &\checkmark\\\checkmark &\checkmark &\checkmark\\\checkmark &\checkmark &\checkmark\end{bmatrix}\\\\  \mu_{new}=matrix\,3\times 1=\begin{bmatrix}\mu_1\\\mu_2\\\mu_3\end{bmatrix}=\begin{bmatrix}E[W_1]\\E[W_2]\\E[W_3]\end{bmatrix}

Form above is for one pair data (x_i,Y_i). For m-pairs data, we can use design matrix \textbf{x} =\begin{bmatrix}1&x_1&x_1^2\\1&x_2&x_2^2\\...&...&...\\1&x_m&x_m^2\\\end{bmatrix} with \textbf{Y}=\begin{bmatrix}Y_1\\Y_2\\...\\Y_m\end{bmatrix}. Thus, the updating parameters become:

\boxed{ \Lambda_{new} = a\boldsymbol{x^Tx}+\textbf{b}}


Congratulation! Up to this point, we already get the parameters update of new mean and new precision matrix. By doing all derivations in this post, we also get the intuition how Bayesian method works, which is an important and potential framework in machine learning. FYI, some scientists are proud and believe that Bayesian framework will become powerful framework in machine learning that nowadays is dominated by deep learning. Maybe couple years later Bayesian will beat and replace deep learning? Or complement it? Let’s see 🙂


(4) Frequentist vs Bayesian

In probability, generally, there are two types of reasoning approaches : frequentist and Bayesian. Frequentist has reasoning based on likelihood by doing MLE (Maximum Likelihood Estimation), whereas Bayesian has reasoning based on posterior by doing MAP (Maximum A Posteriori). Re-write again Bayesian formula, we can clearly see the difference mathematically.

Posterior = \frac{Likelihood \times Prior}{Marginal} \rightarrow P(class|params)=\frac{P(params|class)P(class)}{P(params)}

Frequentist only maximizes likelihood P(params|class) to make prediction, whereas Bayesian maximizes posterior P(class|params) which uses prior knowledge P(class). A simple example, if we have tossing coin : sequence 1 = \begin{Bmatrix}{H,T,H,T,T}\end{Bmatrix}, sequence 2 = \begin{Bmatrix}{H,T,H,H}\end{Bmatrix}, frequentist will argue that probability of T (tails) in sequence 2 =\frac{1}{4}, which only consider in that sequence, without considering prior (previous trials). Whereas, Bayesian will argue that the probability of T in sequence 2 =\frac{4}{9}, which also consider prior knowledge (previous trials).

In regression, frequentist will suffer overfitting. Because like we already prove above, frequentist which only use likelihood is equal to LSE regression without regularization. Whereas, Bayesian is less prone to be overvitting because it use prior knowledge, or in LSE regression point of view it uses regularization to avoid overfitting.

3 thoughts on “Bayesian Linear / Polynomial Regression #Part1: Prove LSE vs Bayesian Regression and Derive Posterior Update

  1. Pingback: Tina Foote
  2. Thank you very much for the post. I have a question regarding the distribution of Y. How can we prove that Y is normally distributed? Because Y is not directly dependent on “e” and “x” is random and “Y” is directly dependent on “x” which might not be normally distributed.

    Could you please explain?

    Sincere regards,
    Ganesh S K

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s