Bayesian Linear / Polynomial Regression #Part2: Deriving Predictive Distribution

We already derived the posterior update formula P(W|D) for Bayesian regression here (#Part1); it gives the distribution of our regression parameters \textbf{W} given the data set D. However, we are not directly interested in the value of W itself; we are interested in the value of Y given a new input x. This is exactly the regression problem: given a new value x, we want to predict the continuous output value Y. We already solved the linear regression problem using LSE (Least Squares Error) here. In this post, we will do regression from the Bayesian point of view. Using the Bayesian approach for regression gives us an additional benefit, which we will see at the end of this post.

From #Part1 here, we already have P(W|D). To do regression from the Bayesian point of view, we have to derive the predictive distribution, so that we end up with a probability over Y, P(Y|\theta). We can achieve that by marginalization. Here we go.

P(Y|\theta)=\int P(Y|W)P(W|\theta)dW

where P(Y|W) is the likelihood and P(W|\theta) is the posterior we derived here in #Part1.

The equation above combines the sum rule (the term used in the Bishop textbook; also called the law of total probability) with the product rule: P(Y|\theta)=\int P(Y,W|\theta)\,dW=\int P(Y|W,\theta)P(W|\theta)\,dW. The parameter \theta of our likelihood is the noise variance a^{-1}, and the parameters of our posterior P(W|\theta) are \boldsymbol{\mu} and \boldsymbol{\Lambda^{-1}}. Let's write them into our equation first.

P(Y|\theta)=\int P(Y|W)P(W|\theta)\,dW\\\\ P(Y|\theta) = \int P(Y|\textbf{W},a^{-1})\,P(\textbf{W}|\boldsymbol{\mu,\Lambda^{-1}})\,dW\\\\ P(Y|\theta) = \int \mathcal{N}(Y|\boldsymbol{W^Tx}, a^{-1})\,\mathcal{N}(\boldsymbol{W}|\boldsymbol{\mu, \Lambda^{-1}})\,dW
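Before grinding through the algebra, it may help to see this marginalization numerically. Below is a minimal NumPy sketch for a toy one-dimensional weight w; all the numbers in it (a, x, mu, lam) are assumed toy values, not values from the posts. It evaluates \int P(Y|W)P(W|\theta)\,dW on a grid and reports the mean and variance of the resulting distribution of Y, which you can compare against the closed form we derive at the end of this post.

```python
import numpy as np

# Toy 1-D illustration (all values below are assumed):
# posterior  P(w|theta) = N(w | mu, 1/lam),  likelihood P(y|w) = N(y | w*x, 1/a)
a, x = 4.0, 1.5        # noise precision and the new input
mu, lam = 0.8, 10.0    # posterior mean and precision of the single weight w

def gauss(z, mean, prec):
    """Gaussian density with the given mean and precision."""
    return np.sqrt(prec / (2 * np.pi)) * np.exp(-0.5 * prec * (z - mean) ** 2)

w = np.linspace(-4.0, 6.0, 4001)       # grid over the weight w
ys = np.linspace(-2.5, 5.0, 601)       # grid over the output y

# numerically evaluate  P(y|theta) = integral of P(y|w) P(w|theta) dw  for every y
p_y = np.trapz(gauss(ys[:, None], x * w[None, :], a) * gauss(w, mu, lam), w, axis=1)

# mean and variance of the numerically marginalized distribution of Y
mean_y = np.trapz(ys * p_y, ys)
var_y = np.trapz((ys - mean_y) ** 2 * p_y, ys)
print(mean_y, var_y)   # compare with mu*x and 1/a + x*(1/lam)*x from the final result
```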

Returning to the derivation, let's complete it starting from the marginalization integral above.

P(Y|\theta) \propto \int e^{-\frac{a}{2}(Y-\boldsymbol{W^Tx})^2}\, e^{-\frac{1}{2}(\boldsymbol{W}-\boldsymbol{\mu})^T \boldsymbol{\Lambda}(\boldsymbol{W}-\boldsymbol{\mu})}dW\\\\ P(Y|\theta) \propto \int e^{-\frac{a}{2}(Y^2-2\boldsymbol{W^Tx}Y+(\boldsymbol{W^Tx})^2)-\frac{1}{2}(\boldsymbol{W^T\Lambda W}-2\boldsymbol{W^T\Lambda \mu}+\boldsymbol{\mu^T\Lambda \mu})}dW\\\\ P(Y|\theta) \propto \int e^{-\frac{1}{2}\left[(aY^2-2\boldsymbol{W^Tx}Ya+a\boldsymbol{W^Txx^TW})+(\boldsymbol{W^T\Lambda W}-2\boldsymbol{W^T\Lambda \mu}+\boldsymbol{\mu^T\Lambda \mu})\right]}dW\\\\ P(Y|\theta) \propto \int e^{-\frac{1}{2}\left[\boldsymbol{W^T}(a\boldsymbol{xx^T}+\boldsymbol{\Lambda})\boldsymbol{W}-2\boldsymbol{W^T}(\boldsymbol{x}Ya+\boldsymbol{\Lambda \mu})+aY^2+\boldsymbol{\mu^T\Lambda\mu}\right]}dW

We will use a technique similar to the one we used before when multiplying two Gaussians: "completing the square". But because our target probability is P(Y|\theta), we will ultimately need to collect the terms in Y, not W; first, though, we complete the square in \boldsymbol{W} so that the integral over W can be evaluated. To keep things tidy, let \boldsymbol{c}=a\boldsymbol{xx^T}+\boldsymbol{\Lambda} and \boldsymbol{m}=\boldsymbol{c^{-1}}(\boldsymbol{x}Ya+\boldsymbol{\Lambda \mu}).
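For reference, the completing-the-square identity we are about to apply (valid because \boldsymbol{c} is symmetric) is:

\boldsymbol{W^TcW}-2\boldsymbol{W^Tcm}+\boldsymbol{m^Tcm}=(\boldsymbol{W}-\boldsymbol{m})^T\boldsymbol{c}(\boldsymbol{W}-\boldsymbol{m})

so in the next step we add and subtract \boldsymbol{m^Tcm} inside the exponent to create exactly this quadratic form in \boldsymbol{W}.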

P(Y|\theta) \propto \int e^{-\frac{1}{2}(\boldsymbol{W^TcW}-2\boldsymbol{W^Tcm}+\boldsymbol{m^Tcm}-\boldsymbol{m^Tcm}+aY^2+\boldsymbol{\mu^T\Lambda \mu})}dW\\\\ P(Y|\theta) \propto \int e^{-\frac{1}{2}(\boldsymbol{W}-\boldsymbol{m})^T \boldsymbol{c}(\boldsymbol{W}-\boldsymbol{m})} \, e^{\frac{1}{2}(\boldsymbol{m^Tcm}-aY^2-\boldsymbol{\mu^T\Lambda \mu})}dW\\\\ P(Y|\theta) \propto e^{\frac{1}{2}(\boldsymbol{m^Tcm}-aY^2-\boldsymbol{\mu^T\Lambda \mu})}\int e^{-\frac{1}{2}(\boldsymbol{W}-\boldsymbol{m})^T \boldsymbol{c}(\boldsymbol{W}-\boldsymbol{m})} dW

In the last line we pulled the factor that does not depend on W outside the integral. We know that \int \text{constant} \times e^{-\frac{1}{2}(\boldsymbol{W}-\boldsymbol{m})^T \boldsymbol{c}(\boldsymbol{W}-\boldsymbol{m})} dW = 1, since it is a normalized Gaussian probability distribution; here the constant is the Gaussian normalizer \sqrt{|\boldsymbol{c}|/(2\pi)^d} for a d-dimensional \boldsymbol{W}. Thus the remaining integral over W equals \frac{1}{constant}, and our last formula becomes:

P(Y|\theta) \propto \frac{1}{constant}\,e^{\frac{1}{2}(\boldsymbol{m^Tcm}-aY^2-\boldsymbol{\mu^T\Lambda \mu})} \\\\ P(Y|\theta) \propto \frac{1}{constant}\,e^{-\frac{1}{2}(aY^2-\boldsymbol{m^Tcm}+\boldsymbol{\mu^T\Lambda \mu})}

We can drop \frac{1}{constant}, since we do not care about constant factors when working with a Gaussian up to proportionality. The only parameters we care about are the mean and variance; once we have them, the normalized Gaussian is fully determined (it integrates to 1). Substituting the \boldsymbol{m} and \boldsymbol{c} we defined before, we get:

P(Y|\theta) \propto e^{-\frac{1}{2}(aY^2-[\boldsymbol{c^{-1}}(\boldsymbol{x}Ya+\boldsymbol{\Lambda\mu})]^T\boldsymbol{cc^{-1}}(\boldsymbol{x}Ya+\boldsymbol{\Lambda\mu})+\boldsymbol{\mu^T\Lambda \mu})}\\\\ P(Y|\theta) \propto e^{-\frac{1}{2}(aY^2-(\boldsymbol{x}Ya+\boldsymbol{\Lambda\mu})^T(\boldsymbol{c}^{-1})^T\boldsymbol{cc^{-1}}(\boldsymbol{x}Ya+\boldsymbol{\Lambda\mu})+\boldsymbol{\mu^T\Lambda \mu})}\\\\ P(Y|\theta) \propto e^{-\frac{1}{2}(aY^2-(a\boldsymbol{x}^TY+\boldsymbol{\mu^T\Lambda^T})\boldsymbol{c^{-1}}(\boldsymbol{x}Ya+\boldsymbol{\Lambda\mu})+\boldsymbol{\mu^T\Lambda \mu})}

We get the last line because \boldsymbol{c} is symmetric (both a\boldsymbol{xx^T} and \boldsymbol{\Lambda} are symmetric), so (\boldsymbol{c}^{-1})^T=\boldsymbol{c}^{-1}, and because \boldsymbol{c}^{-1}\boldsymbol{c}=\boldsymbol{I}. Continuing the derivation, we get:

P(Y|\theta) \propto e^{-\frac{1}{2}(aY^2-(a^2\boldsymbol{x^Tc^{-1}x}Y^2+a\boldsymbol{x^Tc^{-1}\Lambda \mu}Y+a\boldsymbol{\mu^T\Lambda^Tc^{-1}x}Y+\boldsymbol{\mu^T\Lambda^Tc^{-1}\Lambda\mu})+\boldsymbol{\mu^T\Lambda \mu})}\\\\ P(Y|\theta) \propto e^{-\frac{1}{2}(aY^2-(a^2\boldsymbol{x^Tc^{-1}x}Y^2+2(a\boldsymbol{x^Tc^{-1}\Lambda \mu})Y+\boldsymbol{\mu^T\Lambda^Tc^{-1}\Lambda\mu})+\boldsymbol{\mu^T\Lambda \mu})}\\\\ P(Y|\theta) \propto e^{-\frac{1}{2}(aY^2-a^2\boldsymbol{x^Tc^{-1}x}Y^2-2(a\boldsymbol{x^Tc^{-1}\Lambda \mu})Y-\boldsymbol{\mu^T\Lambda^Tc^{-1}\Lambda\mu}+\boldsymbol{\mu^T\Lambda \mu})}\\\\ P(Y|\theta) \propto e^{-\frac{1}{2}((a-a^2\boldsymbol{x^Tc^{-1}x})Y^2-2(a\boldsymbol{x^Tc^{-1}\Lambda \mu})Y-\boldsymbol{\mu^T\Lambda^Tc^{-1}\Lambda\mu}+\boldsymbol{\mu^T\Lambda \mu})}

In the last formula above we have successfully gathered the coefficients of the terms in Y. Let's do "completing the square" now, this time in Y. Our new probability is P(Y|\theta) \rightarrow \mathcal{N}(Y|\mu_{new}, \Lambda_{new}^{-1}).

P(Y|\theta) \propto e^{-\frac{1}{2}(\boldsymbol{Y}-\mu_{new})^T\Lambda_{new}(\boldsymbol{Y}-\mu_{new})} \\\\ P(Y|\theta) \propto e^{-\frac{1}{2}(\boldsymbol{Y^T\Lambda_{new}Y}-\boldsymbol{Y^T\Lambda_{new}\mu_{new}}-\boldsymbol{\mu_{new}^T\Lambda_{new}Y}+\boldsymbol{\mu_{new}^T\Lambda_{new}\mu_{new}})} \\\\ P(Y|\theta) \propto e^{-\frac{1}{2}(\boldsymbol{Y^T\Lambda_{new}Y}-2\boldsymbol{Y^T\Lambda_{new}\mu_{new}}+\boldsymbol{\mu_{new}^T\Lambda_{new}\mu_{new}})}

By comparing the coefficients of Y^2, we can get our \boldsymbol{\Lambda_{new}}.

\Lambda_{new}=(a-a^2\boldsymbol{x^Tc^{-1}x})

And by comparing the coefficients of Y, we can get our \boldsymbol{\mu_{new}}.

2\boldsymbol{\Lambda_{new}\mu_{new}}=2a\boldsymbol{x^Tc^{-1}\Lambda\mu}\\\\ \boldsymbol{\mu_{new}}=\boldsymbol{\Lambda_{new}}^{-1}a\boldsymbol{x^Tc^{-1}\Lambda\mu}

We still have to calculate \boldsymbol{c^{-1}}, where \boldsymbol{c}=a\boldsymbol{xx^T}+\boldsymbol{\Lambda}. By using the Sherman-Morrison formula, \boldsymbol{c^{-1}}=\boldsymbol{\Lambda^{-1}}-\frac{\boldsymbol{\Lambda^{-1}}a\boldsymbol{xx^T}\boldsymbol{\Lambda^{-1}}}{1+a\boldsymbol{x^T\Lambda^{-1}x}}. Doing some algebra (sketched right after the boxed result below), our {\Lambda_{new}}^{-1} and {\mu_{new}} become:

\boxed{\Lambda_{new}^{-1}=\sigma^2_{new}=\frac{1}{a}+\boldsymbol{x}^T\boldsymbol{\Lambda}^{-1}\boldsymbol{x}},\,\, \boxed{\mu_{new}=\boldsymbol{\mu}^T\boldsymbol{x}}
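For completeness, here is a sketch of the algebra behind the boxed result (every symbol is as already defined above): first simplify \boldsymbol{x^Tc^{-1}x} and \boldsymbol{x^Tc^{-1}\Lambda\mu} with the Sherman-Morrison expression, then substitute them into \Lambda_{new} and \mu_{new}.

\boldsymbol{x^Tc^{-1}x}=\boldsymbol{x^T\Lambda^{-1}x}-\frac{a(\boldsymbol{x^T\Lambda^{-1}x})^2}{1+a\boldsymbol{x^T\Lambda^{-1}x}}=\frac{\boldsymbol{x^T\Lambda^{-1}x}}{1+a\boldsymbol{x^T\Lambda^{-1}x}}\\\\ \Lambda_{new}=a-a^2\boldsymbol{x^Tc^{-1}x}=\frac{a}{1+a\boldsymbol{x^T\Lambda^{-1}x}}\,\Rightarrow\,\Lambda_{new}^{-1}=\frac{1}{a}+\boldsymbol{x^T\Lambda^{-1}x}\\\\ \boldsymbol{x^Tc^{-1}\Lambda\mu}=\boldsymbol{x^T\mu}-\frac{a\,\boldsymbol{x^T\Lambda^{-1}x}\,\boldsymbol{x^T\mu}}{1+a\boldsymbol{x^T\Lambda^{-1}x}}=\frac{\boldsymbol{x^T\mu}}{1+a\boldsymbol{x^T\Lambda^{-1}x}}\\\\ \mu_{new}=\Lambda_{new}^{-1}\,a\,\boldsymbol{x^Tc^{-1}\Lambda\mu}=\frac{1+a\boldsymbol{x^T\Lambda^{-1}x}}{a}\cdot\frac{a\,\boldsymbol{x^T\mu}}{1+a\boldsymbol{x^T\Lambda^{-1}x}}=\boldsymbol{\mu^Tx}

As a quick sanity check, here is a minimal NumPy sketch that evaluates the boxed formulas and verifies them by sampling. All the numbers below (a, \boldsymbol{\mu}, \boldsymbol{\Lambda}, and the feature vector \boldsymbol{x}) are assumed toy values, not values from the posts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy posterior over W (in practice it comes from #Part1) and a new feature vector x
a = 25.0                                  # noise precision (assumed)
mu = np.array([0.3, -0.7, 1.1])           # posterior mean of W (assumed)
Lam = np.array([[5.0, 1.0, 0.0],          # posterior precision of W (assumed)
                [1.0, 4.0, 0.5],
                [0.0, 0.5, 3.0]])
x = np.array([1.0, 0.4, 0.16])            # e.g. polynomial features [1, x, x^2] for x = 0.4

Lam_inv = np.linalg.inv(Lam)

# closed-form predictive distribution derived above
pred_mean = mu @ x                        # mu^T x
pred_var = 1.0 / a + x @ Lam_inv @ x      # 1/a + x^T Lam^{-1} x

# Monte Carlo check: sample W ~ N(mu, Lam^{-1}), then Y ~ N(W^T x, 1/a)
W = rng.multivariate_normal(mu, Lam_inv, size=200_000)
Y = W @ x + rng.normal(0.0, np.sqrt(1.0 / a), size=W.shape[0])

print(pred_mean, Y.mean())                # these two should agree closely
print(pred_var, Y.var())                  # and so should these two
```

The predictive variance \sigma^2_{new} is the additional benefit of the Bayesian treatment mentioned at the beginning of this post: besides the point prediction \mu_{new}=\boldsymbol{\mu^Tx}, we also get an uncertainty estimate for every new input \boldsymbol{x}, something plain LSE regression does not give us.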
