Bayesian Linear / Polynomial Regression #Part2: Deriving Predictive Distribution

We already derived the posterior update formula P(W|D) for Bayesian regression here; it is the distribution of our regression parameters \textbf{W} given the data set D. We are not directly interested in the value of W itself; rather, we are interested in the value of Y given a new input x. This is exactly the regression problem: given a new value x, we want to predict the continuous output value Y. We already solved the linear regression problem using LSE (Least Squares Error) here. In this post, we will do regression from the Bayesian point of view. Using the Bayesian approach in regression gives us an additional benefit, which we will see at the end of this post.

From #Part1 here, we already have P(W|D). To do regression from the Bayesian point of view, we have to derive the predictive distribution, so that we obtain the probability of Y, P(Y|\theta). We can achieve that by marginalizing over W. Here we go.

P(Y|\theta)=\int P(Y|W)P(W|\theta)dW

where P(Y|W) is the likelihood and P(W|\theta) is the posterior we derived here.
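To make the result of this marginalization concrete, here is a minimal numerical sketch for the linear-Gaussian case: a Gaussian likelihood with known noise precision beta and a zero-mean isotropic Gaussian prior with precision alpha, for which the predictive distribution is again Gaussian with mean \phi(x)^T m_N and variance 1/\beta + \phi(x)^T S_N \phi(x). The variable names (Phi, alpha, beta, m_N, S_N) are illustrative and not taken from the original post.

```python
# Minimal sketch (assumptions: Gaussian likelihood with known noise precision
# beta, zero-mean isotropic Gaussian prior with precision alpha, polynomial
# features; names are illustrative, not from the original post).
import numpy as np

def polynomial_features(x, degree):
    # Design matrix: each row is [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

def posterior(Phi, t, alpha, beta):
    # Posterior over W is Gaussian: N(W | m_N, S_N)
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(x_new, m_N, S_N, beta, degree):
    # Marginalizing W out gives a Gaussian predictive distribution for Y
    phi = polynomial_features(np.atleast_1d(x_new), degree)
    mean = phi @ m_N
    var = 1.0 / beta + np.sum(phi @ S_N * phi, axis=1)
    return mean, var

# Toy usage
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(x.size)
Phi = polynomial_features(x, degree=3)
m_N, S_N = posterior(Phi, t, alpha=1e-2, beta=100.0)
mean, var = predictive(0.5, m_N, S_N, beta=100.0, degree=3)
```

The variance term is where the extra benefit of the Bayesian treatment shows up: the prediction comes with an uncertainty estimate, not just a point value.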

Understanding Multivariate Gaussian, Gaussian Properties and Gaussian Mixture Model

(1) Multivariate Gaussian

We already discussed the Gaussian distribution function with one variable (univariate) here. In this post, we will discuss the Gaussian distribution function with multiple variables (multivariate), which is the general form of the Gaussian distribution. For a k-dimensional vector \textbf{x}, the multivariate Gaussian distribution is defined as follows.

\boxed{ \mathcal{N}(x|\boldsymbol{\mu,\Sigma})=\frac{1}{\sqrt{(2\pi)^k}|\boldsymbol{ \Sigma}|^{\frac{1}{2}}}e^{-\frac{1}{2}(\boldsymbol{x-\mu})^T\boldsymbol{\Sigma^{-1}}(\boldsymbol{x-\mu})}}

\boldsymbol{\Sigma}=\begin{bmatrix}  \sigma_{X_1}^2&COV(X_1,X_2)&COV(X_1,X_3)&\dots&COV(X_1,X_k)\\  COV(X_2,X_1)&\sigma_{X_2}^2&COV(X_2,X_3)&\dots&COV(X_2,X_k)\\  COV(X_3,X_1)&COV(X_3,X_2)&\sigma_{X_3}^2&\dots&COV(X_3,X_k)\\  \vdots& & &\ddots&\vdots\\  COV(X_k,X_1)&COV(X_k,X_2)&COV(X_k,X_3)&\dots&\sigma_{X_k}^2  \end{bmatrix},\quad \boldsymbol{x}=\begin{bmatrix}x_1\\x_2\\x_3\\\vdots\\x_k\end{bmatrix},\quad \boldsymbol{\mu}=\begin{bmatrix}\mu_1\\\mu_2\\\mu_3\\\vdots\\\mu_k\end{bmatrix}

where \boldsymbol{\mu} is the k-dimensional mean vector, \boldsymbol{\Sigma} is the k \times k covariance matrix, and |\boldsymbol{\Sigma}| is the determinant of \boldsymbol{\Sigma}.
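As a quick numerical illustration of the definition above, here is a minimal sketch that evaluates this density with plain numpy; the function and variable names are mine, for illustration only.

```python
# Minimal sketch: evaluating the multivariate Gaussian density from the
# definition above (names are illustrative, not from the original post).
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    k = mu.shape[0]
    diff = x - mu
    # Normalization constant: sqrt((2*pi)^k * |Sigma|)
    norm_const = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    # Quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)
    return np.exp(exponent) / norm_const

# Toy usage: a 2-dimensional Gaussian
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(multivariate_gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
# For comparison, scipy.stats.multivariate_normal(mean=mu, cov=Sigma).pdf(...)
# should give the same value.
```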

Using Gaussian Distribution for Online Learning/Sequential Learning in Bayesian Inference

We already discussed how online learning works here using conjugate distributions, with the Binomial distribution as the likelihood and the Beta distribution as the conjugate prior. In this post, we will use the Gaussian distribution for online learning in Bayesian inference. The conjugate prior of a Gaussian likelihood (with respect to its mean) is a Gaussian itself; that is why we call the Gaussian distribution self-conjugate. Let's try to derive it.

Given trial results D=\{ x_1,x_2,x_3,\dots ,x_n\}, from Bayes' formula we get:

P(\theta_{new}|D)=\frac{P(D|\theta_{old})P(\theta_{old})}{P(D)}

We will derive the posterior P(\theta_{new}|D), given the likelihood P(D|\theta_{old}) and the prior distribution P(\theta_{old}). The parameters \theta of a Gaussian are in this case \mu and \sigma^2. In this post, we will demonstrate how to calculate the posterior P(\theta_{new}|D) under the assumption that \sigma_{new}^2, \mu_{old}, and \sigma_{old}^2 are known; thus, we will only learn the parameter \mu. We will ignore the marginal probability P(D) for now, since it is only a normalization constant. Proceeding from our formula above, we can do as follows.
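Under these assumptions, the posterior over \mu is again Gaussian, and the standard closed-form update is that precisions add while the posterior mean is a precision-weighted average of the prior mean and the data. Here is a minimal sketch of that sequential update in code; the names (update_mu, var_noise, etc.) are illustrative, not from the derivation in the post.

```python
# Minimal sketch of the sequential update for the Gaussian mean with known
# noise variance (standard closed-form result for a Gaussian prior on mu;
# names are illustrative).
import numpy as np

def update_mu(mu_old, var_old, x, var_noise):
    # One observation x: combine the prior N(mu_old, var_old) with the
    # likelihood N(x | mu, var_noise); the posterior is again Gaussian.
    var_new = 1.0 / (1.0 / var_old + 1.0 / var_noise)
    mu_new = var_new * (mu_old / var_old + x / var_noise)
    return mu_new, var_new

# Online/sequential learning: feed the data points one at a time,
# using the previous posterior as the next prior.
mu, var = 0.0, 10.0          # prior N(0, 10)
var_noise = 1.0              # assumed known likelihood variance
for x in np.random.normal(3.0, 1.0, size=100):
    mu, var = update_mu(mu, var, x, var_noise)
print(mu, var)               # mu approaches the true mean 3.0, var shrinks
```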

Deriving Gaussian Distribution

The Gaussian is a very important distribution. In this post, we will discuss the Gaussian distribution in detail by deriving it, calculating its integral, and doing MLE (Maximum Likelihood Estimation). Deriving the Gaussian distribution is more difficult in Cartesian coordinates, so we will use polar coordinates. Before we derive the Gaussian using polar coordinates, let's first talk about how to change the coordinate system from Cartesian to polar.

(1) Changing the coordinate system from Cartesian to polar coordinates

Changing the coordinate system from Cartesian to polar is useful, for example when we calculate the integral of a certain function: in some cases the polar coordinate system makes the integral far easier to evaluate. To do that, we can use the Jacobian matrix. The Jacobian matrix is defined by the partial derivatives of one vector with respect to another vector. In our case, changing from Cartesian to polar coordinates, the Jacobian matrix of (x,y) in Cartesian coordinates with respect to (r,\theta) in polar coordinates is:

J = \begin{bmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta}\end{bmatrix}
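As a quick worked step (standard calculus, included here for concreteness): substituting x = r\cos\theta and y = r\sin\theta gives

J = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}, \quad |J| = r\cos^2\theta + r\sin^2\theta = r

so the area element transforms as dx\,dy = |J|\,dr\,d\theta = r\,dr\,d\theta, which is exactly the substitution we will use when evaluating the Gaussian integral in polar coordinates.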