Bayesian Linear / Polynomial Regression #Part2: Deriving Predictive Distribution

We already derived the posterior update formula P(W|D) for Bayesian regression here; it is the distribution of our regression parameters \textbf{W} given the data set D. However, we are not directly interested in the value of W. What we want is the value of Y for a new input x. This is exactly the usual regression problem: given a new value x, we want to predict the output Y, which is continuous. We already solved the linear regression problem using LSE (Least Square Error) here. In this post, we will do regression from the Bayesian point of view. Doing regression the Bayesian way gives us an additional benefit, which we will see at the end of this post.

From #Part1 here, we already have P(W|D). To do regression from the Bayesian point of view, we have to derive the predictive distribution, which gives us the probability of Y, P(Y|\theta). We can obtain it by marginalizing over W. Here we go.

P(Y|\theta)=\int P(Y|W)P(W|\theta)dW

where P(Y|W) is the likelihood and P(W|\theta) is the posterior we derived here.
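The marginalization above can also be checked numerically. The following is only a minimal sketch, assuming a Gaussian posterior over W and Gaussian observation noise; under those assumptions the predictive distribution is Gaussian with the closed form used in `predictive_closed_form`. The posterior mean `m`, covariance `S`, noise precision `beta`, and the helper names are made-up values for illustration, not results from this series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior over the weights, W ~ N(m, S), for a degree-2 polynomial.
m = np.array([0.5, -1.0, 0.3])      # posterior mean (assumed)
S = np.diag([0.04, 0.02, 0.01])     # posterior covariance (assumed)
beta = 25.0                         # noise precision, 1 / sigma^2 (assumed)

def design_vector(x, degree=2):
    """Polynomial features [1, x, x^2, ...] for a scalar input x."""
    return np.array([x ** k for k in range(degree + 1)])

def predictive_closed_form(x):
    """Mean and variance of the Gaussian predictive distribution at x."""
    phi = design_vector(x)
    mean = phi @ m
    var = 1.0 / beta + phi @ S @ phi
    return mean, var

def predictive_monte_carlo(x, n_samples=200_000):
    """Approximate the integral over W: sample W, then sample Y given W."""
    phi = design_vector(x)
    W = rng.multivariate_normal(m, S, size=n_samples)   # draws from the posterior
    y = W @ phi + rng.normal(0.0, np.sqrt(1.0 / beta), size=n_samples)
    return y.mean(), y.var()

mean_cf, var_cf = predictive_closed_form(1.5)
mean_mc, var_mc = predictive_monte_carlo(1.5)
print(mean_cf, var_cf)
print(mean_mc, var_mc)
```

The two printed pairs should agree closely. The predictive variance has two parts: the noise term 1/beta and the term that comes from the remaining uncertainty in W.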

Bayesian Linear / Polynomial Regression #Part1: Prove LSE vs Bayesian Regression and Derive Posterior Update

At the beginning of our article series, we already discussed how to derive polynomial regression using LSE (Least Square Error) here. In this post, we will discuss linear regression from the Bayesian point of view. Note that linear and polynomial regression share the same derivation here; the only difference is the design matrix. You may want to revisit our previous two articles here and here.

I mentioned at the beginning that the result of deriving regression using LSE is equal to the result of deriving linear regression using MLE (Maximum Likelihood Estimation) in the Bayesian method. Furthermore, the result of deriving regression using LSE with regularization is equal to the result obtained using MAP (Maximum A Posteriori) estimation in the Bayesian method. In this post, we will try to prove both claims. We will then proceed to derive the posterior update formula for online learning using a conjugate prior.
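Before the proof, both claims can be sanity-checked numerically. This is a minimal sketch, assuming Gaussian noise with precision `beta` and a zero-mean Gaussian prior on W with precision `alpha`. Both values, the synthetic data, and the function names are assumptions for illustration. Under these assumptions the regularization strength that matches MAP is lambda = alpha / beta.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (made up for illustration): y = 1 + 2x - x^2 + noise.
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x - x ** 2 + rng.normal(0.0, 0.1, size=50)

# Design matrix for a degree-2 polynomial: columns [1, x, x^2].
Phi = np.vander(x, N=3, increasing=True)

beta = 100.0   # assumed noise precision (1 / 0.1^2)
alpha = 2.0    # assumed prior precision, W ~ N(0, I / alpha)

# Least squares, and regularized least squares with lambda = alpha / beta.
w_lse = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ y)

def grad_neg_log_likelihood(w):
    """Gradient of -log P(D|W) under Gaussian noise with precision beta."""
    return -beta * Phi.T @ (y - Phi @ w)

def grad_neg_log_posterior(w):
    """Gradient of -log P(W|D); the evidence P(D) does not depend on w."""
    return grad_neg_log_likelihood(w) + alpha * w

# Both gradients vanish (up to rounding), so the LSE solution is the MLE
# and the regularized LSE solution is the MAP estimate.
g_mle = grad_neg_log_likelihood(w_lse)
g_map = grad_neg_log_posterior(w_ridge)
print(np.abs(g_mle).max(), np.abs(g_map).max())
```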

 

(1) Regression using LSE = MLE Bayesian?

See picture below.


Using Gaussian Distribution for Online Learning/Sequential Learning in Bayesian Inference

We already discussed how online learning works here, using conjugate distributions with the Binomial distribution as the likelihood and the Beta distribution as the conjugate prior. In this post, we will use the Gaussian distribution for online learning in Bayesian inference. The conjugate prior for the mean of a Gaussian likelihood (with known variance) is itself Gaussian. That’s why we call the Gaussian distribution self-conjugate. Let’s try to derive it.

Given the trial results D=\begin{Bmatrix} x_1,x_2,x_3,... ,x_n\end{Bmatrix}, from Bayes’ formula we get:

P(\theta_{new}|D)=\frac{P(D|\theta_{old})P(\theta_{old})}{P(D)}

We will derive the posterior P(\theta_{new}|D), given the likelihood P(D|\theta_{old}) and the prior distribution P(\theta_{old}). The parameters \theta of a Gaussian are \mu and \sigma^2. In this post, we will demonstrate how to calculate the posterior P(\theta_{new}|D) under the assumption that \sigma_{new}^2, \mu_{old}, \sigma_{old}^2 are known. Thus, we will only learn the parameter \mu. We will ignore the marginal probability P(D) for now, since it is only a normalization constant. Continuing from the formula above, we proceed as follows.
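For orientation, the update can be sketched in code. This is a minimal sketch that uses the standard conjugate result for a Gaussian likelihood with known noise variance and a Gaussian prior on \mu: precisions add, and the new mean is a precision-weighted combination of the prior mean and the data. The function name `update_gaussian_mean`, the prior values, and the simulated observations are assumptions for illustration only.

```python
import numpy as np

def update_gaussian_mean(mu_old, var_old, data, var_noise):
    """Posterior over the mean mu after observing `data`, noise variance known."""
    data = np.asarray(data, dtype=float)
    n = data.size
    precision_new = 1.0 / var_old + n / var_noise
    var_new = 1.0 / precision_new
    mu_new = var_new * (mu_old / var_old + data.sum() / var_noise)
    return mu_new, var_new

rng = np.random.default_rng(2)
observations = rng.normal(3.0, 2.0, size=100)   # simulated: true mean 3, std 2

# Batch update: use all observations at once.
mu_batch, var_batch = update_gaussian_mean(0.0, 10.0, observations, 4.0)

# Online update: one observation at a time; each posterior is the next prior.
mu_seq, var_seq = 0.0, 10.0
for x_i in observations:
    mu_seq, var_seq = update_gaussian_mean(mu_seq, var_seq, [x_i], 4.0)

print(mu_batch, var_batch)
print(mu_seq, var_seq)
```

The batch and online results agree up to rounding, which is what makes the conjugate form convenient for sequential learning.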

Understanding Online/Sequential Learning in Bayesian Inference

Now that we understand the Bernoulli, Binomial and Beta distributions discussed here, we are ready to understand online learning as used in Bayesian inference. From Bayes’ theorem, which we discussed here, we have the equation below.

P(B|A)=\frac{P(A|B)P(B)}{P(A)}

And for multiple classes c_1, \,c_2, ... , \,c_m with multiple attributes \theta_1, \,\theta_2,... ,\,\theta_n, we can write it as follows.

P(c_i|\theta_1,\,\theta_2,\,... ,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c_i)P(c_i)}{P(\theta_1,\,\theta_2,\,... ,\,\theta_n)}

Using the sum rule, where m is the number of classes, we can rewrite the denominator as:

P(c_i|\theta_1,\,\theta_2,\,... ,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c_i)P(c_i)}{\sum_{j=1}^{m}P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c_j)P(c_j)},  for discrete system

P(c_i|\theta_1,\,\theta_2,\,... ,\,\theta_n)=\frac{P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c_i)P(c_i)}{\int_{}^{}P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c)P(c)dc},  for continuous system

Here, P(c_i|\theta_1,\,\theta_2,\,... ,\,\theta_n) is the posterior probability, P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c_i) is the likelihood, P(c_i) is the prior probability, and \sum_{j=1}^{m}P(\theta_1,\,\theta_2,\,... ,\,\theta_n|c_j)P(c_j) is the evidence or marginal probability.
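As a small worked example of the discrete case, the sketch below computes the posterior over two classes. The evidence is the sum over classes of likelihood times prior, which is what makes the posterior sum to one. The prior and likelihood numbers and the function name `class_posterior` are made up for illustration.

```python
import numpy as np

def class_posterior(likelihoods, priors):
    """Posterior P(c_j | theta) for every class j, plus the evidence P(theta)."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    evidence = np.sum(likelihoods * priors)       # sum rule over the classes
    return likelihoods * priors / evidence, evidence

# Hypothetical numbers: priors P(c_1), P(c_2) and likelihoods P(theta | c_j).
priors = [0.7, 0.3]
likelihoods = [0.2, 0.9]

posterior, evidence = class_posterior(likelihoods, priors)
print(evidence)    # 0.14 + 0.27
print(posterior)   # sums to 1
```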

In online learning, we update our prior probability whenever we run new trials. For example, in the first stage we toss a coin several times, and we model the prior probability with P_1(X). At this point, we use the prior P_1(X) to estimate our posterior probability; call this posterior P_2(X). In the next stage of trials, we use P_2(X) as our prior to estimate the next posterior. We repeat this each time we run new trials. That’s why we call this online learning. Some references also call it sequential learning.
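The coin-tossing loop described above can be sketched with the Beta distribution as the conjugate prior for coin tosses. This is only an illustrative sketch: the uniform Beta(1, 1) starting prior, the toss data, and the function name `update_beta` are assumptions, not values from this post.

```python
def update_beta(a, b, tosses):
    """Beta(a, b) prior plus coin tosses (1 = head, 0 = tail) -> Beta posterior."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return a + heads, b + tails

stage_1 = [1, 0, 1, 1]          # first batch of trials (made up)
stage_2 = [0, 1, 1, 1, 0, 1]    # second batch of trials (made up)

# Stage 1: start from the prior P_1 and obtain the posterior P_2.
a1, b1 = update_beta(1, 1, stage_1)

# Stage 2: P_2 now plays the role of the prior.
a2, b2 = update_beta(a1, b1, stage_2)

# Processing all the trials at once gives the same posterior.
a_all, b_all = update_beta(1, 1, stage_1 + stage_2)

posterior_mean = a2 / (a2 + b2)   # estimated probability of a head
print((a1, b1), (a2, b2), (a_all, b_all), posterior_mean)
```

Because each posterior stays in the Beta family, only the two counts need to be stored between stages; the raw trials can be discarded.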