# Bayesian Linear / Polynomial Regression #Part2: Deriving Predictive Distribution

In Part 1 we derived the posterior update formula $P(W|D)$ for Bayesian regression, which gives the distribution of the regression parameters $\textbf{W}$ given the data set $D$. We are not directly interested in the value of $W$, though; we are interested in the value of $Y$ given a new input $x$. This is the same as any regression problem: given a new value of $x$, we want to predict the output $Y$, which is continuous-valued. We already solved the linear regression problem using LSE (Least Square Error) in an earlier post. In this post we will do regression from the Bayesian point of view. Doing regression the Bayesian way gives an additional benefit, which we will see at the end of this post.

From #Part1 we already have $P(W|D)$. To do regression from the Bayesian point of view, we have to derive the predictive distribution, which gives us the probability of $Y$, $P(Y|\theta)$. We can obtain it by marginalizing over $W$. Here we go.

$P(Y|\theta)=\int P(Y|W)P(W|\theta)dW$

where $P(Y|W)$ is the likelihood and $P(W|\theta)$ is the posterior we derived in #Part1.
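To make the marginalization concrete, here is a minimal sketch in Python. It is not the derivation from the post: it assumes a Gaussian likelihood with known noise precision `beta`, a zero-mean Gaussian prior with precision `alpha`, and polynomial features, and all function names are made up for illustration. Under those assumptions the posterior is Gaussian, $N(m, S)$, and the integral has a closed form whose variance is $1/\beta + \phi(x)^T S \phi(x)$. The sketch also approximates the same integral by sampling $W$ from the posterior, so the two results can be compared.

```python
import numpy as np

def design_matrix(x, degree):
    """Polynomial features [1, x, x^2, ..., x^degree] for each input."""
    return np.vander(np.atleast_1d(x), degree + 1, increasing=True)

def posterior(x, y, degree, alpha, beta):
    """Gaussian posterior N(m, S) over W (assumed prior: N(0, I / alpha))."""
    Phi = design_matrix(x, degree)
    S = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)
    S = (S + S.T) / 2.0  # remove tiny asymmetries from the inversion
    m = beta * S @ Phi.T @ y
    return m, S

def predictive(x_new, m, S, degree, beta):
    """Closed-form predictive mean and variance at the new inputs."""
    Phi_new = design_matrix(x_new, degree)
    mean = Phi_new @ m
    var = 1.0 / beta + np.sum((Phi_new @ S) * Phi_new, axis=1)
    return mean, var

# Toy data (assumed for illustration): noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

alpha, beta, degree = 2.0, 25.0, 3
m, S = posterior(x, y, degree, alpha, beta)
mean, var = predictive(0.5, m, S, degree, beta)

# Monte Carlo version of the integral: draw W from the posterior,
# then draw Y from the likelihood given each sampled W.
n_samples = 200_000
W_samples = rng.multivariate_normal(m, S, size=n_samples)
phi = design_matrix(0.5, degree)[0]
y_samples = W_samples @ phi + rng.normal(scale=np.sqrt(1.0 / beta), size=n_samples)

print("closed form :", mean[0], var[0])
print("monte carlo :", y_samples.mean(), y_samples.var())
```

The predictive variance is never smaller than the noise variance `1 / beta`; the extra term reflects how uncertain we still are about $W$.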

# Deriving Polynomial Regression with Regularization to Avoid Overfitting

After discussing polynomial regression using LSE (Least Square Error), we know that a higher-order polynomial model is more capable of fitting complex data points, but is also more prone to overfitting. The picture below illustrates this: the red line (a high-order polynomial) fits the blue data points exactly, but gives a large error elsewhere, for example around 0.9 on the horizontal axis. That is what we call overfitting (the model fits the training data too closely). In this case the green line is better, because it is a more general model of those data points.

We can avoid overfitting by using so-called regularization. How does it work? Usually, a function is prone to overfitting when its coefficients (weights) have large values and are not well distributed. Thus, we force the training process to keep those coefficients small by adding a penalty term to our cost function. This also makes the coefficients more evenly distributed. The new cost function is therefore the original squared error plus this penalty term.
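As a rough illustration of the idea, the sketch below uses one common choice of penalty, the squared L2 norm of the weights, so the cost becomes $\frac{1}{2}\sum_n (y_n - w^T\phi(x_n))^2 + \frac{\lambda}{2}\|w\|^2$. This particular form, the value of `lam`, the toy data, and the function name `fit_polynomial` are assumptions made for the example, not taken from the post. Minimizing this cost is equivalent to an ordinary least-squares problem on an augmented system, which is what the code solves.

```python
import numpy as np

def fit_polynomial(x, y, degree, lam=0.0):
    """Least-squares polynomial fit with an (assumed) L2 penalty of weight lam.

    Minimizing ||Phi w - y||^2 + lam * ||w||^2 is the same as ordinary least
    squares on Phi stacked with sqrt(lam) * I and y padded with zeros.
    """
    Phi = np.vander(x, degree + 1, increasing=True)
    A = np.vstack([Phi, np.sqrt(lam) * np.eye(degree + 1)])
    b = np.concatenate([y, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def training_error(x, y, w):
    """Sum of squared errors on the training points."""
    Phi = np.vander(x, len(w), increasing=True)
    return float(np.sum((Phi @ w - y) ** 2))

# Toy data (assumed): 10 noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

degree = 9
w_plain = fit_polynomial(x, y, degree, lam=0.0)   # no regularization
w_reg = fit_polynomial(x, y, degree, lam=1e-3)    # with L2 penalty

print("weight norm, plain      :", np.linalg.norm(w_plain))
print("weight norm, regularized:", np.linalg.norm(w_reg))
print("training error, plain      :", training_error(x, y, w_plain))
print("training error, regularized:", training_error(x, y, w_reg))
```

The unregularized fit passes through the training points almost exactly but needs much larger coefficients; the penalized fit gives up a little training accuracy in exchange for smaller weights.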