Understanding Multivariate Gaussian, Gaussian Properties and Gaussian Mixture Model

(1) Multivariate Gaussian

We already discussed the Gaussian distribution with one variable (univariate) here. In this post, we will discuss the Gaussian distribution with multiple variables (multivariate), which is the general form of the Gaussian distribution. For a k-dimensional vector \textbf{x}, the multivariate Gaussian distribution is defined as follows.

\boxed{\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})=\frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}}e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}}

\boldsymbol{\Sigma}=\begin{bmatrix}  \sigma^2_{X_1}&COV(X_1,X_2)&COV(X_1,X_3)&\cdots&COV(X_1,X_k)\\\\  COV(X_2,X_1)&\sigma^2_{X_2}&COV(X_2,X_3)&\cdots&COV(X_2,X_k)\\\\  COV(X_3,X_1)&COV(X_3,X_2)&\sigma^2_{X_3}&\cdots&COV(X_3,X_k)\\\\  \vdots&\vdots&\vdots&\ddots&\vdots\\\\  COV(X_k,X_1)&COV(X_k,X_2)&COV(X_k,X_3)&\cdots&\sigma^2_{X_k}  \end{bmatrix},\, \boldsymbol{x}=\begin{bmatrix}x_1\\\\x_2\\\\x_3\\\\\vdots\\\\x_k\end{bmatrix},\, \boldsymbol{\mu}=\begin{bmatrix}\mu_1\\\\\mu_2\\\\\mu_3\\\\\vdots\\\\\mu_k\end{bmatrix}

where \boldsymbol{\mu} is a k-dimensional mean vector, \boldsymbol{\Sigma} is a k \times k covariance matrix (its diagonal entries are the variances \sigma^2_{X_i}), and |\boldsymbol{\Sigma}| is the determinant of \boldsymbol{\Sigma}.

(\boldsymbol{x-\mu})^T\boldsymbol{\Sigma^{-1}}(\boldsymbol{x-\mu})=\Delta^2 is called the squared Mahalanobis distance from \boldsymbol{\mu} to \boldsymbol{x}, and it reduces to the squared Euclidean distance when \boldsymbol{\Sigma} is the identity matrix.
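As a quick sanity check, here is a minimal NumPy sketch of the boxed density, compared against scipy.stats.multivariate_normal. The values of \boldsymbol{\mu}, \boldsymbol{\Sigma} and \boldsymbol{x} below are arbitrary examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    k = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
    maha_sq = diff @ np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha_sq) / norm_const

mu = np.array([0.0, 1.0])              # example mean
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])         # example covariance
x = np.array([0.5, 0.5])

print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match
```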

 

(2) Gaussian properties

In this sub-topic, we will discuss three useful properties of Gaussians: the affine property, the marginal distribution property, and the conditional distribution property.

2.1 Affine property of Gaussian distribution

Gaussians have an affine property that is useful for forming a new Gaussian distribution from a linear combination of Gaussian-distributed inputs, for example the sum of two Gaussian-distributed inputs. The affine property of the Gaussian distribution is defined as follows.

\boxed{\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\Leftrightarrow (\boldsymbol{Ax}+\boldsymbol{b}) \sim \mathcal{N}(\boldsymbol{A\mu}+\boldsymbol{b},\boldsymbol{A\Sigma A}^T)}

Let’s demonstrate this by summing two Gaussian-distributed inputs. Given vectors \boldsymbol{x_1}, \boldsymbol{x_2} with Gaussian distributions \mathcal{N}_{x_1}(\boldsymbol{\mu_{x_1}},\boldsymbol{\Sigma_{x_1}}) and \mathcal{N}_{x_2}(\boldsymbol{\mu_{x_2}},\boldsymbol{\Sigma_{x_2}}), we can use the affine property to derive the new Gaussian distribution \mathcal{N}_{y}(\boldsymbol{\mu_y},\boldsymbol{\Sigma_y}) of the vector \boldsymbol{y=x_1} + \boldsymbol{x_2}. Here we go.

First, we rewrite \boldsymbol{y=x_1} + \boldsymbol{x_2} in the form (\boldsymbol{Ax}+\boldsymbol{b}).

\boldsymbol{y}=\boldsymbol{x_1}+\boldsymbol{x_2}\\\\ \boldsymbol{y}=\begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}\boldsymbol{x_1}\\\boldsymbol{x_2}\end{bmatrix}+\boldsymbol{0}\\\\  \boldsymbol{y}=\boldsymbol{Ax}+\boldsymbol{b}

Thus, we get \boldsymbol{A}=\begin{bmatrix}1&1\end{bmatrix}, \boldsymbol{x}=\begin{bmatrix}\boldsymbol{x_1}\\\boldsymbol{x_2}\end{bmatrix} and \boldsymbol{b}=\boldsymbol{0}. By using the affine property, we can derive our new Gaussian distribution. Here we go.

(\boldsymbol{Ax}+\boldsymbol{b}) \sim \mathcal{N}(\boldsymbol{A\mu}+\boldsymbol{b},\boldsymbol{A\Sigma A}^T)\\\\  =\mathcal{N}(\begin{bmatrix}1&1\end{bmatrix}\boldsymbol{\mu}+\boldsymbol{0},\begin{bmatrix}1&1\end{bmatrix}\boldsymbol{\Sigma} \begin{bmatrix}1\\1\end{bmatrix})\\\\  =\mathcal{N}(\begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}\boldsymbol{\mu_{x_1}}\\\boldsymbol{\mu_{x_2}} \end{bmatrix},\begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}\boldsymbol{\Sigma_{x_1}}&0\\0&\boldsymbol{\Sigma_{x_2}}\end{bmatrix}\begin{bmatrix}1\\1\end{bmatrix})\\\\  =\mathcal{N}(\boldsymbol{\mu_{x_1}}+\boldsymbol{\mu_{x_2}},\, \boldsymbol{\Sigma_{x_1}}+\boldsymbol{\Sigma_{x_2}})

Here we assume that \boldsymbol{x_1} and \boldsymbol{x_2} are independent, so \boldsymbol{\Sigma} is block diagonal (the cross-covariances are 0). Thus, our new Gaussian distribution is \mathcal{N}_{y}(\boldsymbol{\mu_y},\boldsymbol{\Sigma_y}), where \boldsymbol{\mu_y}=\boldsymbol{\mu_{x_1}}+\boldsymbol{\mu_{x_2}} and \boldsymbol{\Sigma_y}=\boldsymbol{\Sigma_{x_1}}+\boldsymbol{\Sigma_{x_2}}. It makes sense, right? The new variable is the sum of two variables, and its mean and covariance are the sums of their means and covariances. This use of the affine property can then be extended to other combinations of Gaussian input data.
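We can also verify this result numerically with a quick Monte Carlo sketch (the parameters below are arbitrary examples): the empirical mean and covariance of the sampled sums should approach \boldsymbol{\mu_{x_1}}+\boldsymbol{\mu_{x_2}} and \boldsymbol{\Sigma_{x_1}}+\boldsymbol{\Sigma_{x_2}}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent 2-D Gaussians (example parameters)
mu1, Sigma1 = np.array([1.0, 2.0]), np.array([[1.0, 0.2], [0.2, 1.5]])
mu2, Sigma2 = np.array([-1.0, 0.5]), np.array([[0.5, 0.0], [0.0, 0.8]])

x1 = rng.multivariate_normal(mu1, Sigma1, size=100_000)
x2 = rng.multivariate_normal(mu2, Sigma2, size=100_000)
y = x1 + x2

print(y.mean(axis=0))           # ~ mu1 + mu2 = [0.0, 2.5]
print(np.cov(y, rowvar=False))  # ~ Sigma1 + Sigma2
```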

 

2.2 Marginal Gaussian distribution

Other important properties of Gaussians are that if two sets of variables are jointly Gaussian, then the marginal distribution of one set is also Gaussian, and the conditional distribution of one set given the other is also Gaussian. The marginal distribution is given below.

\boxed{P(x_a)=\int P(x_a, x_b)\,dx_b}

The intuition is that if we have a joint probability P(x_a, x_b), we can marginalize out (remove) one of its variables. For example, we marginalize out x_b to get P(x_a). Given that P(x_a, x_b) is Gaussian, the marginal distribution P(x_a) is also Gaussian, with mean/expected value and covariance matrix as follows.

E[x_a]=\mu_a\\\\  cov[x_a]=\Sigma_{aa}=(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})^{-1}

where \Lambda = \Sigma^{-1} is called the precision matrix.
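A small NumPy sketch (using an arbitrary 4-dimensional joint covariance as an example) verifies that the precision-matrix expression above indeed recovers the block \Sigma_{aa}:

```python
import numpy as np

# Joint covariance over (x_a, x_b); x_a is the first 2 dimensions (example values)
Sigma = np.array([[2.0, 0.5, 0.3, 0.1],
                  [0.5, 1.5, 0.2, 0.4],
                  [0.3, 0.2, 1.0, 0.2],
                  [0.1, 0.4, 0.2, 1.2]])
a, b = slice(0, 2), slice(2, 4)

Lam = np.linalg.inv(Sigma)  # precision matrix Lambda = Sigma^{-1}

# (Lambda_aa - Lambda_ab Lambda_bb^{-1} Lambda_ba)^{-1} should equal Sigma_aa
Sigma_aa = np.linalg.inv(Lam[a, a] - Lam[a, b] @ np.linalg.inv(Lam[b, b]) @ Lam[b, a])
print(np.allclose(Sigma_aa, Sigma[a, a]))  # True
```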

 

2.3 Conditional Gaussian distribution

Given two sets of jointly Gaussian variables x_a and x_b with means \boldsymbol{\mu_a}, \boldsymbol{\mu_b} and covariance blocks \boldsymbol{\Sigma_{aa}}, \boldsymbol{\Sigma_{ab}}, \boldsymbol{\Sigma_{ba}}, \boldsymbol{\Sigma_{bb}}, the conditional probability P(x_a|x_b) is as follows.

\boxed{P(x_a|x_b) = \mathcal{N}(x_a|\boldsymbol{\mu_{a|b},\Sigma_{a|b}})}

where \mu_{a|b}=\mu_a+\Sigma_{ab}\Sigma_{bb}^{-1}(x_b-\mu_b) and \Sigma_{a|b}=\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}.
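These two formulas translate directly into a short NumPy sketch. The 3-dimensional joint below is an arbitrary example; the index lists a and b select the partition.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, x_b, a, b):
    """Mean and covariance of P(x_a | x_b) for a joint Gaussian N(mu, Sigma)."""
    gain = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])
    mu_cond = mu[a] + gain @ (x_b - mu[b])        # mu_a + S_ab S_bb^{-1} (x_b - mu_b)
    Sigma_cond = Sigma[np.ix_(a, a)] - gain @ Sigma[np.ix_(b, a)]  # S_aa - S_ab S_bb^{-1} S_ba
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.5, 0.3],
                  [0.2, 0.3, 2.0]])

# Condition the first two dimensions on the third being observed as 0.5
mu_cond, Sigma_cond = conditional_gaussian(mu, Sigma, x_b=np.array([0.5]), a=[0, 1], b=[2])
print(mu_cond, Sigma_cond)
```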

 

(3) Gaussian Mixture Model

In reality, we sometimes cannot model a probability distribution using a single Gaussian. See the picture below.

[Figure: green data points fit by a single Gaussian (left) vs. a mixture of Gaussians (right). Source: Bishop textbook “Pattern Recognition and Machine Learning”]

In the left picture, the green dots are an example of data that cannot be modeled well by a single Gaussian distribution, because the data has two modes. To model the green dots’ distribution, we can instead use a linear combination of Gaussian distributions (right picture), which fits much better than a single Gaussian. This combination of Gaussians is usually called a GMM (Gaussian Mixture Model). For a clearer understanding, see the picture below, showing a Gaussian mixture distribution in one dimension formed by three Gaussians.

[Figure: a one-dimensional Gaussian mixture (red) formed from three Gaussian components (blue). Source: Bishop textbook “Pattern Recognition and Machine Learning”]

We can form the new distribution, shown by the red line above, by combining the three Gaussian distributions shown by the blue lines. A Gaussian mixture distribution can be formed using the equation below.

\boxed{P(x)=\sum_{k=1}^{K}\pi_k \mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu_k},\boldsymbol{\Sigma_k})}

Each Gaussian density \mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu_k},\boldsymbol{\Sigma_k}) is called a component of the mixture and has its own mean \boldsymbol{\mu_k} and covariance \boldsymbol{\Sigma_k}. The parameters \pi_k are called mixing coefficients; they sum to 1 so that the mixture P(x) integrates to 1.
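As a minimal one-dimensional sketch of this equation (the three components below are arbitrary examples, loosely mimicking the figure above), the mixture density is a weighted sum of scipy.stats.norm densities, and it integrates to 1 because the mixing coefficients do:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Three 1-D components (example values)
pis    = np.array([0.3, 0.5, 0.2])   # mixing coefficients pi_k, sum to 1
mus    = np.array([-2.0, 0.0, 3.0])  # component means mu_k
sigmas = np.array([0.8, 1.0, 0.6])   # component standard deviations

def gmm_pdf(x):
    """Mixture density P(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(pi * norm.pdf(x, loc=mu, scale=s)
               for pi, mu, s in zip(pis, mus, sigmas))

total, _ = quad(gmm_pdf, -np.inf, np.inf)
print(total)  # ~ 1.0, since the pi_k sum to 1
```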
