Unsupervised Learning - Dimensionality Reduction and PCA

Dimensionality reduction is mainly used for data compression and data visualization.

Data Compression

Sometimes a dataset has many features, and some of them are highly correlated (e.g. a length in inches and the same length in centimeters), which causes redundancy. To save memory and disk space, and to speed up training in supervised learning, we can compress the data using dimensionality reduction.

Data Visualization

A dataset may also have too many features to plot directly. Dimensionality reduction lets us reduce the data to 2D or 3D, so we can plot it and inspect the relationships between data points to understand the data better.

Principal component analysis (PCA)

PCA is a very popular algorithm for dimensionality reduction. It tries to find a lower-dimensional space (e.g. a line or a surface) onto which to project the data, so as to minimize the squared projection error (the squared distance between each point and the location it gets projected to).

More generally, to reduce from $n$ dimensions to $k$ dimensions: find $k$ vectors (each an $n$-dimensional vector) \({ u }^{ (1) },\ { u }^{ (2) },\ ...,\ { u }^{ (k) }\) onto which to project the data, so as to minimize the projection error.

So each data example \({ x }^{ (i) }\in\mathbb{R}^{ n }\) is transformed (reduced) to \({ z }^{ (i) }\in\mathbb{R}^{ k }\).
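As a quick illustration (a minimal sketch, assuming scikit-learn and NumPy are available; the data here is synthetic and the variable names are made up), this is what the reduction from \(\mathbb{R}^{ n }\) to \(\mathbb{R}^{ k }\) looks like with an off-the-shelf PCA implementation:

```python
# End-to-end illustration with scikit-learn's PCA on synthetic data:
# each row of X is an example x^(i) in R^n, each row of Z is z^(i) in R^k.
import numpy as np
from sklearn.decomposition import PCA

m, n, k = 100, 10, 3
X = np.random.randn(m, n)

pca = PCA(n_components=k)
Z = pca.fit_transform(X)               # PCA centers the data internally

print(X.shape, "->", Z.shape)          # (100, 10) -> (100, 3)
```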

Data preprocessing

Assume the training set has $m$ examples. For each feature $j$, we compute its mean ${\mu }_{ j }=\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { x }_{ j }^{ (i) } }$ and replace each value \({ x }_{ j }^{ (i) }\) with \({ x }_{ j }^{ (i) }-{ \mu }_{ j }\) (mean normalization).

If the features are on very different scales, we also scale each feature, replacing \({ x }_{ j }^{ (i) }-{ \mu }_{ j }\) with \(\frac { { x }_{ j }^{ (i) }-{ \mu }_{ j } }{ { s }_{ j } }\), where \({ s }_{ j }\) can be the range (max \(-\) min) of the feature or its standard deviation.
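A minimal NumPy sketch of this preprocessing step (the function name and the choice of standard deviation for \({ s }_{ j }\) are just for illustration):

```python
# Mean normalization and (optional) feature scaling for an (m, n) data matrix X,
# one example per row; returns the normalized data and the parameters used.
import numpy as np

def preprocess(X, scale=True):
    mu = X.mean(axis=0)                # mu_j: mean of feature j
    X_norm = X - mu                    # replace x_j^(i) with x_j^(i) - mu_j
    if scale:
        s = X.std(axis=0)              # s_j: standard deviation (max - min also works)
        s[s == 0] = 1.0                # guard against constant features
        X_norm = X_norm / s
    else:
        s = np.ones(X.shape[1])
    return X_norm, mu, s
```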

Apply PCA

We want to reduce the data from \(n\) dimensions to \(k\) dimensions.

First, we compute the covariance matrix \(\Sigma\):

$$\Sigma =\frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { x }^{ (i) } } { ({ x }^{ (i) }) }^{ T }$$
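As a sketch, the sum above can be computed with an explicit loop over examples (assuming the preprocessed data is stored in a NumPy array X_norm with one example per row):

```python
# Covariance matrix as an explicit sum over examples, following the formula above.
import numpy as np

def covariance_loop(X_norm):
    m, n = X_norm.shape
    Sigma = np.zeros((n, n))
    for i in range(m):
        x = X_norm[i].reshape(n, 1)    # x^(i) as an (n, 1) column vector
        Sigma += x @ x.T               # accumulate x^(i) (x^(i))^T
    return Sigma / m
```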

The covariance matrix \(\Sigma\) has size \(n\times n\), since each \({ x }^{ (i) }\) has size \(n\times 1\). Computing the sum example by example like this is time-consuming, so we can vectorize it. Let the dataset \(X\) be:

$$X=\begin{bmatrix} \cdots & { { x }^{ (1) } }^{ T } & \cdots \\ & \vdots & \\ \cdots & { { x }^{ (m) } }^{ T } & \cdots \end{bmatrix}$$

$$\Sigma =\frac { 1 }{ m }{ X }^{ T }X$$
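The vectorized version is a one-liner (same assumptions as the loop sketch above):

```python
# Vectorized covariance matrix; equivalent to the loop version but much faster.
def covariance_vectorized(X_norm):
    m = X_norm.shape[0]
    return (X_norm.T @ X_norm) / m     # (n, m) @ (m, n) -> (n, n)
```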

Then, we compute "eigenvectors" of matrix \(\Sigma\) using SVD (singular value decomposition).

$$[U, S, V] = \mathrm{svd}(\Sigma)$$

We only need matrix \(U\): select its first \(k\) columns as the matrix \({ U }_{ reduce }\) (of size \(n\times k\)).

Finally, we project each example onto these columns to get the k-dimensional vector \({ z }^{ (i) }\in\mathbb{R}^{ k }\):

$${ z }^{ (i) }={ U }_{ reduce }^{ T }{ x }^{ (i) }$$
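A sketch of these two steps in NumPy (np.linalg.svd returns the \(U\), \(S\), \(V^{T}\) factors; the function name is illustrative):

```python
# Run the SVD on Sigma, keep the first k columns of U as U_reduce, and project
# a single preprocessed example x (shape (n,)) down to z (shape (k,)).
import numpy as np

def project_example(Sigma, x, k):
    U, S, Vt = np.linalg.svd(Sigma)    # U: (n, n), S: singular values, Vt: (n, n)
    U_reduce = U[:, :k]                # first k columns of U, shape (n, k)
    z = U_reduce.T @ x                 # z^(i) = U_reduce^T x^(i)
    return z, U_reduce, S
```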

For the new dataset \(Z\), the \(j\)-th column (feature) \({ z }_{ j }\) is obtained by projecting every example onto \({ u }^{ (j) }\):

$${ z }_{ j }=X{ u }^{ (j) }$$

or, for all columns at once, \(Z=X{ U }_{ reduce }\) (an \(m\times k\) matrix).
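A vectorized sketch of the same projection for the whole dataset:

```python
# Projecting the whole dataset at once: Z = X_norm @ U_reduce has shape (m, k),
# and its j-th column equals X_norm @ U_reduce[:, j], i.e. the formula above.
def project_dataset(X_norm, U_reduce):
    return X_norm @ U_reduce           # (m, n) @ (n, k) -> (m, k)
```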

This procedure actually minimizes the squared projection error; we do not give a mathematical proof here.

Reconstruction from compressed representation

Given a compressed example \({ z }^{ (i) }\in\mathbb{R}^{ k }\), we can approximately reconstruct the original n-dimensional example:

$${ x }_{ approx }^{ (i) }={ U }_{ reduce }{ z }^{ (i) }\approx { x }^{ (i) }$$
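A corresponding NumPy sketch (note the reconstruction approximates the normalized data, so the mean/scaling would still need to be undone to get back to the original units):

```python
# Approximate reconstruction: x_approx^(i) = U_reduce z^(i); vectorized over
# the whole compressed dataset Z of shape (m, k).
def reconstruct(Z, U_reduce):
    return Z @ U_reduce.T              # (m, k) @ (k, n) -> (m, n), approximates X_norm
```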

Choosing the number of principal components

If many features are highly correlated (similar), you can reduce the dimension a lot while still retaining most of the variance (e.g. 95% or 99%).

A typical choice of \(k\) is the smallest value such that 99% of the variance is retained, i.e.

$$\frac { \frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { \left\| { x }^{ (i) }-{ x }_{ approx }^{ (i) } \right\| }^{ 2 } } }{ \frac { 1 }{ m } \sum _{ i=1 }^{ m }{ { \left\| { x }^{ (i) } \right\| }^{ 2 } } } \le 0.01$$

We can test \(k\) from 1 upward until the above inequality holds, using either of the following two methods: run PCA for each candidate \(k\), compute \({ x }_{ approx }^{ (i) }\), and evaluate the ratio directly; or run the SVD once and use the diagonal entries of the matrix \(S\), checking

$$\frac { \sum _{ i=1 }^{ k }{ { S }_{ ii } } }{ \sum _{ i=1 }^{ n }{ { S }_{ ii } } } \ge 0.99$$

The second method is suggested since it only calculates the SVD once, which is much more efficient.
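A sketch of the second method in NumPy, using the singular values returned by the single SVD (the 0.99 threshold is just the typical choice discussed above):

```python
# Pick the smallest k that retains at least the requested fraction of variance,
# using only the singular values S from one SVD of Sigma.
import numpy as np

def choose_k(S, variance_retained=0.99):
    ratios = np.cumsum(S) / np.sum(S)          # retained variance for k = 1, ..., n
    return int(np.argmax(ratios >= variance_retained)) + 1
```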

Notes:

1. To compress data in order to speed up supervised learning, run PCA on the training data only. We obtain the \({ U }_{ reduce }\) matrix from the training data, and then use this same matrix to transform (map) the validation data and test data to the lower dimension.
2. PCA should not be used to prevent overfitting.
3. First try training on the original data; only if training is too slow should you consider using PCA. A sketch of note 1 is given below.
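The parameters learned from the training set (mean, scale, and \({ U }_{ reduce }\)) are reused unchanged for the validation and test sets (function and variable names are illustrative):

```python
# Reuse the training-set parameters (mu, s, U_reduce) to map new data; never
# re-fit PCA on the validation or test set.
def transform_new_data(X_new, mu, s, U_reduce):
    X_new_norm = (X_new - mu) / s      # normalize with the TRAINING mean and scale
    return X_new_norm @ U_reduce       # project onto the TRAINING principal components
```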
