10. Principal Component Analysis (PCA)

As the name implies, Principal Component Analysis (PCA) accentuates the principal components of data so that the subtle but salient features of the data become more apparent. Like the SVD, PCA reduces the dimensionality of data, emphasizing the distinguishing characteristics while removing the uncorrelated noise. The PCA algorithm maps the data onto new axes called the PCA space. From the PCA space, the data may be reconstructed with reduced dimension, just as the rank-k sum of SVD outer products accomplishes. However, it is more common to use the PCA space data for classification applications. In the PCA space, similar items are grouped together and separated from clusters of dissimilar items.

Suppose that you wish to automate the classification of items into groups based on measurable features. PCA uses correlations between the features to find a linear combination of the features such that similar items cluster together in the reduced-dimension PCA space. The linear combinations come from the most significant eigenvectors of a covariance matrix of the data. When we multiply the sample data by the most significant eigenvectors (ranked by their eigenvalues), the differences between data sets become more obvious, allowing for source identification.

If you have n measured features from m sources (samples), then the data can be arranged in a matrix. Examples of potential data sets include the stock values from m companies taken over n days, opinions of m customers regarding n competing products, or the grades of m students in n courses. In the examples presented here, the matrix columns hold the measured features and each row holds the data taken from one source. With adjustments to the equations, the data may also be arranged with the features in rows and the samples in columns.

The covariance matrix is central to PCA because it reveals how similar and dissimilar (at variance) the sets of data are to each other. (See Covariance and Correlation)

The first step in the algorithm is to subtract the mean of each feature from the data so that each feature has a zero mean. This is needed to make a covariance matrix.

\mathbf{D} = \mat{\vertbar{}, \vertbar{}, {}, \vertbar{};
     (\bm{x}_1-\mu_1), (\bm{x}_2-\mu_2), \cdots, (\bm{x}_n-\mu_n);
     \vertbar{}, \vertbar{}, {}, \vertbar{}}
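
Here is a minimal sketch of the centering step using a small, made-up data matrix. The variable names are arbitrary; the only requirement is that the samples are in the rows and the features are in the columns.

% Center a small, made-up data matrix so that each feature has a zero mean.
X  = [2 10; 4 14; 6 18; 8 15];   % 4 samples (rows), 2 features (columns)
m  = size(X, 1);
mu = mean(X);                    % 1-by-2 row vector of column means
D  = X - repmat(mu, m, 1);       % zero-mean data matrix
disp(mean(D))                    % column means are now zero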

Recall from Covariance and Correlation that a covariance matrix is symmetric with the variance of each feature on the diagonal and the covariance between features in the off-diagonal positions.

\mathbf{Cov_x} = \frac{\mathbf{D}^T\,\mathbf{D}}{m - 1}

\mathbf{Cov_x} =
\mat{Var({x_1}), Cov({x_1, x_2}), \cdots, Cov({x_1, x_n});
     Cov({x_2, x_1}), Var({x_2}), \cdots, Cov({x_2, x_n});
     \vdots,        \vdots,        \ddots, \vdots;
     Cov({x_n, x_1}), Cov({x_n, x_2}), \cdots, Var({x_n})}
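
Continuing with made-up data, the covariance matrix is a single matrix product on the centered data. As a quick check, the result should match MATLAB's built-in cov function, which performs the same computation.

% Covariance matrix from the centered data; compare with the built-in cov.
X    = [2 10; 4 14; 6 18; 8 15];      % made-up 4-by-2 data
m    = size(X, 1);
D    = X - repmat(mean(X), m, 1);     % center each feature
covx = (D' * D) / (m - 1);            % n-by-n covariance matrix
disp(covx - cov(X))                   % differences are (near) zero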

Next, the eigenvectors, \mathbf{V}, and eigenvalues of the covariance matrix are found. The eigenvalues are used to sort the eigenvectors. All of the eigenvectors could be used, but because we wish to achieve dimensionality reduction, only the eigenvectors with the largest eigenvalues are kept. The PCA space is spanned by the k most significant eigenvectors. This subset of the eigenvectors gives us the mapping of the data onto the PCA space.

\mathbf{W} = \mat{\bm{v}_1, \cdots, \bm{v}_k}

Note

The computation performed to find the \bf{V} eigenvector matrix from the \bf{D} matrix is the same as finding the \bf{V} sub-matrix of the SVD ([~, ~, V] = svd(D)).
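
The short sketch below, again with made-up data, sorts the eigenvectors by eigenvalue, keeps the most significant one as \bf{W}, and confirms the note above by comparing the sorted eigenvectors to the \bf{V} matrix from svd(D). The columns may differ only by a sign.

% Sorted eigenvectors of the covariance matrix, compared with V from the SVD.
X = [2 10; 4 14; 6 18; 8 15];         % made-up 4-by-2 data
m = size(X, 1);
D = X - repmat(mean(X), m, 1);
covx = (D' * D) / (m - 1);
[V, Lam] = eig(covx);
[~, order] = sort(diag(Lam), 'descend');
V = V(:, order);                      % most significant eigenvector first
W = V(:, 1);                          % keep k = 1 column as the PCA space basis
[~, ~, Vsvd] = svd(D);
disp(abs(V) - abs(Vsvd))              % (near) zero -- columns agree up to sign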

The columns of \bf{W} give the linear combinations of the data features that project the data onto the PCA space. The projection is found by multiplying the data matrix by \bf{W}.

\mathbf{Y} = \mathbf{D\,W}.

For purposes of classification and object recognition, the data in the PCA space (\bf{Y}) is used to display or compare the data.

To reconstruct the original data with reduced dimensionality, we just reverse the steps: map the PCA space data back with \mathbf{W}^T and then add the means back to each feature.

\begin{array}{ll}
\mathbf{\tilde{X}} &= \mathbf{Y\,W}^T + \bm{\mu} \\
        &= \mathbf{(D\,W)\,W}^T + \bm{\mu}
  \end{array}
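
Putting the steps together, the projection and the reduced-dimension reconstruction are each a single matrix product. The sketch below mirrors the equations above with made-up data; the complete, plotted version appears in the PCA_demo.m listing in the next section.

% Project onto the PCA space, then reconstruct with reduced dimension.
X  = [2 10; 4 14; 6 18; 8 15];        % made-up data
m  = size(X, 1);
mu = mean(X);
D  = X - repmat(mu, m, 1);
[V, Lam] = eig((D' * D) / (m - 1));
[~, order] = sort(diag(Lam), 'descend');
W = V(:, order(1));                   % keep only the dominant eigenvector (k = 1)
Y = D * W;                            % coordinates in the PCA space
Xtilde = Y * W' + repmat(mu, m, 1);   % rank-1 approximation of the original data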

10.1. PCA for Data Analysis

To illustrate what PCA does with a simple plot, the following code shows a PCA example with only two features (columns of data). Noise was added to the data to show how dimensionality reduction separates the essence of the data from the uncorrelated noise. In this example, the data is reconstructed with reduced dimensionality. Notice the lines showing the principal and minor axes. The orientation of the principal axis is aligned with the correlation of the data, while the minor axis is aligned with the uncorrelated noise. Thus, as we also saw for the SVD in Dimensionality Reduction, the lower rank data is preferred to the full rank data that contains noise. Fig. 10.1 shows a plot demonstrating that the reconstructed data is aligned with the correlation between the columns of the original data.

Note

  • The MATLAB Statistics and Machine Learning Toolbox has a PCA function, but the direct calculation is only a few lines of code.

  • The PCA_demo example uses the repmat MATLAB function that has been used previously. The repmat function creates a matrix with copies of a scalar, vector, or matrix. In this case, the variable mu is a row vector, so repmat(mu, m, 1) makes a matrix with m rows, each holding one copy of the mu vector. This allows element-wise subtraction of the mean from each column. A short example follows this note.
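
As a quick illustration of the repmat behavior described above, and of the equivalent implicit expansion available in MATLAB R2016b and later:

% repmat tiles the 1-by-n mean vector into an m-by-n matrix.
mu = [3 2];                               % made-up row vector of feature means
disp(repmat(mu, 4, 1))                    % 4-by-2 matrix; every row is a copy of mu
% In R2016b and later, implicit expansion does the same subtraction without repmat.
X = [4 1; 2 3; 5 2; 1 2];
disp((X - repmat(mu, 4, 1)) - (X - mu))   % all zeros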

% File: PCA_demo.m
% PCA - 2D Example
n = 2;
m = 50;
% two columns random but correlated data
X = zeros(m, n);
X(:,1) = 5*randn(m, 1) - 2.5;
X(:,2) = -1 + 1.5 * X(:,1) + 5*randn(m,1);
X = sortrows(X,1);

%% find PCA from eigenvectors of covariance
mu = mean(X);
D = X - repmat(mu, m, 1);
covx = D'*D/(m - 1);      % sources in rows, features in columns
[V, S] = eig_decomp(covx);
% project onto PCA space
W = V(:,1);               % only use dominant eigenvector
Y = D * W;
disp('Eigenvalue ratios to total')
disp(S/trace(S))
% Data reconstruction
Xtilde = Y*W' + repmat(mu, m, 1);
% plots
figure, scatter(X(:,1), X(:,2)), title('PCA Demo');
hold on
plot([0 5*V(1,1)], [0 5*V(2,1)]);  % principal axis
plot([0 5*V(1,2)], [0 5* V(2,2)]); % minor axis
plot(Xtilde(:,1), Xtilde(:,2));    % principal component
legend({'data in', 'principal axis', 'minor axis', ...
    'reconstruction'}, 'Location', 'northwest');
hold off

function [eigvec,eigval]=eig_decomp(C)
    [eigvec,eigval]=eig(C);
    eigval=abs(diag(eigval)');
    [eigval, order]=sort(eigval, 'descend');
    eigval=diag(eigval);
    eigvec=eigvec(:, order);
end

Fig. 10.1 The principal component of the reconstructed data from PCA is aligned with the correlation between the columns of the original data.

10.2. PCA for Clustering

An important application of PCA is classification. Reducing the dimensions of the data makes it easier to see how the attributes of items of the same type are similar to each other and differ from those of other types. The result of the analysis is often a plot showing the data in the PCA space. In this example, we will look at a public domain dataset and produce a scatter plot showing how groups of the data are clustered in the PCA space.

The dataset can be downloaded from the Data World web site. It comes from the USDA National Nutrient Database and was made available by Craig Kelly. It lists many different foods and identifies food group, nutrition, and vitamin data for each food. We will use the nutrition information in the PCA analysis. The scatter plot in Fig. 10.2 shows how different food groups are similar to each other. For example, it shows that some fruits and most grains are similar to sweets, probably because they are also high in carbohydrates. Some of the food groups contained in the data are not plotted.

To see a list of all food groups in the data set, type the following after the FoodGroups variable is created by the script.

>> categories(FoodGroups)

Download the dataset and my MATLAB script.

nndb_flat.csv

PCA_food.m
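
If you would like to see the general flow of the analysis before opening PCA_food.m, the following is only a rough sketch of the same kind of computation. It is not the downloaded script, and the column names (FoodGroup and the nutrient columns) are assumptions about the CSV layout, so adjust them to match the actual file. The columns are also scaled by their standard deviations here so that no single nutrient dominates; the downloaded script may handle this differently.

% Rough sketch only -- not the PCA_food.m script. Column names are assumptions
% about the layout of nndb_flat.csv; adjust them to match the actual file.
T = readtable('nndb_flat.csv');
FoodGroups = categorical(T.FoodGroup);        % assumed food-group column
nutrients = T{:, {'Protein_g','Fat_g','Carb_g','Sugar_g','Fiber_g'}};  % assumed columns
m  = size(nutrients, 1);
mu = mean(nutrients);
D  = (nutrients - repmat(mu, m, 1)) ./ repmat(std(nutrients), m, 1);  % center and scale
[~, ~, V] = svd(D, 'econ');                   % V holds the sorted principal directions
W = V(:, 1:2);                                % first two principal components
Y = D * W;                                    % data in the PCA space
% Scatter plot of the PCA space data, one color per food group.
groups = categories(FoodGroups);
figure, hold on
for i = 1:numel(groups)
    idx = FoodGroups == groups{i};
    scatter(Y(idx,1), Y(idx,2), 'filled');
end
hold off
legend(groups, 'Location', 'bestoutside');
xlabel('First principal component'), ylabel('Second principal component');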


Fig. 10.2 Scatter plot of first two principal components from nutrition data for several foods from seven food groups.

10.3. PCA for Recognition

A fun example of how PCA can be used is face recognition. The eigenvectors found from PCA give us a matrix for multiplying images of people's faces. The resulting matrices, or vectors (depending on the implementation), make a more succinct representation of who is pictured in each image.

An example implementation is found on the MathWorks File Exchange web site: Two dimensional PCA for face recognition [ALSAQRE19].

Like the SVD, PCA reduces the dimensionality of the data from rank r to rank k. For purposes of comparing training and testing data, the row or column coefficients of the data beyond k may either be removed (making smaller matrices) or set to zero. Reducing the dimensionality is often desired to make the matrix more manageable.
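
To make the comparison step concrete, the sketch below illustrates the idea only; it is not the [ALSAQRE19] implementation. Each training image is assumed to be flattened into one row of a matrix, the rows are projected onto the PCA space, and a test image is matched to its nearest neighbor there.

% Illustration only -- not the [ALSAQRE19] code.
% Each row of Train is a flattened training image; x is a test image.
Train = rand(20, 100);                    % 20 made-up images, 100 pixels each
x = Train(7, :) + 0.01*randn(1, 100);     % noisy copy of image 7 as the test image
m  = size(Train, 1);
mu = mean(Train);
D  = Train - repmat(mu, m, 1);
[~, ~, V] = svd(D, 'econ');               % sorted principal directions of the training set
W = V(:, 1:10);                           % keep k = 10 principal components
Ytrain = D * W;                           % each row is one training image in the PCA space
ytest  = (x - mu) * W;                    % test image in the PCA space
dist = sqrt(sum((Ytrain - repmat(ytest, m, 1)).^2, 2));  % distance to each training image
[~, match] = min(dist);
fprintf('Test image matches training image %d\n', match);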