10. Principal Component Analysis (PCA)
As the name implies, Principal Component Analysis (PCA) accentuates the
principal components of data so that the subtle but salient features of the data
become more apparent. Like the SVD, PCA reduces the dimensionality of the data,
emphasizing the distinguishing characteristics while removing the uncorrelated
noise. The PCA algorithm maps the data onto new axes called the PCA space.
From the PCA space, the data may be reconstructed with reduced dimension, as
the rank-$k$ sum of SVD outer products accomplishes. However, it is more
common to use the PCA space data for classification
applications. In the PCA space, similar items are grouped together and
separated from clusters of dissimilar items.
Suppose that you wish to automate the classification of items into groups based on measurable features. PCA can use correlations between the features to find a linear combination of the features such that similar items are clustered together in the reduced-dimension PCA space. The linear combinations of the features come from the most significant eigenvectors of a covariance matrix. Then, when we multiply the sample data by the most significant eigenvectors (determined by the eigenvalues), the differences between data sets become more obvious, allowing for source identification.
If you have $n$ measured features from $m$ sources (samples), then
the data can be arranged in an $m \times n$ matrix. Examples of potential data sets include
the stock values of $n$ companies taken over $m$ days, the opinions
of $m$ customers regarding $n$ competing products, or the grades of $m$
students in $n$ courses. In the examples presented here, the
matrix columns hold the measured features and each row holds the data taken
from one source. With adjustments to the equations, the data may instead be
transposed, with the features in rows and the samples in columns.
Use of a covariance matrix is significant to PCA as it reveals how similar and dissimilar (at variance) sets of data are to each other. (See Covariance and Correlation)
The first step in the algorithm is to subtract the mean of each feature from the data so that each feature has a zero mean. This is needed to make a covariance matrix.
Recall from Covariance and Correlation that a covariance matrix is symmetric with the variance of each feature on the diagonal and the covariance between features in the off-diagonal positions.
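Using the variable names of the PCA_demo.m code in the next section, where $X$ is the $m \times n$ data matrix, $\mu$ is the row vector of feature means, and $\mathbf{1}$ is an $m \times 1$ vector of ones, these two steps are

$$D = X - \mathbf{1}\,\mu, \qquad C = \frac{1}{m-1}\,D^{\mathsf{T}} D.$$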
Next, the eigenvectors, $W$, and eigenvalues of the covariance
matrix are found. The eigenvalues are used to sort the eigenvectors. All
$n$ eigenvectors could be used, but because we wish to achieve dimensionality
reduction, only the $k$ eigenvectors with the largest eigenvalues are used. The
PCA space is spanned by the $k$
most significant eigenvectors. This sub-set of the
eigenvectors gives us the mapping of the data onto the PCA space.
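In equation form, the eigen-decomposition and the selection of the $k$ dominant eigenvectors (the sorting performed by the eig_decomp helper function in the code below) are

$$C\,w_i = \lambda_i\,w_i, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n, \qquad W = \begin{bmatrix} w_1 & w_2 & \cdots & w_k \end{bmatrix}.$$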
Note
The computation performed to find the eigenvector matrix, $W$, from the $D$ matrix is the same as finding the $V$ matrix of the SVD ([~, ~, V] = svd(D)).
$W$ gives us the linear combinations of each data feature to
project the data onto the PCA space. The projection is found by multiplying
each mean-subtracted sample vector (a row of $D$) by the $W$ matrix.
For purposes of classification and object recognition, the data in the PCA
space ($Y$) is used to display or compare the data.
To reconstruct the original data with reduced dimensionality, we just reverse the steps.
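In matrix form, the projection and the reconstruction (matching the Y and Xtilde lines of the PCA_demo.m code in the next section) are

$$Y = D\,W, \qquad \tilde{X} = Y\,W^{\mathsf{T}} + \mathbf{1}\,\mu.$$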
Transposed Data
If the measured features are held in the rows with the samples in the columns, then adjusted equations are as follows.
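A sketch of the adjusted equations, assuming that the transposed data matrix $X$ is now $n \times m$, $\mu$ is the column vector of feature means, and $\mathbf{1}$ is an $m \times 1$ vector of ones:

$$D = X - \mu\,\mathbf{1}^{\mathsf{T}}, \qquad C = \frac{1}{m-1}\,D\,D^{\mathsf{T}}, \qquad Y = W^{\mathsf{T}} D, \qquad \tilde{X} = W\,Y + \mu\,\mathbf{1}^{\mathsf{T}}.$$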
10.1. PCA for Data Analysis
To illustrate what PCA does with a simple plot, the following code shows a PCA example with only two data sets. Noise was added to the data to show how dimensionality reduction separates the essence of the data from the uncorrelated noise. In this example, the data was reconstructed with reduced dimensionality. Notice the lines showing the principal and minor axes. The orientation of the principal axis is aligned with the correlation of the data, while the minor axis is aligned with the uncorrelated noise. Thus, as we also saw for the SVD in Dimensionality Reduction, the lower-rank data is preferred to the full-rank data that contains noise. Fig. 10.1 shows a plot demonstrating that the reconstructed data is aligned with the correlation between the columns of the original data.
Note
- The MATLAB Statistics and Machine Learning Toolbox has a PCA function, but the direct calculation is only a few lines of code.
- The PCA_demo example uses the repmat MATLAB function that has been previously used. The repmat function creates a matrix with copies of a scalar, vector, or matrix. In this case, the variable mu is a row vector, so repmat(mu, m, 1) makes a matrix with $m$ rows, each holding one copy of the mu vector. This allows element-wise subtraction of the mean from each column.
% File: PCA_demo.m
% PCA - 2D Example
n = 2;
m = 50;
% two columns random but correlated data
X = zeros(m, n);
X(:,1) = 5*randn(m, 1) - 2.5;
X(:,2) = -1 + 1.5 * X(:,1) + 5*randn(m,1);
X = sortrows(X,1);
%% find PCA from eigenvectors of covariance
mu = mean(X);
D = X - repmat(mu, m, 1);
covx = D'*D/(m - 1); % sources in rows, features in columns
[V, S] = eig_decomp(covx);
% project onto PCA space
W = V(:,1); % only use dominant eigenvector
Y = D * W;
disp('Eigenvalue ratios to total')
disp(S/trace(S))
% Data reconstruction
Xtilde = Y*W' + repmat(mu, m, 1);
% plots
figure, scatter(X(:,1), X(:,2)), title('PCA Demo');
hold on
plot([0 5*V(1,1)], [0 5*V(2,1)]); % principal axis
plot([0 5*V(1,2)], [0 5*V(2,2)]); % minor axis
plot(Xtilde(:,1), Xtilde(:,2)); % principal component
legend({'data in', 'principal axis', 'minor axis', ...
'reconstruction'}, 'Location', 'northwest');
hold off
function [eigvec, eigval] = eig_decomp(C)
% Eigen-decomposition with the eigenvalues sorted in descending order
    [eigvec, eigval] = eig(C);
    eigval = abs(diag(eigval)');               % eigenvalues as a row vector
    [eigval, order] = sort(eigval, 'descend');
    eigval = diag(eigval);                     % restore the diagonal matrix form
    eigvec = eigvec(:, order);                 % reorder eigenvectors to match
end

Fig. 10.1 The principal component of the reconstructed data from PCA is aligned with the correlation between the columns of the original data.
Is PCA the same as Least Squares Regression?
The plot in Fig. 10.1 makes it appear that the reconstructed data from PCA does the same thing as least squares regression; however, the two differ in two ways.
- In least squares regression, the error that is minimized is parallel to the $y$-axis; in PCA, the minimized error is orthogonal to the data projection line (the principal axis), as the sketch after this list illustrates.
- In least squares regression, the data fit is an extension of solving a system of linear equations for over-determined systems. In PCA, the data fit is a by-product of dimensionality reduction.
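To see the first difference numerically, here is a minimal sketch, assuming the X and V variables from the PCA_demo script above are still in the workspace. It compares the least squares slope found by polyfit with the slope of the PCA principal axis; the two slopes generally differ.

% Least squares line minimizes vertical (y-direction) error
p = polyfit(X(:,1), X(:,2), 1);        % p(1) is the fitted slope
% The PCA principal axis minimizes the error orthogonal to the axis
pca_slope = V(2,1)/V(1,1);             % slope of the dominant eigenvector
fprintf('least squares slope: %.3f, PCA axis slope: %.3f\n', p(1), pca_slope)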
10.2. PCA for Clustering
An important application of PCA is classification. Reducing the dimensions of the data makes it easier to see how the attributes of items of the same type are similar to each other and differ from those of other types. The result of the analysis is often a plot showing the data in the PCA space. In this example, we will look at a public domain dataset and produce a scatter plot showing how groups of the data are clustered in the PCA space.
The dataset can be downloaded from the Data World web site. It comes from the USDA National Nutrient Database and was made available by Craig Kelly. It lists many different foods and identifies food groups, nutrition, and vitamin data for each food. We will use the nutrition information in the PCA analysis. The scatter plot in Fig. 10.2 shows how different food groups are similar. For example, it shows that some fruits and most grains are similar to sweets, probably because they are also high in carbohydrates. Some of the food groups that are contained in the data are not plotted.
To see a list of all food groups in the data set, type the following
after the FoodGroups
variable is created by the script.
>> categories(FoodGroups)
Download the dataset and my MATLAB script.
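A rough sketch of such a script follows. This is not the downloaded script; the file name and the nutrition column names are assumptions for illustration, and only mean-centering is applied to the features.

% Sketch: PCA clustering of food nutrition data
% (the file name and column names below are assumptions)
T = readtable('food_nutrition.csv');          % hypothetical file name
FoodGroups = categorical(T.FoodGroup);        % hypothetical group-label column
X = [T.Protein_g T.Fat_g T.Carb_g T.Sugar_g T.Fiber_g];  % hypothetical features
m = size(X, 1);
mu = mean(X);
D = X - repmat(mu, m, 1);                     % zero-mean features
covx = D'*D/(m - 1);
[V, lambda] = eig(covx, 'vector');            % eigenvalues returned as a vector
[~, order] = sort(lambda, 'descend');
W = V(:, order(1:2));                         % two most significant eigenvectors
Y = D * W;                                    % data mapped into the PCA space
groups = categories(FoodGroups);
figure, hold on
for i = 1:numel(groups)
    idx = FoodGroups == groups{i};
    scatter(Y(idx,1), Y(idx,2), 15, 'filled');
end
legend(groups, 'Location', 'bestoutside')
title('Food groups in the PCA space')
hold off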
10.3. PCA for Recognition
A fun example of how PCA can be used is for face recognition. The eigenvectors found from PCA give us a matrix for multiplying images of people's faces. The resulting matrices, or vectors (depending on the implementation), make a more succinct representation of who is pictured in each image.
An example implementation is found on the MathWorks File Exchange web site: Two dimensional PCA for face recognition [ALSAQRE19].
Since, like the SVD, PCA reduces the dimensionality of the data from rank
$n$ to rank $k$, for purposes of comparing training and testing
data, the row or column coefficients of the data beyond $k$ may either be
removed (making smaller matrices) or set to zero. Reducing the
dimensionality is often desired to make the matrices more manageable.
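As a minimal sketch of that choice, assuming hypothetical matrices Ytrain and Ytest that hold PCA-space coefficients with one row per image:

% Hypothetical variables for illustration
k = 10;                                % assumed number of retained components
Ytrain_k = Ytrain(:, 1:k);             % smaller matrices: coefficients beyond k removed
Ytest_k  = Ytest(:, 1:k);
% or keep the original size but set the coefficients beyond k to zero
Ytest_z = Ytest;
Ytest_z(:, k+1:end) = 0;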
Additional reading
Here is a very good tutorial on PCA,
Principal_Component_Analysis_Tutorial.pdf,
by Alaa Tharwat, which
I found on the MathWorks File Exchange ([THARWAT16]). The tutorial shows both
the SVD and covariance PCA algorithms. I will use the covariance
algorithm. Figure 2 of the tutorial shows a diagram of the covariance
algorithm.
Homework
Now complete the PCA Data Classification homework.