11.12. PCA Data Classification Project

An important application of PCA is classification. Reducing the dimensions of the data makes it easier to see how the attributes of items of the same type are similar and how they differ from items of other types. The result of the analysis is often a plot showing the data in the PCA space.

In this assignment, we will look at a public domain dataset and produce a scatter plot showing how groups of the data are clustered in the PCA space. Refer to the example in PCA for Classification as a guide to see how the PCA algorithm is applied.

Iris Plant Dataset

This is a well-known and quite old dataset. It contains four attributes (sepal length, sepal width, petal length, and petal width) of three types of Iris flower plants. The data appeared in a paper by R.A. Fisher that was first published in 1936. Information about the dataset is found on the UCI Machine Learning Repository.

Use PCA to reduce the dimensionality of the data from four attributes to two principal components. Make a scatter plot of the samples in the two-dimensional PCA space.
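The reduction described above can be sketched with the SVD. This is only an illustrative outline, not the required solution: the function name pca_svd is hypothetical, the data is a synthetic stand-in rather than the contents of iris_data.csv, and it assumes each row is one sample and each column is one attribute.

```python
import numpy as np


def pca_svd(X, n_components=2):
    """Project the rows of X onto the leading principal components.

    Assumes X has one sample per row and one attribute per column.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # center each attribute at zero
    # Economy SVD: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T              # attributes x components
    scores = Xc @ W                      # samples in the PCA space
    variances = s**2 / (X.shape[0] - 1)  # variance along each direction
    return scores, W, variances


# Synthetic stand-in: 150 samples with 4 attributes (assumed layout).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4)) @ np.diag([3.0, 1.5, 0.5, 0.1])

scores, W, variances = pca_svd(X_demo)
print(scores.shape)  # (150, 2)
```

With the real data, the two columns of scores would be the x and y coordinates of the scatter plot, with one color or marker per Iris type.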

iris_data.csv

Submit both your Python script and a PNG image file of your scatter plot.

Note

Python functions exist that will do all of the PCA calculations for you. While it may be fine to use those functions in other contexts, it is not acceptable for this assignment. To receive points for the assignment, you must compute the PCA space from either eigenvectors or the SVD.
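For comparison, the eigenvector route might look like the following sketch. It is written under the same assumptions as any illustration here: the function name pca_eig is hypothetical, the data is synthetic, and rows are taken to be samples. The principal directions are the eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.

```python
import numpy as np


def pca_eig(X, n_components=2):
    """PCA from the eigenvectors of the covariance matrix.

    Assumes X has one sample per row and one attribute per column.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each attribute
    C = (Xc.T @ Xc) / (X.shape[0] - 1)       # sample covariance matrix
    # eigh is meant for symmetric matrices; it returns eigenvalues
    # in ascending order, so reverse them to put the largest first.
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    W = evecs[:, :n_components]              # leading principal directions
    return Xc @ W, W, evals


# Synthetic stand-in: 150 samples with 4 attributes (assumed layout).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4)) @ np.diag([2.0, 1.0, 0.5, 0.2])

scores_e, W_e, evals = pca_eig(X_demo)
print(scores_e.shape)  # (150, 2)
```

Both routes span the same PCA space. The sign of each principal direction is arbitrary, so a plot made from eigenvectors may appear mirrored along an axis relative to one made from the SVD.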