Previous: SOM, Up: Cluster


4.4 Principal Component Analysis

Principal Component Analysis (PCA) is a widely used technique for analyzing multivariate data. In PCA, each data vector is written as a linear combination of principal components. The number of principal components is equal to the number of dimensions of the data vectors.

The principal components are chosen such that they maximally explain the variance in the data vectors. For example, in the case of 3D data vectors, the data can be represented as an ellipsoidal cloud of points in three-dimensional space. The first principal component is the longest axis of the ellipsoid, the second principal component is the second longest axis, and the third principal component is the shortest axis. In other words, the principal components are ordered by the amount of variance they explain.

Each data point can be reconstructed by a suitable linear combination of the principal components. However, in order to reduce the dimensionality of the data, usually only the most important principal components are used. The remaining variance present in the data is then regarded as unexplained variance.

The principal components can be found by calculating the eigenvectors of the covariance matrix of the data. The corresponding eigenvalues determine how much of the variance present in the data is explained by each principal component.
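As a sketch of this eigenvector calculation (illustrative only; this is not the code used by Cluster, and the array names are made up), the covariance-matrix approach can be written in Python with NumPy:

```python
import numpy as np

# Toy data: 6 observations in 3 dimensions (rows = data vectors).
data = np.array([[2.5, 2.4, 0.5],
                 [0.5, 0.7, 1.9],
                 [2.2, 2.9, 0.8],
                 [1.9, 2.2, 0.9],
                 [3.1, 3.0, 0.4],
                 [2.3, 2.7, 0.7]])

# Center the data, then compute the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# The eigenvectors of the covariance matrix are the principal
# components; the eigenvalues measure the variance each one explains.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort by decreasing eigenvalue so the components are ordered by
# the amount of variance they explain.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance explained by each component.
explained = eigenvalues / eigenvalues.sum()
```

Keeping only the leading columns of eigenvectors and projecting the centered data onto them gives the reduced-dimensionality representation described above.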

In Cluster, the eigenvectors are found by calculating the singular value decomposition of the data matrix. In this version, the output is very simple and consists of two files: JobName_svv.txt, which contains the principal components, and JobName_svu.txt, which contains the loadings of each gene on the principal components. The data vector of each gene can be recovered by multiplying the loadings by the principal components and summing.
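The SVD-based calculation and the reconstruction from loadings and components can be sketched as follows. This illustrates the underlying mathematics only; it makes no claim about the exact contents or format of the JobName_svv.txt and JobName_svu.txt files, and the variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy expression matrix: 5 genes (rows) under 4 conditions (columns).
data = rng.normal(size=(5, 4))
centered = data - data.mean(axis=0)

# Singular value decomposition of the (centered) data matrix.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

components = Vt      # rows are the principal components
loadings = U * S     # loading of each gene on each component

# Multiplying the loadings by the principal components and summing
# recovers the data vector of each gene exactly.
reconstructed = loadings @ components

# Keeping only the first two components gives a lower-dimensional
# approximation of the data.
approx = loadings[:, :2] @ components[:2, :]
```

The full product loadings @ components equals the centered data matrix, while the truncated product is the best rank-2 approximation in the least-squares sense.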

A practical example of applying Principal Component Analysis to gene expression data is presented by Yeung and Ruzzo (2001).