Principal Component Analysis

Wed 23 May 2018

PCA finds patterns in a dataset so that its dimensionality can be reduced with minimal loss of information.

It finds the directions (components) that maximize the variance in the dataset, as opposed to MDA (Multiple Discriminant Analysis), which also finds directions but chooses them to maximize class separation.
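The difference between the two goals can be seen on a small toy dataset. The sketch below is illustrative only: the data is made up, and a two-class Fisher linear discriminant is used as a simple stand-in for MDA. The names X0, X1, pca_direction and lda_direction are assumptions, not from the original notes.

```python
import numpy as np

# Toy data (made up for illustration): both classes are spread widely
# along x, but the classes are separated from each other along y.
X0 = np.array([[-3, 0.1], [-1, -0.1], [1, 0.1], [3, -0.1]])  # class 0
X1 = np.array([[-3, 1.1], [-1, 0.9], [1, 1.1], [3, 0.9]])    # class 1
X = np.vstack([X0, X1])

# PCA direction: eigenvector of the overall covariance matrix with the
# largest eigenvalue (class labels are ignored).
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
pca_direction = eigenvectors[:, np.argmax(eigenvalues)]

# Fisher discriminant direction for two classes: Sw^-1 (m1 - m0),
# where Sw is the within-class scatter matrix (class labels are used).
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
lda_direction = np.linalg.solve(Sw, m1 - m0)
lda_direction /= np.linalg.norm(lda_direction)

print('PCA direction', np.round(np.abs(pca_direction), 2))
print('discriminant direction', np.round(np.abs(lda_direction), 2))
```

On this data the PCA direction lies almost entirely along x (where the spread is largest), while the discriminant direction lies almost entirely along y (where the classes differ).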

PCA

Find the mean of each column and subtract it to center the data.
Create the covariance matrix (the covariance of each column with every other column).
Calculate the eigendecomposition of the covariance matrix.
Eigenvectors are the directions (components) of the reduced subspace.
Eigenvalues are the magnitudes of those directions, i.e. the variance of the data along each eigenvector.
Rank the eigenvectors from high to low by their corresponding eigenvalues and choose the top k eigenvectors.
If all eigenvalues are similar, projection might not be effective, because the variance is spread evenly and the data is already compressed.
If some eigenvalues are close to zero, the corresponding components (axes) carry little information and may be discarded.
Project the original data onto the new subspace using the selected eigenvectors.
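The steps above can be sketched by hand with NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally. The variable names (centered, cov, top_k, projected) are assumptions chosen for readability, and the sign of each eigenvector is arbitrary, so the projected values may come out negated.

```python
import numpy as np

# 3 samples (rows), 2 features (columns)
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# Step 1: mean of each column, then center the data
mean = A.mean(axis=0)
centered = A - mean

# Step 2: covariance matrix of the columns
cov = np.cov(centered, rowvar=False)

# Step 3: eigendecomposition (eigh is meant for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: rank eigenvectors by eigenvalue, high to low, and keep the top k
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
k = 1
top_k = eigenvectors[:, :k]

# Step 5: project the centered data onto the k selected directions
projected = centered @ top_k

print('eigenvalues', eigenvalues)
print('top component', top_k.ravel())
print('projected', projected.ravel())
```

Up to sign and floating-point rounding, the largest eigenvalue is 8 and the other is 0, so a single component keeps all of the variance in this small matrix.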

from numpy import array
from sklearn.decomposition import PCA
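# define a 3x2 matrix: 3 samples (rows), 2 features (columns)
A = array([[1, 2], [3, 4], [5, 6]])
# keep only the first principal component
pca = PCA(n_components=1)
# fit centers the data and finds the component directions
pca.fit(A)
print('components', pca.components_)
print('variance', pca.explained_variance_)
# project A onto the retained component
B = pca.transform(A)
print('transformed', B)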

components [[0.70710678 0.70710678]]
variance [8.]
transformed [[-2.82842712]
 [ 0.        ]
 [ 2.82842712]]
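The remarks about similar and near-zero eigenvalues suggest a simple way to pick k: keep the smallest number of components whose eigenvalues cover a chosen share of the total variance. The sketch below is one possible rule, not part of the original notes. The helper choose_k and the 0.95 default threshold are hypothetical.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues cover `threshold` of the total variance."""
    # Sort high to low so the largest directions are counted first
    values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    # Share of the total variance explained by each component
    ratios = values / values.sum()
    cumulative = np.cumsum(ratios)
    # First position where the cumulative share reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(values))

# One dominant eigenvalue and one at zero: a single component is enough
print(choose_k([8.0, 0.0]))
# All eigenvalues equal: nothing can be dropped at this threshold
print(choose_k([1.0, 1.0, 1.0, 1.0]))
```

For the matrix A used above, the eigenvalues are 8 and 0, so this rule keeps one component, which matches the choice of one component in the scikit-learn example.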
