Principal Component Analysis

Wed 23 May 2018


PCA finds patterns that let us reduce the dimensionality of the dataset with minimal loss of information.

It finds the directions (components) that maximize the variance in the dataset, as opposed to MDA (Multiple Discriminant Analysis), which also finds directions, but ones that maximize class separation instead.
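
For intuition, here is a minimal sketch (a toy matrix and NumPy only; not taken from the references) showing that the first principal direction captures more variance than a raw coordinate axis:

from numpy import array, cov, var
from numpy.linalg import eig

X = array([[1, 2], [3, 4], [5, 6], [7, 9]])
Xc = X - X.mean(axis=0)                # center the data
values, vectors = eig(cov(Xc.T))       # eigendecomposition of the covariance matrix
pc1 = vectors[:, values.argmax()]      # direction of maximum variance
print(var(Xc.dot(pc1)))                # variance of the data projected onto the first component
print(var(Xc.dot(array([1.0, 0.0]))))  # variance along the x-axis alone (smaller)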

PCA

  1. Find the mean of each column and center the data by subtracting it.
  2. Create the covariance matrix (the covariance of each column with every other column).
  3. Calculate the eigendecomposition of the covariance matrix.
  4. The eigenvectors are the directions, or components, of the reduced subspace.
  5. The eigenvalues represent the magnitudes of those directions.
  6. Rank the eigenvectors from high to low by their corresponding eigenvalues and choose the top k.
  7. If all eigenvalues are of similar magnitude, the projection may not be effective, because the data is already compact.
  8. Eigenvalues close to zero represent components, or axes, that may be discarded.
  9. Project the original (centered) data onto the new subspace using the chosen eigenvectors (see the sketch after this list).
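
As a rough from-scratch sketch of these steps (NumPy only, using the same 3x2 toy matrix as the scikit-learn example that follows; the sign of the projected values can differ from scikit-learn's output, since eigenvector signs are arbitrary):

from numpy import array, mean, cov, argsort
from numpy.linalg import eig

A = array([[1, 2], [3, 4], [5, 6]])
# 1. column means; center the data
M = mean(A, axis=0)
C = A - M
# 2. covariance matrix of the centered columns
V = cov(C.T)
# 3. eigendecomposition of the covariance matrix
values, vectors = eig(V)
# 4-6. rank the eigenvectors by eigenvalue and keep the top k (here k = 1)
order = argsort(values)[::-1]
top = vectors[:, order[:1]]
# 9. project the centered data onto the new subspace
P = C.dot(top)
print('projected', P)

scikit-learn's PCA produces the equivalent projection (internally it uses an SVD rather than an explicit covariance matrix):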
from numpy import array
from sklearn.decomposition import PCA
# define a 3x2 data matrix (3 samples, 2 features)
A = array([[1, 2], [3, 4], [5, 6]])
# Number of components to keep = 1
pca = PCA(1)
pca.fit(A)
print('components', pca.components_)
print('variance', pca.explained_variance_)
B = pca.transform(A)
print('transformed', B)

components [[0.70710678 0.70710678]]
variance [8.]
transformed [[-2.82842712]
 [ 0.        ]
 [ 2.82842712]]
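
Continuing from the snippet above, the "minimal loss of information" can be checked by mapping the one-component projection back with inverse_transform; this toy data happens to lie exactly on a line, so the reconstruction matches the original matrix (up to floating-point error):

# map the 1-component projection back into the original 2-D space
A_hat = pca.inverse_transform(B)
print('reconstructed', A_hat)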

References
  1. https://machinelearningmastery.com/calculate-principal-component-analysis-scratch-python/
  2. https://sebastianraschka.com/Articles/2014_pca_step_by_step.html
  3. https://plot.ly/ipython-notebooks/principal-component-analysis/
  4. https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html