What is PCA?
PCA is a dimensionality-reduction technique for large data sets: it converts a large collection of variables into a smaller one that still retains most of the information in the original set.
Naturally, reducing the number of variables in a data set comes at some cost in accuracy; the trade-off in dimensionality reduction is to exchange a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster without having to process superfluous variables.
- PCA’s goal is to decrease the number of variables in a data set while retaining as much information as possible.
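As an informal illustration of that goal, here is a minimal sketch of reducing a data set to two principal components. It assumes scikit-learn is available and uses a hypothetical random NumPy array in place of a real data set:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data set: 100 samples described by 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Keep only the 2 components that retain the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```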
Standardize
This step normalizes the range of the continuous initial variables so that each of them contributes equally to the analysis.
Standardization matters before PCA because PCA is sensitive to the variances of the initial variables. If the ranges of the starting variables differ significantly, the variables with wider ranges will dominate over those with smaller ranges, producing biased results. Converting the data to comparable scales avoids this issue.
Mathematically, this is done by subtracting the mean and dividing by the standard deviation for each value of each variable: z = (value - mean) / standard deviation.
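A minimal sketch of this step, assuming the data is held in a NumPy array `X` with one variable per column:

```python
import numpy as np

def standardize(X):
    """Center each column at 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Example: variables with very different ranges end up on a comparable scale.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
X_std = standardize(X)
print(X_std.mean(axis=0))  # ~0 for every column
print(X_std.std(axis=0))   # ~1 for every column
```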
Covariance
The goal of this step is to understand how the variables in the input data set vary from the mean with respect to each other, in other words, to see whether there is any relationship between them. Variables can be so highly correlated that they contain redundant information, so we compute the covariance matrix to identify these correlations.
What can we learn about the correlations between the variables from the covariances that we have as matrix entries?
- It’s the sign of the covariance that’s important:
- If the covariance is positive, the two variables increase or decrease together; if it is negative, one increases while the other decreases.
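Continuing the sketch above, the covariance matrix can be computed from the standardized data with NumPy (note `rowvar=False`, since each variable is a column here):

```python
import numpy as np

# X_std: standardized data from the previous step (hypothetical toy values,
# variables in columns).
X_std = np.array([[-1.22, -1.10],
                  [ 0.00,  0.05],
                  [ 1.22,  1.05]])

# Covariance matrix: entry (i, j) is the covariance between variables i and j.
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix)
# A positive off-diagonal entry means the two variables increase or decrease together;
# a negative one means that when one increases the other decreases.
```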
Compute eigenvectors and eigenvalues
Eigenvectors and eigenvalues always come in pairs: each eigenvector has a corresponding eigenvalue, and their number equals the number of dimensions of the data. A three-dimensional data set has three variables, hence three eigenvectors with three associated eigenvalues.
Eigenvectors and eigenvalues are behind all the magic because the eigenvectors of the covariance matrix are the directions of the axes with the largest variance, which we call Principal Components, and the eigenvalues are simply the coefficients attached to the eigenvectors, giving the amount of variance carried in each Principal Component.
By ranking the eigenvectors in order of their eigenvalues, from highest to lowest, you obtain the principal components in order of significance.
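A minimal sketch of this step, assuming `cov_matrix` holds the covariance matrix from the previous step (here a hypothetical symmetric 3x3 example). It uses `numpy.linalg.eigh`, which is suited to symmetric matrices:

```python
import numpy as np

# Hypothetical covariance matrix of a 3-variable data set.
cov_matrix = np.array([[1.0, 0.8, 0.2],
                       [0.8, 1.0, 0.1],
                       [0.2, 0.1, 1.0]])

# eigh returns eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort from highest to lowest eigenvalue: the first eigenvector is then the
# direction of largest variance, i.e. the first principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)                      # variance carried by each component
print(eigenvalues / eigenvalues.sum())  # proportion of total variance
```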
Feature vector
In this step, we decide whether to keep all of the components or to discard the less significant ones (those with low eigenvalues), and combine the remaining ones into a matrix of vectors known as the feature vector.
So, the feature vector is simply a matrix whose columns are the eigenvectors of the components we decide to keep.
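Following on from the sorted eigenvectors above, building the feature vector amounts to keeping the first k columns (hypothetical values shown for illustration):

```python
import numpy as np

# Eigenvectors as columns, already sorted by decreasing eigenvalue (see previous step).
eigenvectors = np.array([[ 0.70, -0.05,  0.71],
                         [ 0.69, -0.20, -0.69],
                         [ 0.18,  0.98, -0.05]])

# Keep only the first k principal components and drop the rest.
k = 2
feature_vector = eigenvectors[:, :k]
print(feature_vector.shape)  # (3, 2): 3 original variables, 2 components kept
```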
Recast the data
Apart from standardization, the preceding steps do not change the data; you simply select the principal components and build the feature vector, while the input data set remains expressed in terms of the original axes.
The goal of this final step is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components.
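A minimal sketch of this final projection, assuming `X_std` is the standardized data and `feature_vector` the matrix of kept eigenvectors from the previous steps (hypothetical toy values):

```python
import numpy as np

# Standardized data: 4 samples, 3 variables (hypothetical toy values).
X_std = np.array([[ 1.2, -0.3,  0.5],
                  [-0.7,  0.9, -1.1],
                  [ 0.1, -1.2,  0.8],
                  [-0.6,  0.6, -0.2]])

# Feature vector: the 2 eigenvectors we chose to keep, as columns.
feature_vector = np.array([[ 0.70, -0.05],
                           [ 0.69, -0.20],
                           [ 0.18,  0.98]])

# Project the data onto the principal components;
# equivalently: (feature_vector.T @ X_std.T).T
X_pca = X_std @ feature_vector
print(X_pca.shape)  # (4, 2): each sample is now described by 2 principal components
```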
Wrap up
Although PCA in its basic form is a widely used and adaptable descriptive data analysis technique, it also has many variants that make it applicable to a wide range of situations and data types across several fields. PCA adaptations have been proposed for binary data, ordinal data, compositional data, discrete data, symbolic data, and data with special structure, such as time series or data sets with shared covariance matrices. Other statistical methods, such as linear regression (via principal component regression) and even the simultaneous clustering of both individuals and variables, rely heavily on PCA or PCA-related methodologies.
Although methods like linear discriminant analysis, correspondence analysis, and canonical correlation analysis are only loosely related to PCA, they share a common methodology in that they are based on factorial decompositions of particular matrices. The literature on PCA is extensive and covers a wide range of topics; due to space limits, this overview has only scratched the surface. New modifications, methodological findings, and applications continue to emerge.