Principal Component Analysis
Getting a quick intuition of what PCA is and how to use it
Intuition
Curse of dimensionality
A dataset consists of a set of samples or observations, each characterized by features (a.k.a. attributes) that describe it. For example, in a dataset about flowers, measurements like petal length and petal width could be the features. Typically, in a tabular dataset, the rows represent the samples, often identified by a unique ID, and the columns represent the features.
The number of features (columns) a dataset has for each observation (rows) is called its dimensionality.
If we have a flower dataset with three features/dimensions, it can be plotted in a 3D space, known as the feature space. Note that the feature space concept still applies to datasets with more than three dimensions.
The way the data points are arranged within this space, whether they are tightly or sparsely packed, is known as the data distribution (the distribution of the data in the feature space).
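To make these terms concrete, here is a minimal sketch in Python (the flower measurements and column names are made up purely for illustration) showing samples as rows, features as columns, and dimensionality as the number of columns:

```python
import pandas as pd

# Toy flower dataset: each row is a sample (with a unique ID),
# each column is a feature describing that sample.
flowers = pd.DataFrame(
    {
        "petal_length": [1.4, 4.7, 5.9],
        "petal_width": [0.2, 1.4, 2.1],
        "sepal_length": [5.1, 6.4, 7.1],
    },
    index=["flower_1", "flower_2", "flower_3"],
)

n_samples, n_features = flowers.shape
print(f"{n_samples} samples, dimensionality = {n_features}")
# With 3 features, each flower is a point in a 3D feature space.
```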
When preparing raw data to feed into a machine learning or deep learning model, a high number of dimensions relative to the number of unique samples can sometimes hinder the model's ability to learn meaningful patterns, leading to lower performance in tasks like regression or classification. This problem is known as the curse of dimensionality.
Therefore, it is the practitioner's task, during the data preprocessing stage, to find strategies that mitigate this issue. One approach is to reduce the number of dimensions, and one method that helps achieve this is Principal Component Analysis (PCA).
Principal Component Analysis (PCA)
PCA is an unsupervised machine learning algorithm commonly used by practitioners to reduce dimensionality.
The algorithm builds a new representation (features) of your data points (samples) by constructing new orthogonal axes, called Principal Components (PCs), and orienting them so that they capture the maximum variability of the original data distribution.
The algorithm also reports how much of the variability each PC captures; this helps you decide how many PCs (new dimensions) to keep in place of the original features.
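As a minimal sketch of what this looks like in practice, here is scikit-learn's PCA applied to a small synthetic array X (random, correlated data made up purely for illustration); explained_variance_ratio_ is where the algorithm reports the share of variability each PC captures:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 correlated features (illustrative only).
rng = np.random.default_rng(0)
X = (rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
     + 0.1 * rng.normal(size=(100, 5)))

pca = PCA()                   # keep all components for now
X_pcs = pca.fit_transform(X)  # samples re-expressed on the new orthogonal PC axes

# Fraction of the total variability captured by each PC, in decreasing order.
print(pca.explained_variance_ratio_)
```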
Let’s say you need to reduce dimensionality, and the first Principal Component (PC1) captures more than 80% of the variability. It could be worth experimenting to see whether PC1, used as the only feature representation of each data point, improves your model’s performance.
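One way to act on that, sketched below with the same kind of illustrative array X as above: inspect the cumulative explained variance, or pass a float to n_components so scikit-learn keeps just enough PCs to reach that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data, generated the same way as in the previous sketch.
rng = np.random.default_rng(0)
X = (rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
     + 0.1 * rng.normal(size=(100, 5)))

# Cumulative share of variability captured by the first k PCs.
cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
print(cumulative)  # inspect where it first exceeds 0.80

# Ask PCA directly for enough components to capture at least 80% of the variance.
pca_80 = PCA(n_components=0.8)
X_reduced = pca_80.fit_transform(X)
print(X_reduced.shape)  # (100, 1) if PC1 alone already exceeds 80%
```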