Principal Component Analysis (PCA) is a commonly used technique that exploits the correlation structure of the original variables to reduce the dimensionality of the data. This reduction is achieved by retaining only the first few principal components for subsequent analysis. The usual inclusion criterion requires the retained principal components to account for a proportion of the total variance exceeding a predetermined threshold. We show that in certain classification problems, even an extremely high inclusion threshold can negatively impact classification accuracy: omitting small-variance principal components can severely diminish the performance of the models. We noticed this phenomenon in classification analyses of high-dimensional ECG data, where the most common classification methods lost between 1% and 6% of accuracy even with a 99% inclusion threshold. However, as our numerical example shows, this issue can arise even in low-dimensional data with a simple correlation structure. We conclude that the exclusion of any principal components should be carefully investigated.

Principal Component Analysis (PCA) (

The ECG waveform and segments in lead II presenting a normal cardiac cycle.

A single-heartbeat ECG represented by 100%, 20%, 40%, and 60% of the principal components, respectively.

Accuracy* comparison between classification models using original variables and principal components**.

Model | Original variables | Principal components | Difference
Random Forest | 0.96 | 0.92 | –0.04
Conditional Random Forest | 0.96 | 0.90 | –0.06
Naive Bayes | 0.92 | 0.87 | –0.05
Multinomial Logistic Regression | 0.94 | 0.94 | –0.001
Quadratic Discriminant Analysis | 0.93 | 0.90 | –0.02

* Accuracy is the average of 10 stratified folds.

** Principal components accounting for 99% of the variance were used.
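The kind of accuracy loss reported in the table can be reproduced in spirit on synthetic data. The sketch below (hypothetical data and settings, not the authors' ECG study) evaluates a random forest with 10 stratified folds, with and without a PCA step retaining 99% of the variance; because the class signal sits in a low-variance direction, the PCA pipeline loses accuracy:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n = 1000
y = rng.integers(0, 2, n)
# Hypothetical data: the class signal lives in the lowest-variance direction.
X = np.column_stack([
    rng.normal(0, 20, n),             # high variance, no class signal
    rng.normal(0, 9, n),              # high variance, no class signal
    rng.normal(0, 0.5, n) + 1.5 * y,  # low variance, carries the class signal
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
acc_raw = cross_val_score(rf, X, y, cv=cv).mean()
# A float n_components keeps the fewest PCs explaining >= 99% of the variance.
pca_rf = make_pipeline(PCA(n_components=0.99, svd_solver="full"), rf)
acc_pca = cross_val_score(pca_rf, X, y, cv=cv).mean()
print(round(acc_raw, 2), round(acc_pca, 2))
```

Here the 99% threshold keeps only the two noise directions, so the PCA pipeline falls to near-chance accuracy while the model on the original variables does well.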

Here is a mathematical description of data scenarios where this phenomenon can occur. Let Σ be the covariance matrix of the original variables X_1, X_2, …, X_p, with eigenvalue–eigenvector pairs (λ_1, e_1), (λ_2, e_2), …, (λ_p, e_p), where λ_1 ≥ λ_2 ≥ … ≥ λ_p ≥ 0.
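In code, these eigenvalue–eigenvector pairs can be obtained directly from a covariance matrix; a minimal numpy sketch with a hypothetical 4×4 matrix (not taken from the paper's data):

```python
import numpy as np

# A hypothetical positive definite covariance matrix, for illustration only.
Sigma = np.array([
    [400.0,  20.0,   5.0,  1.0],
    [ 20.0,  80.0,   4.0,  0.5],
    [  5.0,   4.0,  40.0,  0.2],
    [  1.0,   0.5,   0.2,  3.0],
])

# eigh is the right choice for symmetric matrices; it returns ascending eigenvalues.
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]                   # reorder so lambda_1 >= lambda_2 >= ...
pairs = [(evals[i], evecs[:, i]) for i in order]  # (lambda_j, e_j) pairs

lambdas = np.array([lam for lam, _ in pairs])
print(lambdas)  # eigenvalues in descending order
```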

Now assume that we have a classification problem with two groups. Let Y_i, i = 1, …, n, denote the binary class label and x_i = (x_{i1}, x_{i2}, …, x_{ip}) the vector of predictor values for subject i.

where β_0, β_1, …, β_p are the coefficients of the assumed model.

Therefore, in its classical dimensionality reduction implementation, PCA might not be useful for certain classification problems. In particular, in classification problems with complex patterns, the lower-ranked PCs are the ones that carry the information about group differences, since the first several PCs reflect the correlation structure of the complex mean pattern and do not contain enough information about subtle group differences. Thus, if PCA is employed, we recommend that the PC inclusion thresholds be carefully considered and based not only on the proportion of explained variance but also on the magnitude of the variance of the excluded PCs and the power to detect an effect size of a certain magnitude given the sample size (n). Consider the principal component Z_{s+1} = (Z_{1,s+1}, Z_{2,s+1}, …, Z_{n,s+1}) (with variance λ_{s+1}) as a candidate for inclusion in subsequent analysis, where the first s PCs already account for the threshold proportion of the variance.

where z_{1–α/2} is the (1 – α/2)·100th percentile of the standard normal distribution, λ_{s+1} is the variance of Z_{s+1}, and Φ is the cumulative distribution function of the standard normal distribution.
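The selection step can be sketched numerically. Using the eigenvalues from the numerical example later in the paper, the following computes the number s of PCs retained at a 99% threshold and the variance λ_{s+1} of the first excluded PC (an illustrative sketch, not the authors' code):

```python
import numpy as np

# Eigenvalues from the paper's numerical example, sorted in descending order.
lam = np.array([419.3, 75.8, 40.8, 3.1])
threshold = 0.99

cum = np.cumsum(lam) / lam.sum()
s = int(np.searchsorted(cum, threshold) + 1)  # smallest s whose cumulative share >= threshold
lam_next = lam[s] if s < len(lam) else None   # variance of the first excluded PC

print(s, lam_next)
```

Even at a 99% threshold the last PC (variance 3.1) is excluded, and it is exactly this kind of small-variance component whose effect on classification should be checked.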

The ANOVA decomposition of the total sum of squares yields,

where

Letting

Without loss of generality we can assume that the overall mean is zero and denote the group means by μ_1 and μ_2. Then, from the condition that the overall mean is zero and (5) we deduce that μ_2 = −(n_1/n_2)μ_1,

which is always positive. Clearly, from (3) we get,

This result reveals that a principal component with arbitrarily small variance can have a statistically significant effect on classification, producing a subsequent improvement in the area under the ROC curve, and should not be disregarded without further investigation.

We highlight the results through a numerical example. The following positive definite covariance matrix,

has eigenvalues 419.3, 75.8, 40.8, and 3.1, and the first two PCs account for 91.9% of the total variance. The usual dimensionality reduction approach will use the first two PCs for further analysis and disregard the last two. Let the true model for the binary class assignment be given by logit(p_i) = β_1 Z_{i3}, so that the class depends only on the third PC. For effect sizes β_1 = log(2)/8, log(2)/4, log(2)/2, log(2), and 2, we compare the model based on the first two PCs with the model based on all PCs.

Areas under the ROC curve* for both models.

Effect size β_1 | First two PCs | All PCs
log(2)/8 | 0.53 | 0.64
log(2)/4 | 0.53 | 0.76
log(2)/2 | 0.54 | 0.88
log(2) | 0.54 | 0.95
2 | 0.55 | 0.99

* Empirically estimated via 10,000 datasets.
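The pattern in the table can be approximated with a short simulation. Since the covariance is diagonal in PC coordinates, we simulate the PCs directly from the example's eigenvalues and compare oracle scores: β·Z_3 stands in for the fitted all-PC model, and Z_1 for the model restricted to the first two PCs, which carries no class signal. This is a simplified sketch, not a re-run of the paper's 10,000-dataset simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
lam = np.array([419.3, 75.8, 40.8, 3.1])  # eigenvalues from the example
beta = np.log(2)                          # one of the effect sizes in the table

# In PC coordinates the covariance is diagonal, so draw the PCs independently.
Z = rng.normal(0.0, np.sqrt(lam), size=(n, 4))
p = 1.0 / (1.0 + np.exp(-beta * Z[:, 2]))  # logit(p_i) = beta * Z_i3
y = rng.uniform(size=n) < p

def auc(score, y):
    """AUC via the rank (Mann-Whitney) statistic."""
    r = np.argsort(np.argsort(score)) + 1  # ranks 1..n (ties negligible here)
    n1, n0 = y.sum(), (~y).sum()
    return (r[y].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

auc_all = auc(beta * Z[:, 2], y)  # oracle score using the informative third PC
auc_two = auc(Z[:, 0], y)         # best the first two PCs can offer: no signal
print(round(auc_all, 2), round(auc_two, 2))
```

For β_1 = log(2) the informative score reaches an AUC near the table's all-PC value, while any score built from the first two PCs stays at chance level.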

It is clear that even the two smallest effect sizes of log(2)/8 and log(2)/4 yield a marked improvement in AUC when all PCs are used.

In this work we show a potential performance problem of classification algorithms carried out after a preliminary dimensionality reduction step via PCA. These scenarios can occur even in simple, low-dimensional data, as our numerical example reveals. However, the issue can regularly arise with higher-dimensional data that possess complex patterns and multiple groups. In such cases, the main PCs capture the covariance pattern of the combined data while the lower-ranked PCs capture the information about group differences and are therefore vital for classification accuracy. Our results show that PCA with inclusion thresholds based on the proportion of total variance explained often decreases classification accuracy, even with extremely high inclusion thresholds. Thus, we suggest using all PCs in classification problems in order to avoid the omission of lower-ranked PCs that are important classification predictors. In such cases, the benefit of not using the original variables and switching to PCA might come from the fact that the PCs are uncorrelated, which can be advantageous in certain model-building algorithms.

We are grateful for the support of the Kay Family Foundation.

The authors have no competing interests to declare.