On the Application of Principal Component Analysis to Classification Problems

Principal Component Analysis (PCA) is a commonly used technique that uses the correlation structure of the original variables to reduce the dimensionality of the data. This reduction is achieved by considering only the first few principal components for a subsequent analysis. The usual inclusion criterion requires that the proportion of the total variance explained by the retained principal components exceed a predetermined threshold. We show that in certain classification problems, even an extremely high inclusion threshold can negatively impact the classification accuracy. The omission of small-variance principal components can severely diminish the performance of the models. We noticed this phenomenon in classification analyses using high-dimensional ECG data, where the most common classification methods lost between 1 and 6% of accuracy even when using a 99% inclusion threshold. However, this issue can occur even in low-dimensional data with a simple correlation structure, as our numerical example shows. We conclude that the exclusion of any principal components should be carefully investigated.


INTRODUCTION
Principal Component Analysis (PCA) (Du et al., 2012; Hsieh et al., 2010; Mehmet Korürek, 2010; Kim et al., 2009) is a popular tool for data dimensionality reduction in the presence of a complex correlation structure among a large number of numerical variables. The correlations among the original variables can be used to create new summary variables, principal components (PCs), that are optimal, uncorrelated linear combinations of the original variables. The optimality lies in the fact that each PC has the maximum possible variance among all unit-length linear combinations of the original variables that are uncorrelated with the preceding PCs, and thus the PCs contain the maximum amount of information. The lack of correlation among the PCs removes the redundancy present in the original variables. The well-known lemma for maximization of quadratic forms for points on the unit sphere shows that the vectors of coefficients that define the PCs are the eigenvectors of the covariance matrix. The eigenvalues associated with the eigenvectors equal the variances of the PCs and define an order among all PCs. The ones with the largest variance are considered the main PCs and provide a scheme for dimensionality reduction: we take the first few PCs that jointly account for more than 80% or 90% of the total variance of the original variables. This approach makes intuitive sense, as the PCs associated with the smallest eigenvalues are almost constant and thus have limited classification capability. However, in certain problems dimensionality reduction via PCA, even with a high cutoff for exclusion, is not a good idea. We noticed this phenomenon when implementing an arrhythmia classification on ECG data, even though several studies have demonstrated the application of PCA in the same research area (Mittal, 2019b, 2018b; Gupta et al., 2020; Gupta and Mittal, 2018a, 2016, 2019a).
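The inclusion rule described above, which keeps the leading PCs until their cumulative share of the total variance reaches a chosen threshold, can be sketched in a few lines. This is a minimal illustration only: the function name `select_num_components` and the eigenvalues in the example are hypothetical and are not taken from the ECG data.

```python
def select_num_components(eigenvalues, threshold):
    """Return the smallest s such that the s largest eigenvalues
    account for at least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    cumulative = 0.0
    for s, lam in enumerate(ordered, start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return s
    return len(ordered)  # fallback: keep every component


if __name__ == "__main__":
    # Hypothetical eigenvalues, chosen only for illustration.
    example = [6.0, 2.5, 1.0, 0.45, 0.05]
    for m in (0.80, 0.90, 0.99):
        print(f"threshold {m:.2f}: keep {select_num_components(example, m)} PCs")
```

Note that the rule looks only at the eigenvalues. It never consults the class labels, which is why a retained set chosen this way may leave out the components that separate the groups.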
The ECG graph of a normal beat (shown in Figure 1) consists of a sequence of waves: a P-wave representing the atrial depolarization process, a QRS complex denoting the ventricular depolarization process, and a T-wave representing the ventricular repolarization. Our data consisted of 200 data points per heartbeat with a complex correlation structure that seemed ideal for a preliminary PCA dimensionality reduction step before a subsequent classification approach was employed. However, using PCA exclusion cutoffs of 90%, 92%, 95%, and 99% for the 200 PCs reduces the classification accuracy rate. The application of PCA to an ECG segment representing a single heartbeat is depicted in Figure 2. This example reveals that PCA may not be a good idea for certain types of classification problems. More detailed results that highlight this finding are shown in Table 1. We can see that the loss of classification accuracy for five common classification algorithms (random forest, conditional random forest, naive Bayes, multinomial logistic regression, and quadratic discriminant analysis), when using principal components accounting for 99% of the total variance instead of the original ECG data, was between 0.001 and 0.06. In the subsequent presentation we show that omission of even the lowest ranked PCs can be disadvantageous to the classification accuracy of the algorithm.

Figure 1 The ECG waveform and segments in lead II that present a normal cardiac cycle.

Zheng and Rakovski Data Science Journal DOI: 10.5334/dsj-2021-026

METHODS
Here is a mathematical description of data scenarios where this phenomenon can occur. Let $\Sigma$ be the covariance matrix of the original variables $x_1, x_2, \cdots, x_p$ and let $(\lambda_1, e_1), (\lambda_2, e_2), \cdots, (\lambda_p, e_p)$ be its eigenvalue-eigenvector pairs, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. Then, the PCs are

$$y_k = e_k^{\top} x = e_{k1} x_1 + e_{k2} x_2 + \cdots + e_{kp} x_p, \qquad k = 1, 2, \cdots, p. \tag{1}$$

The classical approach (Johnson and Wichern, 1988) for dimensionality reduction is to select the first $s$ major PCs that jointly account for at least, say, $m \times 100\%$ of the total variance of the original variables,

$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_s}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} \ge m. \tag{2}$$

Now assume that we have a classification problem with two groups. Let $G_i$, $i = 1, 2, \cdots, n$, be dichotomous variables that denote the group classification. Assume that the true underlying model describing the association between $G_i$ and $y_{i1}, y_{i2}, \cdots, y_{ip}$ is given by the following logistic model,

$$\operatorname{logit}\bigl(P(G_i = 1)\bigr) = \beta_0 + \beta_1 y_{i,s+1} + \beta_2 y_{i,s+2} + \cdots + \beta_j y_{i,s+j}, \tag{3}$$

where $\beta_0, \beta_1, \cdots, \beta_j$ are the true effect sizes and $1 \le j \le p - s$. It is clear that under these conditions the classification will be poor, because the true predictors are excluded from the data at the dimensionality reduction preprocessing step. That omission entails low classification accuracy based on spurious associations between the group and noise variables, or no detectable classification capability at all. Therefore, in its classical dimensionality reduction implementation, PCA might not be useful for certain classification problems. In particular, in classification problems with complex patterns, the lower ranked PCs are the ones that carry the information about group differences, as the first several main PCs reflect the correlation structure of the complex mean pattern and do not contain enough information about subtle group differences. Thus, if PCA is employed, we recommend that the PC inclusion thresholds be carefully considered and based not only on the proportion of explained variance but also on the magnitude of the variance of the excluded PCs and the power to detect an effect size of a certain magnitude given the sample size (Schoenfeld D. A., 2005; F.Y. Hsieh and Larsen, 1998).

In particular, consider $y_{s+1} = (y_{1,s+1}, y_{2,s+1}, \ldots, y_{n,s+1})$ (with variance $\lambda_{s+1}$) for inclusion in the subsequent analysis, where the first $l$ and the subsequent $n - l$ subjects belong to groups 1 and 2 respectively. Let $\pi(\delta)$ denote the power to detect a difference of size $\delta$ between the group means subject to the restriction imposed by the fixed variance of the $(s+1)$-th PC. We will show that $\pi(\delta)$ can be arbitrarily close to 1. It is clear that $\pi(\delta)$ depends on the variances of the two groups, on the $(1 - \alpha/2)100$-th percentile $z_{1-\alpha/2}$ of the standard normal distribution, on the mean $\bar{y}_{s+1}$ of the vector $y_{s+1}$, and on the cumulative distribution function $\Phi$ of the standard normal distribution.
The ANOVA decomposition of the total sum of squares (5) expresses the fixed total variance $\lambda_{s+1}$ through the between-group and within-group sums of squares. Without loss of generality we can assume that the overall mean $\bar{y}_{s+1}$ is zero and that the means of the first and second groups are $d_1$ and $-d_2$. Then, from the condition that the overall mean is zero and (5), we deduce that $d_2 = d_1 l/(n - l)$, so that the difference between the group means is tied to $\lambda_{s+1}$ and the within-group variances. This result reveals that a principal component with arbitrarily small variance can have a statistically significant effect with respect to classification, which can produce a subsequent improvement in the area under the ROC curve, and that such a component should not be disregarded without further investigation.
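The argument above can be written out under explicit assumptions. The following is only a sketch based on the usual normal-approximation power of a two-sample test and the standard ANOVA identity; it is not necessarily the exact form of the expressions used in the derivation. The symbols $\sigma_1^2$ and $\sigma_2^2$ for the within-group variances, and the identification $\delta = d_1 + d_2$, are notation assumed here for illustration.

```latex
% Sketch under assumed notation: \sigma_1^2, \sigma_2^2 are the within-group
% sample variances of y_{s+1}; the group means are d_1 and -d_2, so that
% \delta = d_1 + d_2. The power expression is the usual two-sample normal
% approximation that ignores the negligible opposite tail.
\begin{align*}
\pi(\delta) &\approx \Phi\!\left(
    \frac{\delta}{\sqrt{\sigma_1^2/l + \sigma_2^2/(n-l)}} - z_{1-\alpha/2}
  \right), \\
(n-1)\,\lambda_{s+1} &= l\,d_1^2 + (n-l)\,d_2^2
  + (l-1)\,\sigma_1^2 + (n-l-1)\,\sigma_2^2, \\
l\,d_1 = (n-l)\,d_2 \;&\Longrightarrow\;
  d_2 = \frac{d_1\,l}{n-l}, \qquad
  \delta = d_1 + d_2 = \frac{d_1\,n}{n-l}, \\
\frac{l\,(n-l)}{n}\,\delta^2 &= (n-1)\,\lambda_{s+1}
  - (l-1)\,\sigma_1^2 - (n-l-1)\,\sigma_2^2 .
\end{align*}
```

Under these assumptions, holding $\lambda_{s+1}$ fixed and letting the within-group variances shrink toward zero drives $\delta^2$ toward $n(n-1)\lambda_{s+1}/\bigl(l(n-l)\bigr)$ while the standard error in the denominator tends to zero, so $\pi(\delta)$ approaches 1 no matter how small $\lambda_{s+1}$ is.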

RESULTS
We highlight the results through a numerical example. We consider a positive definite covariance matrix of four variables with eigenvalues 419.3, 75.8, 40.8, and 3.1, so that the first two PCs account for 91.9% of the total variance. The usual dimensionality reduction approach will use the first two PCs for further analysis and disregard the last two. Let the true model for the binary class assignment be given by $\operatorname{logit}(P(G_i = 1)) = 0.5 + \beta_1 y_{i3}$. For effect sizes $\beta_1 = \log(2)/4, \log(2)/2, \log(2), 2$, the average areas under the ROC curve (averaged over 10,000 simulated datasets containing 500 subjects) for a logistic regression model that uses PC1 and PC2 were 0.53, 0.54, 0.54, and 0.55, while the corresponding values for a model using PC3 were 0.76, 0.88, 0.95, and 0.99. A summary of the results is shown in Table 2.
It is clear that even the two smallest effect sizes of log(2)/8 and log(2)/4 entail dramatic classification accuracy improvements of 0.11 and 0.23 respectively, even though the true predictor, PC3, accounts for only 7.5% of the total variance in the data. The total variance is 539, however, and 7.5% of that amount still carries a substantial amount of information and subsequent classification power. Indeed, the power to detect effect sizes of log(2)/8 and log(2)/4 with a variable having a variance of 40.8 is almost 1, which supports the inclusion of PC3 in subsequent analyses.
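The numerical example can be approximated with a short simulation. The sketch below rests on assumptions that the text does not state: the PC scores are drawn as independent normal variables with variances equal to the quoted eigenvalues, each PC is used directly as a classification score instead of fitting a logistic regression, and far fewer replicates are run than the 10,000 datasets mentioned above. The function names `auc` and `simulate_auc` are hypothetical.

```python
import math
import random

# PC variances (eigenvalues) quoted in the numerical example.
EIGENVALUES = [419.3, 75.8, 40.8, 3.1]


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum identity (assumes no ties)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def simulate_auc(beta1, n_subjects=500, n_datasets=50, seed=0):
    """Average AUC of PC1, PC2 and PC3 when each is used as a raw score."""
    rng = random.Random(seed)
    sds = [math.sqrt(lam) for lam in EIGENVALUES]
    totals = [0.0, 0.0, 0.0]
    for _ in range(n_datasets):
        # Assumption: independent normal PC scores with variance lambda_k.
        pcs = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n_subjects)]
        # True model from the text: logit P(G = 1) = 0.5 + beta1 * PC3.
        labels = [
            1 if rng.random() < 1.0 / (1.0 + math.exp(-(0.5 + beta1 * row[2]))) else 0
            for row in pcs
        ]
        for k in range(3):
            totals[k] += auc([row[k] for row in pcs], labels)
    return [t / n_datasets for t in totals]


if __name__ == "__main__":
    for beta1 in (math.log(2) / 4, math.log(2) / 2, math.log(2)):
        pc1, pc2, pc3 = simulate_auc(beta1)
        print(f"beta1={beta1:.3f}  AUC(PC1)={pc1:.2f}  AUC(PC2)={pc2:.2f}  AUC(PC3)={pc3:.2f}")
```

Because the AUC is unchanged by any increasing transformation of the score, the AUC of PC3 used directly equals the AUC of a one-predictor logistic model on PC3 whenever the fitted slope is positive. The scores for PC1 and PC2 carry no information about the group under this model, so their AUC should stay near 0.5.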

DISCUSSION
In this work we show a potential performance problem of classification algorithms carried out after a preliminary dimensionality reduction step via PCA. These scenarios can occur even in simple, low-dimensional data, as our numerical example reveals. However, the issue can regularly arise with higher-dimensional data that possess complex patterns and multiple groups. In such cases, the main PCs capture the covariance pattern of the combined data while the lower ranked PCs capture the information about group differences and are therefore vital for classification accuracy. Our results show that PCA with inclusion thresholds based on the proportion of total variance explained often decreases classification accuracy, even with an extremely high inclusion threshold. Thus, we suggest using all PCs in classification problems in order to avoid the omission of lower ranked PCs that are important classification predictors. In such cases, the benefit of switching from the original variables to PCs might come from the fact that the PCs are uncorrelated, which might be advantageous in certain model building algorithms.