
# On the Application of Principal Component Analysis to Classification Problems

## Abstract

Principal Component Analysis (PCA) is a commonly used technique that exploits the correlation structure of the original variables to reduce the dimensionality of the data. The reduction is achieved by retaining only the first few principal components for subsequent analysis. The usual inclusion criterion is that the proportion of the total variance explained by the retained principal components exceeds a predetermined threshold. We show that in certain classification problems even an extremely high inclusion threshold can negatively impact classification accuracy, because omitting small-variance principal components can severely diminish the performance of the models. We noticed this phenomenon in classification analyses of high-dimensional ECG data, where the most common classification methods lost between 1 and 6% of accuracy even with a 99% inclusion threshold. However, as our numerical example shows, the issue can also occur in low-dimensional data with a simple correlation structure. We conclude that the exclusion of any principal components should be carefully investigated.

How to Cite: Zheng, J. and Rakovski, C., 2021. On the Application of Principal Component Analysis to Classification Problems. Data Science Journal, 20(1), p.26. DOI: http://doi.org/10.5334/dsj-2021-026
Published on 18 Aug 2021
Accepted on 09 Aug 2021            Submitted on 13 Dec 2020

## Introduction

Principal Component Analysis (PCA) (Du et al., 2012; Hsieh et al., 2010; Mehmet Korürek, 2010; Kim et al., 2009) is a popular tool for dimensionality reduction in the presence of a complex correlation structure among a large number of numerical variables. The correlations among the original variables can be used to create new summary variables, principal components (PCs), that are optimal, uncorrelated linear combinations of the original variables. The optimality lies in the fact that each PC has the maximum possible variance among all normalized linear combinations of the original variables that are uncorrelated with the preceding PCs, and thus contains the maximum amount of information. The lack of correlation among the PCs removes the redundancy present in the original variables. The well-known lemma on the maximization of quadratic forms over the unit sphere shows that the coefficient vectors that define the PCs are the eigenvectors of the covariance matrix. The eigenvalues associated with the eigenvectors equal the variances of the PCs and define an order among all PCs. The PCs with the largest variances are considered the main PCs and provide a scheme for dimensionality reduction: we take the first few PCs that jointly account for more than 80% or 90% of the total variance of the original variables. This approach makes intuitive sense, as the PCs associated with the smallest eigenvalues are almost constant and thus appear to have limited classification capability. However, in certain problems dimensionality reduction via PCA is not a good idea, even with a high inclusion cutoff. We noticed this phenomenon while implementing an arrhythmia classification on ECG data, even though several studies have applied PCA in the same research area (Gupta and Mittal, 2019b, 2018b; Gupta et al., 2020; Gupta and Mittal, 2018a, 2016, 2019a).
The ECG graph of a normal beat (shown in Figure 1) consists of a sequence of waves: a P-wave representing the atrial depolarization process, a QRS complex denoting the ventricular depolarization process, and a T-wave representing the ventricular repolarization. Our data consisted of 200 data points per heartbeat with a complex correlation structure that seemed ideal for a preliminary PCA dimensionality reduction step before a subsequent classification approach was employed. However, using PCA cutoffs of 90%, 92%, 95%, and 99% for the 200 PCs dramatically decreased the classification accuracy rate. Figure 2 depicts a single-heartbeat ECG segment processed with PCA. This example reveals that PCA may not be a good idea for certain types of classification problems. More detailed results that highlight this finding are shown in Table 1. Comparing five common classification algorithms (random forest, conditional random forest, naive Bayes, multinomial logistic regression, and quadratic discriminant analysis) fitted to the original ECG data with the same algorithms fitted to the principal components accounting for 99% of the total variance, the loss of classification accuracy was between 0.001 and 0.06. In the subsequent presentation we show that omission of even the lowest ranked PCs can be disadvantageous to the classification accuracy of the algorithm.

Figure 1

The ECG waveform and segments in lead II that presents a normal cardiac cycle.

Figure 2

One heartbeat ECG presented by 100%, 20%, 40%, and 60% respectively.

Table 1

Accuracy* comparison between classification models using original variables and principal components**.

| Classifier name | Non-PCA | PCA** | Difference |
| --- | --- | --- | --- |
| Random Forest | 0.96 | 0.92 | –0.04 |
| Conditional Random Forest | 0.96 | 0.90 | –0.06 |
| Naive Bayes | 0.92 | 0.87 | –0.05 |
| Multinomial Logistic Regression | 0.94 | 0.94 | –0.001 |
| Quadratic Discriminant Analysis | 0.93 | 0.90 | –0.02 |

* Accuracy is the average of 10 stratified folds.

** Principal components accounting for 99% of the variance used.

## Methods

Here is a mathematical description of data scenarios in which this phenomenon can occur. Let Σ be the covariance matrix of the original variables x1, x2, …, xp, collected in the vector $x={\left({x}_{1},{x}_{2},\dots ,{x}_{p}\right)}^{T}$, and let (λ1, e1), (λ2, e2), …, (λp, ep) be its eigenvalue-eigenvector pairs, where λ1 ≥ λ2 ≥ … ≥ λp. Then the PCs are ${y}_{1}={e}_{1}^{T}x,{y}_{2}={e}_{2}^{T}x, \dots , {y}_{p}={e}_{p}^{T}x$. The classical approach (Johnson and Wichern, 1988) for dimensionality reduction is to select the first s major PCs that jointly account for at least, say, m × 100% of the total variance of the original variables,

(1)
$s=\min \left\{k : 1\le k\le p,\ \frac{{\lambda }_{1}+{\lambda }_{2}+\dots +{\lambda }_{k}}{{\lambda }_{1}+{\lambda }_{2}+\dots +{\lambda }_{p}}\ge m\right\}.$
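The selection rule in (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; the function name `select_num_components` and the example eigenvalues are hypothetical.

```python
from itertools import accumulate


def select_num_components(eigenvalues, m):
    """Smallest s such that the s largest eigenvalues explain at least a fraction m of the total variance."""
    if not 0.0 < m <= 1.0:
        raise ValueError("m must lie in (0, 1]")
    # Order the eigenvalues from largest to smallest, matching the ordering of the PCs.
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, partial in enumerate(accumulate(ordered), start=1):
        # A small tolerance guards against floating-point round-off at the boundary.
        if partial / total >= m - 1e-12:
            return k
    return len(ordered)


# Hypothetical eigenvalues, chosen only to illustrate the rule.
example = [5.0, 3.0, 1.5, 0.5]
print(select_num_components(example, 0.80))  # first two explain 8 of 10, so s = 2
```

Note that the rule looks only at the eigenvalues; it never consults the class labels, which is why a retained set chosen this way can miss the PCs that matter for classification.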

Now assume that we have a classification problem with two groups. Let Gi, i = 1, 2, …, n be dichotomous variables that denote the group classification. Assume that the true underlying model describing the association between Gi and yi1, yi2, …, yip is given by the following logistic model,

(2)
$\mathrm{Logit}\left(P\left({G}_{i}=1 \mid {y}_{i1},{y}_{i2}, \dots ,{y}_{ip}\right)\right)={\beta }_{0}+{\beta }_{1}{y}_{i,s+1}+{\beta }_{2}{y}_{i,s+2}+\dots +{\beta }_{j}{y}_{i,s+j},$

where β0, β1, …, βj are the true effect sizes and 1 ≤ j ≤ p – s. It is clear that under these conditions the classification will be poor, because the true predictors are excluded from the data at the preprocessing step of dimensionality reduction. That omission entails either low classification accuracy based on spurious associations between the group and noise variables, or no detectable classification capability at all.

Therefore, in its classical dimensionality reduction implementation, PCA might not be useful for certain classification problems. In particular, in classification problems with complex patterns, the lower ranked PCs are the ones that carry the information about group differences, because the first several main PCs reflect the correlation structure of the complex mean pattern and do not contain enough information about subtle group differences. Thus, if PCA is employed, we recommend that the PC inclusion thresholds be carefully considered and based not only on the proportion of explained variance but also on the magnitude of the variance of the excluded PCs and on the power to detect an effect size of a certain magnitude given the sample size (Schoenfeld and Borenstein, 2005; Hsieh et al., 1998). To make this concrete, consider ys+1 = (y1,s+1, y2,s+1, …, yn,s+1), with variance λs+1, as a candidate for inclusion in subsequent analysis, where the first l and the subsequent n – l subjects belong to groups 1 and 2, respectively. Let π(δ) denote the power to detect a difference of size δ between the group means subject to the restriction imposed by the fixed variance of the (s+1)-th PC. We will show that π(δ) can be arbitrarily close to 1. It is clear that,

(3)
$\pi \left(\delta \right)=\Phi \left(\sqrt{\frac{2l\left(n-l\right)\delta }{n\left({\sigma }_{1}^{2}+{\sigma }_{2}^{2}\right)}}-{z}_{1-\alpha /2}:\left(n-1\right){\lambda }_{s+1}=\sum _{i=1}^{n}{\left({y}_{i,s+1}-{\overline{y}}_{s+1}\right)}^{2}\right),$

where ${\sigma }_{1}^{2},{\sigma }_{2}^{2}$ are the variances of the two groups, ${z}_{1-\alpha /2}$ is the 100(1 – α/2)-th percentile of the standard normal distribution, ${\overline{y}}_{s+1}$ is the mean of the vector ys+1, and Φ is the cumulative distribution function of the standard normal distribution.

The ANOVA decomposition of the total sums of squares yields,

(4)
$\left(n-1\right){\lambda }_{s+1}=l{\left({\overline{y}}_{s+1}^{\prime }-{\overline{y}}_{s+1}\right)}^{2}+\left(n-l\right){\left({\overline{y}}_{s+1}^{\prime \prime }-{\overline{y}}_{s+1}\right)}^{2}+\sum _{i=1}^{l}{\left({y}_{i,s+1}-{\overline{y}}_{s+1}^{\prime }\right)}^{2}+\sum _{j=l+1}^{n}{\left({y}_{j,s+1}-{\overline{y}}_{s+1}^{\prime \prime }\right)}^{2},$

where ${\overline{y}}_{s+1}^{\prime },{\overline{y}}_{s+1}^{\prime \prime }$, and ${\overline{y}}_{s+1}$ are the means in the first group, the second group, and the entire sample, respectively.

Letting ${\sigma }_{1}^{2}\to 0$ and ${\sigma }_{2}^{2}\to 0$ entails ${y}_{i,s+1}\to {\overline{y}}_{s+1}^{\prime }$ for all i = 1, 2, …, l and ${y}_{j,s+1}\to {\overline{y}}_{s+1}^{\prime \prime }$ for all j = l + 1, l + 2, …, n. Then,

(5)
$l{\left({\overline{y}}_{s+1}^{\prime }-{\overline{y}}_{s+1}\right)}^{2}+\left(n-l\right){\left({\overline{y}}_{s+1}^{\prime \prime }-{\overline{y}}_{s+1}\right)}^{2}\to \left(n-1\right){\lambda }_{s+1}.$

Without loss of generality we can assume that the overall mean ${\overline{y}}_{s+1}$ is zero and that the means of the first and second groups are d1 and –d2. Then, from the condition that the overall mean is zero, l d1 – (n – l) d2 = 0, and from (5), we deduce that d2 = d1 l/(n – l) and ${d}_{1}=\sqrt{\left(n-1\right)\left(n-l\right){\lambda }_{s+1}/\left(nl\right)}$. From here,

(6)
$\delta ={d}_{1}+{d}_{2}=\frac{n}{n-l}{d}_{1}=\sqrt{\frac{n\left(n-1\right){\lambda }_{s+1}}{l\left(n-l\right)}},$

which is always positive. Clearly, from (3) we get,

(7)
$\pi \left(\delta \right)\underset{{\sigma }_{1}^{2},{\sigma }_{2}^{2}\to 0}{\to }1.$
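The limiting argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' computation: it uses the textbook power approximation for a two-sided two-sample z-test (which may differ in form from expression (3)), assumes both groups share the same within-group standard deviation `sigma`, and holds the total sum of squares fixed at (n – 1)λs+1 as in (4). The function name `constrained_power` and the numerical values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def constrained_power(sigma, n, l, lam, alpha=0.05):
    """Approximate power to detect the group mean difference when the total
    sum of squares of the component is fixed at (n - 1) * lam."""
    total_ss = (n - 1) * lam
    within_ss = (n - 2) * sigma**2      # (l - 1) + (n - l - 1) degrees of freedom
    between_ss = total_ss - within_ss   # what the ANOVA identity leaves for the means
    if between_ss <= 0:
        raise ValueError("sigma is too large for the fixed variance lam")
    # Group means d1 and -d2 with zero overall mean, as in the derivation.
    d1 = sqrt(between_ss * (n - l) / (n * l))
    d2 = d1 * l / (n - l)
    delta = d1 + d2
    se = sigma * sqrt(1 / l + 1 / (n - l))  # standard error of the mean difference
    z = NormalDist()
    return z.cdf(delta / se - z.inv_cdf(1 - alpha / 2))


# Hypothetical setting: 500 subjects, 200 in group 1, and a tiny PC variance.
for sigma in (0.1, 0.05, 0.01):
    print(sigma, round(constrained_power(sigma, n=500, l=200, lam=0.01), 4))
```

As the within-group standard deviation shrinks, almost all of the fixed variance is attributed to the difference between the group means, and the computed power approaches 1 even though the variance of the component itself is very small.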

This result reveals that a principal component with arbitrarily small variance can have a statistically significant effect with respect to classification, which can in turn improve the area under the ROC curve. Such a component should not be disregarded without further investigation.

## Results

We highlight the results through a numerical example. The following positive definite covariance matrix,

(8)
$R=\left[\begin{array}{cccc}237& 134& 90& 104\\ 134& 86& 68& 71\\ 90& 68& 118& 39\\ 104& 71& 39& 98\end{array}\right]$

has eigenvalues 419.3, 75.8, 40.8, 3.1, and the first two PCs account for 91.9% of the total variance. The usual dimensionality reduction approach will use the first two PCs for further analysis and disregard the last two. Let the true model for the binary class assignment be given by Logit(P(Gi = 1)) = 0.5 + β1yi3. For effect sizes β1 = log(2)/4, log(2)/2, log(2), 2, the average areas under the ROC curve (averaged over 10,000 simulated datasets containing 500 subjects each) for a logistic regression model that uses PC1 and PC2 were 0.53, 0.54, 0.54, 0.55, while the corresponding values for a model using PC3 were 0.76, 0.88, 0.95, 0.99. A summary of the results is shown in Table 2.
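The eigenvalues and the explained-variance proportions quoted above can be reproduced from the matrix in (8). The following is a minimal sketch that assumes NumPy is available; the variable names are illustrative.

```python
import numpy as np

# Covariance matrix (8) from the text.
R = np.array([
    [237, 134,  90, 104],
    [134,  86,  68,  71],
    [ 90,  68, 118,  39],
    [104,  71,  39,  98],
], dtype=float)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column k holds the coefficients of PC k + 1

explained = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained)

print(np.round(eigenvalues, 1))   # variances of PC1..PC4
print(np.round(cumulative, 3))    # cumulative proportion of total variance
```

The total variance equals the trace of the matrix, 237 + 86 + 118 + 98 = 539, so the proportion attributed to each PC is its eigenvalue divided by 539.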

Table 2

Areas under the ROC curve for both models.

| β1 | AUC – PC1, PC2* | AUC – PC3* |
| --- | --- | --- |
| log(2)/8 | 0.53 | 0.64 |
| log(2)/4 | 0.53 | 0.76 |
| log(2)/2 | 0.54 | 0.88 |
| log(2) | 0.54 | 0.95 |
| 2 | 0.55 | 0.99 |

* Empirically estimated via 10,000 datasets.

It is clear that even the two smallest effect sizes, log(2)/8 and log(2)/4, entail dramatic improvements in the area under the ROC curve of 0.11 and 0.23, respectively, even though the true predictor, PC3, accounts for only 7.5% of the total variance in the data. However, this total variance is 539, and 7.5% of that amount still carries a substantial amount of information and subsequent classification power. Moreover, the power to detect effect sizes of log(2)/8 and log(2)/4 with a variable having a variance of 40.8 is almost 1, which supports the inclusion of PC3 in subsequent analyses.
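A single simulated dataset is enough to see the pattern in Table 2. The sketch below is a simplified stand-in for the authors' simulation, which is not shown in the text: it assumes normally distributed PCs with the reported eigenvalues as variances, draws one dataset instead of averaging over 10,000, and uses each PC directly as a score instead of fitting a logistic regression. The AUC values will therefore not match Table 2 exactly. The function names are hypothetical, and NumPy is assumed.

```python
import numpy as np


def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney rank statistic (assumes no ties)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate_aucs(beta1, n=500, seed=0):
    """Simulate one dataset and return the AUCs of PC1 and PC3 used as scores."""
    rng = np.random.default_rng(seed)
    eigenvalues = np.array([419.3, 75.8, 40.8, 3.1])  # reported in the text
    # PCs are uncorrelated with variances equal to the eigenvalues (normality assumed).
    pcs = rng.normal(0.0, np.sqrt(eigenvalues), size=(n, 4))
    # True model from the text: Logit(P(G = 1)) = 0.5 + beta1 * PC3.
    prob = 1.0 / (1.0 + np.exp(-(0.5 + beta1 * pcs[:, 2])))
    labels = (rng.random(n) < prob).astype(int)
    return auc(pcs[:, 0], labels), auc(pcs[:, 2], labels)


for beta1 in (np.log(2) / 8, np.log(2) / 4, np.log(2) / 2, np.log(2), 2.0):
    a1, a3 = simulate_aucs(beta1)
    print(f"beta1={beta1:.3f}  AUC(PC1)={a1:.2f}  AUC(PC3)={a3:.2f}")
```

Because PC1 is independent of the class label under the true model, its AUC fluctuates around 0.5, whereas the AUC of PC3 grows with the effect size despite its much smaller variance.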

## Discussion

In this work we show a potential performance problem of classification algorithms carried out after a preliminary dimensionality reduction step via PCA. Such scenarios can occur even in simple, low-dimensional data, as our numerical example reveals. However, the issue can arise regularly with higher-dimensional data that possess complex patterns and multiple groups. In such cases, the main PCs capture the covariance pattern of the combined data, while the lower ranked PCs capture the information about group differences and are therefore vital for classification accuracy. Our results show that PCA with inclusion thresholds based on the proportion of total variance explained can decrease classification accuracy even with an extremely high inclusion threshold. Thus, we suggest using all PCs in classification problems in order to avoid the omission of lower ranked PCs that are important classification predictors. In such cases, the benefit of switching from the original variables to PCs might come from the fact that the PCs are uncorrelated, which might be advantageous in certain model-building algorithms.

## Acknowledgements

We are grateful for the support of the Kay Family Foundation.

## Competing Interests

The authors have no competing interests to declare.

## References

1. Du, X, Dua, S, Acharya, RU and Chua, CK. 2012. Classification of epilepsy using high-order spectra features and principle component analysis. Journal of Medical Systems, 36: 1731–1743.

2. Gupta, V and Mittal, M. 2016. Respiratory signal analysis using pca, fft and artfa. 221–225. DOI: https://doi.org/10.1109/ICEPES.2016.7915934

3. Gupta, V and Mittal, M. 2018a. Knn and pca classifier with autoregressive modeling during different ecg signal interpretation. Procedia Computer Science, 125: 18–24.

4. Gupta, V and Mittal, M. 2018b. R-peak based arrhythmia detection using hilbert transform and principal component analysis. 1–4.

5. Gupta, V and Mittal, M. 2019a. Qrs complex detection using stft, chaos analysis, and pca in standard and real-time ecg databases. Journal of The Institution of Engineers (India): Series B, 100.

6. Gupta, V and Mittal, M. 2019b. R-peak detection in ecg signal using yule–walker and principal component analysis. IETE Journal of Research, 1–14.

7. Gupta, V, Mittal, M and Mittal, V. 2020. R-peak detection based chaos analysis of ecg signal. Analog Integrated Circuits and Signal Processing, 102. DOI: https://doi.org/10.1007/s10470-019-01556-1

8. Hsieh, C-W, Liu, T-C, Jong, T-L and Tiu, C-M. 2010. A fuzzy-based growth model with principle component analysis selection for carpal bone-age assessment. Medical & Biological Engineering & Computing, 48: 579–588.

9. Hsieh, FY, Bloch, AB and Larsen, MD. 1998. A simple method of sample size calculation for linear and logistic regression. Statistics in Medicine, 17: 1623–1634.

10. Johnson, RA and Wichern, DW. (eds.) 1988. Applied Multivariate Statistical Analysis. Upper Saddle River, NJ, USA: Prentice-Hall, Inc. DOI: https://doi.org/10.2307/2531616

11. Kim, J, Shin, HS, Shin, K and Lee, M. 2009. Robust algorithm for arrhythmia classification in ecg using extreme learning machine. BioMedical Engineering OnLine, 8: 31.

12. Mehmet Korürek, AN. 2010. Clustering mit–bih arrhythmias with ant colony optimization using time domain and pca compressed wavelet coefficients. Digital Signal Processing, 20: 1050–1060.

13. Schoenfeld, DA and Borenstein, M. 2005. Calculating the power or sample size for the logistic and proportional hazards models. Journal of Statistical Computation and Simulation, 75: 771–785.