Introduction

Principal Component Analysis (PCA) (; ; ; ) is a popular tool for dimensionality reduction in data with a complex correlation structure among a large number of numerical variables. The correlations among the original variables can be used to create new summary variables, the principal components (PCs), which are optimal, uncorrelated linear combinations of the original variables. Optimality here means that the PCs have the maximum possible variance among all linear combinations of the original variables and thus carry the maximum amount of information, while their lack of correlation removes the redundancy present in the original variables. The well-known lemma on maximization of quadratic forms over the unit sphere shows that the vectors of coefficients defining the PCs are the eigenvectors of the variance matrix. The eigenvalues associated with these eigenvectors equal the variances of the PCs and induce an ordering of the PCs. The PCs with the largest variances are considered the main PCs and provide a scheme for dimensionality reduction: one retains the first few PCs that jointly account for more than, say, 80% or 90% of the total variance of the original variables. This approach makes intuitive sense because the PCs associated with the smallest eigenvalues are almost constant and would therefore appear to have limited classification capability.

However, in certain problems dimensionality reduction via PCA, even with a very high variance cutoff for exclusion, is not a good idea. We noticed this phenomenon while implementing an arrhythmia classification on ECG data, even though several studies have demonstrated the application of PCA to the same type of problem (, ; ; , , ). The ECG trace of a normal beat (shown in Figure 1) consists of a sequence of waves: a P-wave representing atrial depolarization, a QRS complex denoting ventricular depolarization, and a T-wave representing ventricular repolarization. Our data consisted of 200 data points per heartbeat with a complex correlation structure that seemed ideal for a preliminary PCA dimensionality reduction step before a subsequent classification approach was employed. However, applying PCA with exclusion cutoffs of 90%, 92%, 95%, and 99% to the 200 PCs consistently reduced the classification accuracy rate. The effect of PCA applied to an ECG segment representing a single heartbeat is depicted in Figure 2. This is an example revealing that PCA may not be a good idea for certain types of classification problems. More detailed results highlighting this finding are shown in Table 1: for five common classification algorithms (random forest, conditional random forest, naive Bayes, multinomial logistic regression, and quadratic discriminant analysis), the loss of classification accuracy incurred by replacing the original ECG data with principal components accounting for 99% of the total variance was between 0.001 and 0.06. In the subsequent presentation we show that omitting even the lowest ranked PCs can be detrimental to the classification accuracy of an algorithm.

Figure 1 

The ECG waveform and segments in lead II representing a normal cardiac cycle.

Figure 2 

A one-heartbeat ECG segment reconstructed using 100%, 20%, 40%, and 60% of the principal components, respectively.

Table 1

Accuracy* comparison between classification models using original variables and principal components**.


CLASSIFIER NAME | NON PCA | PCA** | DIFFERENCE
Random Forest | 0.96 | 0.92 | –0.04
Conditional Random Forest | 0.96 | 0.90 | –0.06
Naive Bayes | 0.92 | 0.87 | –0.05
Multinomial Logistic Regression | 0.94 | 0.94 | –0.001
Quadratic Discriminant Analysis | 0.93 | 0.90 | –0.02

* Accuracy is the average of 10 stratified folds.

** Principal components accounting for 99% of the variance used.
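To make the comparison in Table 1 concrete, the following minimal Python sketch contrasts 10-fold cross-validated accuracy on the original variables with accuracy after PCA retaining 99% of the variance. It is not the pipeline used for Table 1: the data are a synthetic stand-in generated with scikit-learn's make_classification rather than the 200-point ECG beats, and only one of the five classifiers (a random forest) is shown, so the resulting numbers will not match the table; it merely illustrates the type of comparison reported there.

```python
# Sketch of the Table 1 comparison on synthetic stand-in data:
# cross-validated accuracy on the original variables versus on the
# principal components retaining 99% of the variance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the 200-point heartbeat segments.
X, y = make_classification(n_samples=1000, n_features=200, n_informative=20,
                           random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Classifier on the original 200 variables.
acc_raw = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                          cv=cv, scoring="accuracy").mean()

# Same classifier after keeping the PCs that explain 99% of the variance.
acc_pca = cross_val_score(
    make_pipeline(PCA(n_components=0.99, svd_solver="full"),
                  RandomForestClassifier(random_state=0)),
    X, y, cv=cv, scoring="accuracy").mean()

print(f"accuracy, original variables: {acc_raw:.3f}")
print(f"accuracy, PCs (99% variance): {acc_pca:.3f}")
```

Applying the PCA step inside the pipeline ensures that the components are estimated on the training folds only, which is how the stratified 10-fold accuracies in Table 1 should be interpreted.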

Methods

Here is a mathematical description of data scenarios where this phenomenon can occur. Let $\Sigma$ be the covariance matrix of the original variables $x_1, x_2, \ldots, x_p$, and let $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ be the eigenvalue-eigenvector pairs, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. Then the PCs are $y_1 = \mathbf{e}_1^T\mathbf{x},\ y_2 = \mathbf{e}_2^T\mathbf{x},\ \ldots,\ y_p = \mathbf{e}_p^T\mathbf{x}$, where $\mathbf{x} = (x_1, x_2, \ldots, x_p)^T$. The classical approach () for dimensionality reduction is to select the first $s$ major PCs that jointly account for at least, say, $m \times 100\%$ of the total variance of the original variables,

(1)
$$s = \min\left\{\, 1 \le k \le p \;:\; \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} \ge m \right\}.$$
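As an illustration of the selection rule (1), the short sketch below computes $s$ from the eigenvalues of a given covariance matrix; the function name and the example matrix are introduced here purely for illustration.

```python
# Illustration of Eq. (1): keep the smallest number of leading PCs whose
# cumulative share of the total variance reaches the threshold m.
import numpy as np

def n_components_for_variance(cov, m=0.90):
    """Return s = min{ k : (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_p) >= m }."""
    eigvals = np.linalg.eigvalsh(cov)[::-1]          # eigenvalues, largest first
    cum_share = np.cumsum(eigvals) / eigvals.sum()   # cumulative proportion of variance
    return int(np.searchsorted(cum_share, m) + 1)

# Example with an arbitrary positive definite matrix (illustration only).
A = np.random.RandomState(1).normal(size=(6, 6))
Sigma = A @ A.T
print(n_components_for_variance(Sigma, m=0.90))
```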

Now assume that we have a classification problem with two groups. Let $G_i$, $i = 1, 2, \ldots, n$, be dichotomous variables that denote the group membership. Assume that the true underlying model describing the association between $G_i$ and $y_{i1}, y_{i2}, \ldots, y_{ip}$ is given by the following logistic model,

(2)
$$\mathrm{Logit}\left(P(G_i = 1 \mid y_{i1}, y_{i2}, \ldots, y_{ip})\right) = \beta_0 + \beta_1 y_{i,s+1} + \beta_2 y_{i,s+2} + \cdots + \beta_j y_{i,s+j},$$

where $\beta_0, \beta_1, \ldots, \beta_j$ are the true effect sizes and $1 \le j \le p - s$. It is clear that under these conditions the classification will be poor, because the true predictors are excluded from the data at the dimensionality reduction preprocessing step. That omission entails either low classification accuracy based on spurious associations between the group labels and noise variables, or no detectable classification capability at all.

Therefore, in its classical dimensionality reduction implementation, PCA might not be useful for certain classification problems. In particular, in classification problems with complex patterns, the lower ranked PCs are often the ones that carry the information about group differences, as the first several main PCs reflect the correlation structure of the complex mean pattern and do not contain enough information about subtle group differences. Thus, if PCA is employed, we recommend that the PC inclusion thresholds be carefully considered and based not only on the proportion of explained variance but also on the magnitude of the variance of the excluded PCs and on the power to detect an effect size of a given magnitude for the available sample size (; ). In particular, suppose we consider $\mathbf{y}_{s+1} = (y_{1,s+1}, y_{2,s+1}, \ldots, y_{n,s+1})$ (with variance $\lambda_{s+1}$) for inclusion in a subsequent analysis, where the first $l$ and the remaining $n - l$ subjects belong to groups 1 and 2, respectively. Let $\pi(\delta)$ denote the power to detect a difference of size $\delta$ between the group means, subject to the restriction imposed by the fixed variance of the $(s+1)$-th PC. We will show that $\pi(\delta)$ can be arbitrarily close to 1. It is clear that,

(3)
$$\pi(\delta) = \Phi\left( \sqrt{\frac{2\,l\,(n-l)}{n\,(\sigma_1^2 + \sigma_2^2)}}\;\delta - z_{1-\alpha/2} \right), \qquad \text{subject to } (n-1)\lambda_{s+1} = \sum_{i=1}^{n} \left( y_{i,s+1} - \bar{y}_{s+1} \right)^2,$$

where $\sigma_1^2, \sigma_2^2$ are the variances of the two groups, $z_{1-\alpha/2}$ is the $(1-\alpha/2)\cdot 100$-th percentile of the standard normal distribution, $\bar{y}_{s+1}$ is the mean of the vector $\mathbf{y}_{s+1}$, and $\Phi$ is the cumulative distribution function of the standard normal distribution.
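A minimal numerical sketch of the power function $\pi(\delta)$ in (3) follows; the helper function and the example values are illustrative only, and the final loop shows the behaviour claimed above, namely that the power approaches 1 as the within-group variances shrink.

```python
# Sketch of the power calculation in Eq. (3): a two-sample normal-approximation
# comparison of group means with group sizes l and n - l.
import numpy as np
from scipy.stats import norm

def power(delta, n, l, sigma1_sq, sigma2_sq, alpha=0.05):
    """pi(delta) = Phi( sqrt(2 l (n - l) / (n (sigma1^2 + sigma2^2))) * delta - z_{1-alpha/2} )."""
    z = norm.ppf(1 - alpha / 2)
    return norm.cdf(np.sqrt(2 * l * (n - l) / (n * (sigma1_sq + sigma2_sq))) * delta - z)

# The power approaches 1 as the within-group variances shrink toward zero.
for s2 in (1.0, 0.1, 0.01):
    print(s2, round(power(delta=0.5, n=100, l=50, sigma1_sq=s2, sigma2_sq=s2), 4))
```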

The ANOVA decomposition of the total sums of squares yields,

(4)
$$(n-1)\lambda_{s+1} = l\left(\bar{y}^{(1)}_{s+1} - \bar{y}_{s+1}\right)^2 + (n-l)\left(\bar{y}^{(2)}_{s+1} - \bar{y}_{s+1}\right)^2 + \sum_{i=1}^{l}\left(y_{i,s+1} - \bar{y}^{(1)}_{s+1}\right)^2 + \sum_{j=l+1}^{n}\left(y_{j,s+1} - \bar{y}^{(2)}_{s+1}\right)^2,$$

where $\bar{y}^{(1)}_{s+1}$, $\bar{y}^{(2)}_{s+1}$, and $\bar{y}_{s+1}$ are the means in the first group, the second group, and the entire sample, respectively.

Letting $\sigma_1^2 \to 0$ and $\sigma_2^2 \to 0$ entails $y_{i,s+1} \to \bar{y}^{(1)}_{s+1}$ for all $i = 1, 2, \ldots, l$ and $y_{j,s+1} \to \bar{y}^{(2)}_{s+1}$ for all $j = l+1, l+2, \ldots, n$. Then,

(5)
$$l\left(\bar{y}^{(1)}_{s+1} - \bar{y}_{s+1}\right)^2 + (n-l)\left(\bar{y}^{(2)}_{s+1} - \bar{y}_{s+1}\right)^2 \to (n-1)\lambda_{s+1}.$$

Without loss of generality we can assume that the overall mean $\bar{y}_{s+1}$ is zero and that the means of the first and second groups are $d_1$ and $-d_2$, respectively. Then, from the condition that the overall mean is zero and from (5), we deduce that $d_2 = d_1\, l/(n-l)$ and $d_1 = \sqrt{(n-1)(n-l)\lambda_{s+1}/(nl)}$. From here,

(6)
$$\delta = d_1 + d_2 = \sqrt{\frac{n(n-1)\lambda_{s+1}}{l(n-l)}},$$

which is always positive. Clearly, from (3) we get,

(7)
$$\pi(\delta) \xrightarrow{\;\sigma_1^2,\,\sigma_2^2 \to 0\;} 1.$$

This result reveals that a principal component with an arbitrarily small variance can have a statistically significant effect on classification, which can produce a subsequent improvement in the area under the ROC curve, and it should therefore not be disregarded without further investigation.
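The limiting argument in (4) through (6) can also be checked numerically. The sketch below, following the notation above, builds a sample with two nearly constant groups whose overall mean is zero and verifies that the between-group sum of squares absorbs essentially all of $(n-1)\lambda_{s+1}$ and that the observed difference between the group means matches the expression in (6); the sample size, group split, and seed are arbitrary choices for illustration.

```python
# Numerical check of Eqs. (4)-(6): with nearly constant groups, the between-group
# part of the ANOVA decomposition absorbs essentially all of (n - 1) * lambda_{s+1},
# and the mean difference matches delta = sqrt(n (n - 1) lambda_{s+1} / (l (n - l))).
import numpy as np

rng = np.random.default_rng(0)
n, l = 100, 40
eps = 1e-4                                   # tiny within-group spread (sigma -> 0)
d1 = 1.0
d2 = d1 * l / (n - l)                        # forces the overall mean to zero
y = np.concatenate([d1 + eps * rng.standard_normal(l),
                    -d2 + eps * rng.standard_normal(n - l)])

lam = y.var(ddof=1)                          # sample variance plays the role of lambda_{s+1}
between = l * (y[:l].mean() - y.mean())**2 + (n - l) * (y[l:].mean() - y.mean())**2
delta_formula = np.sqrt(n * (n - 1) * lam / (l * (n - l)))

print(between / ((n - 1) * lam))             # ~ 1: between-group SS dominates
print(y[:l].mean() - y[l:].mean(), delta_formula)   # observed gap vs. Eq. (6)
```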

Results

We highlight the results through a numerical example. The following positive definite covariance matrix,

(8)
$$R = \begin{pmatrix} 237 & 134 & 90 & 104 \\ 134 & 86 & 68 & 71 \\ 90 & 68 & 118 & 39 \\ 104 & 71 & 39 & 98 \end{pmatrix}$$

has eigenvalues 419.3, 75.8, 40.8, and 3.1, and the first two PCs account for 91.9% of the total variance. The usual dimensionality reduction approach would use the first two PCs for further analysis and disregard the last two. Let the true model for the binary class assignment be given by $\mathrm{Logit}(P(G_i = 1)) = 0.5 + \beta_1 y_{i3}$. For effect sizes $\beta_1 = \log(2)/4$, $\log(2)/2$, $\log(2)$, and 2, the average areas under the ROC curve (averaged over 10,000 simulated datasets, each containing 500 subjects) for a logistic regression model using PC1 and PC2 were 0.53, 0.54, 0.54, and 0.55, while the corresponding values for a model using PC3 were 0.76, 0.88, 0.95, and 0.99. A summary of the results is shown in Table 2.

Table 2

Areas under the ROC curve for both models.


β1 | AUC – PC1, PC2* | AUC – PC3*
log(2)/8 | 0.53 | 0.64
log(2)/4 | 0.53 | 0.76
log(2)/2 | 0.54 | 0.88
log(2) | 0.54 | 0.95
2 | 0.55 | 0.99

* Empirically estimated via 10,000 datasets.
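The simulation behind Table 2 can be sketched as follows. This is a simplified, single-replication illustration rather than the full study, which averaged the AUC over 10,000 simulated datasets: the covariance matrix R, the sample size of 500, and the true model Logit(P(G_i = 1)) = 0.5 + β1·y_i3 are taken from the text, while the scikit-learn calls and the default logistic regression settings are choices made only for this sketch.

```python
# Single-replication sketch of the Table 2 simulation: data from a multivariate
# normal with covariance R, class labels generated from a logistic model on PC3,
# and AUC of a logistic fit using PC1, PC2 compared with one using PC3.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

R = np.array([[237, 134,  90, 104],
              [134,  86,  68,  71],
              [ 90,  68, 118,  39],
              [104,  71,  39,  98]], dtype=float)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(4), cov=R, size=500)

# Principal component scores from the eigenvectors of R.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]            # sort eigenpairs, largest variance first
pcs = X @ eigvecs[:, order]                  # columns: PC1, PC2, PC3, PC4

beta1 = np.log(2)
p = 1.0 / (1.0 + np.exp(-(0.5 + beta1 * pcs[:, 2])))   # true model uses PC3 only
G = rng.binomial(1, p)

for cols, label in [((0, 1), "PC1, PC2"), ((2,), "PC3")]:
    fit = LogisticRegression().fit(pcs[:, cols], G)
    auc = roc_auc_score(G, fit.predict_proba(pcs[:, cols])[:, 1])
    print(f"AUC using {label}: {auc:.2f}")
```

Because only one dataset is drawn, the printed AUCs will fluctuate around the averages reported in Table 2 for β1 = log(2).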

It is clear that even the two smallest effect sizes of log(2)/8 and log(2)/4 entail dramatic classification accuracy improvements of 0.11 and 0.23, respectively, even though the true predictor, PC3, accounts for only 7.5% of the total variance in the data. However, the total variance is 539, and 7.5% of that amount still carries a substantial amount of information and, consequently, classification power. Moreover, the power to detect effect sizes of log(2)/8 and log(2)/4 with a variable having a variance of 40.8 is almost 1, suggesting that PC3 should be included in subsequent analyses.

Discussion

In this work we show a potential performance problem of classification algorithms carried out after a preliminary dimensionality reduction step via PCA. These scenarios can occur even in simple, low dimensional data cases, as our numerical example reveals. However, the issue can regularly arise with higher dimensional data that possess complex patterns and multiple groups. In such cases, the main PCs capture the covariance pattern of the combined data, while the lower ranked PCs capture the information about group differences and are therefore vital for classification accuracy. Our results show that PCA with inclusion thresholds based on the proportion of total variance explained often decreases classification accuracy, even with extremely high inclusion thresholds. Thus, we suggest using all PCs in a classification problem in order to avoid the omission of lower ranked PCs that are important classification predictors. In such cases, the benefit of not using the original variables and switching to PCA might come from the fact that the PCs are uncorrelated, which might be advantageous in certain model building algorithms.