LINEAR AND SUPPORT VECTOR REGRESSIONS BASED ON GEOMETRICAL CORRELATION OF DATA

Linear regression (LR) and support vector regression (SVR) are widely used in data analysis. Geometrical correlation learning (GcLearn) was proposed recently to improve the predictive ability of LR and SVR by mining and using the correlations between the data of a variable (inner correlation). This paper theoretically analyzes the prediction performance of the GcLearn method and proves that GcLearn LR and SVR have better prediction performance than traditional LR and SVR when good inner correlations are obtained and the predictions by traditional LR and SVR are far away from their neighbor training data under the inner correlation. This result identifies the conditions under which the GcLearn method is applicable.


INTRODUCTION
In multivariate data analysis, linear regression (LR) and support vector regression (SVR) (Lin, 2001; Meyer et al., 2003; Zhou et al., 2006; Chang & Lin, 2006) are widely used to analyze the relationship between a dependent variable and a set of independent variables, and a regression model built by LR or SVR may be used for prediction tasks (Chen et al., 2004; Wagner et al., 2005; Sinnakaudan et al., 2006). When a regression model is constructed for 1-dimensional continuous variables, the dependency relations among variables are the focus, while correlations between the data of a single variable are usually neglected. Using this correlation information is expected to improve the predictive ability of LR and SVR methods.
Let Y be the response variable of some 1-dimensional dependent variables and have n data points. The space relation of the n data points in the 1-dimensional coordinate y, i.e., T = {y_1, y_2, y_3, …, y_n} where y_1 ≤ y_2 ≤ y_3 ≤ … ≤ y_n, is called the value correlation of the data, while the neighbor relation between the n data points under some varying trend is called the trend correlation. The value correlation indicates which data points are close to each other in the 1-dimensional coordinate, whereas the trend correlation usually comes from additional information and prior knowledge about correlations; e.g., the trend correlation is the time relation if the n data points form a time series and vary with time. Both value correlation and trend correlation are called inner correlations, and an inner correlation represented by a geometric entity is called a geometrical correlation.
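The two kinds of inner correlation defined above can be illustrated with a small sketch. The data values and the helper `neighbors` are hypothetical and serve only to show that the same observations have different neighbors under a value ordering and under a time ordering.

```python
import numpy as np

# Hypothetical time series: y[i] was observed at time step i.
y = np.array([3.2, 1.5, 4.8, 2.1, 3.9])

# Value correlation: order the observations by magnitude (y_1 <= ... <= y_n).
value_order = np.argsort(y, kind="stable")
T_value = y[value_order]

# Trend correlation: the prior knowledge here is the observation time,
# so the neighbor relation is simply the original time order.
trend_order = np.arange(len(y))

def neighbors(order, i):
    """Return the indices adjacent to observation i under a given ordering."""
    pos = int(np.where(order == i)[0][0])
    return [int(order[p]) for p in (pos - 1, pos + 1) if 0 <= p < len(order)]

print(neighbors(value_order, 0))  # neighbors of y[0] by value
print(neighbors(trend_order, 0))  # neighbors of y[0] by time
```

Under the value ordering, the observation 3.2 sits between 2.1 and 3.9, whereas under the time ordering its only neighbor is the observation made one step later.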
Recently, geometrical correlation learning (GcLearn) (Wang et al., 2007) has been proposed to mine and use the inner correlations inherent in data for LR and SVR methods. The GcLearn method can improve the predictive ability of LR and SVR methods because it makes use of additional information, namely the inner correlations of data, while traditional LR and SVR methods do not. First, GcLearn finds the inner correlation T from the data of the response variable. According to T, GcLearn projects the 1-dimensional data of each variable onto a 2-dimensional smooth curve (called a curve manifold), which is approximated by piecewise quadratic polynomial curves. Thus, a curve manifold represents the data of a variable and shows their geometrical correlation. Then, a regression model for the variables is found by the designed geometric regression method, which minimizes the fitting error of the model to the curve manifolds. Finally, the optimal regression model F is found through an optimization process for the piecewise curve approximation. When we predict with model F, a test datum x of predictor variable X is also projected to x' on the curve manifold of X (and similarly for the other predictor variables), and x' is then used as the input to model F for prediction.
In this paper, we theoretically analyze the prediction performance of the GcLearn LR and SVR methods and prove that the GcLearn LR/SVR method gives better prediction performance than the traditional LR/SVR method when good inner correlations are obtained and the predictions from a traditional LR/SVR model are bad. The GcLearn method from our previous work is introduced in Section 2, and the theoretical analysis of its prediction performance is given in Section 3. Finally, the conclusion is given in Section 4.

GCLEARN LINEAR AND SUPPORT VECTOR REGRESSIONS
The GcLearn method (Wang et al., 2007) is briefly introduced in this section. It consists of finding inner correlations, projecting data onto curve manifolds according to the inner correlations, finding a regression model from the curve manifolds, and predicting with the model found.
The first task is to find the inner correlations and project the data of each variable onto a 2-dimensional smooth curve M so that M represents the data and their geometrical correlations. The value correlations are mined from the data, while the trend correlations depend on or are obtained from prior knowledge about correlations; e.g., the trend correlation is the time relation if the n data points form a time series and vary with time. Here we introduce the whole procedure using value correlations as the example; it is similar for trend correlations. It is hard to construct one curve for all the data of a variable reasonably and accurately in all cases, so we approximate M by piecewise curves, each fitted to part of the data. This involves dividing the data, fitting a piecewise curve to each data group, and connecting the piecewise curves. The construction of such an M is discussed within the framework of manifold theory (Chen, 2001).
Let variable Y be the response variable of q 1-dimensional continuous dependent variables and have n data points. The value correlation is obtained when the n data points are placed in the 1-dimensional coordinate y, i.e., T = {y_1, y_2, y_3, …, y_n} where y_1 ≤ y_2 ≤ y_3 ≤ … ≤ y_n. Then, natural numbers are adopted to represent T for constructing M, i.e., T = {1, 2, 3, …, n} is used instead of, and in correspondence with, {y_1, y_2, y_3, …, y_n}. Let the n data points be divided equally into k groups, each with m = n/k data points along T (e.g., the second group is {y_{m+1}, y_{m+2}, …, y_{m+m}}), and let each data group be called local data. To simplify the discussion, the notation y_j ∈ R (j = 1, 2, …, m) is adopted within each data group.
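The data division just described can be sketched as follows. The function name `divide_into_groups` is hypothetical, and the sketch assumes that n is exactly divisible by k, as m = n/k in the text suggests.

```python
import numpy as np

def divide_into_groups(y, k):
    """Sort y, label it with T = 1..n, and split it into k equal groups.

    Assumes n is divisible by k, matching m = n/k in the text.
    """
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = y_sorted.size
    if n % k != 0:
        raise ValueError("n must be divisible by k")
    m = n // k
    T = np.arange(1, n + 1)          # natural numbers standing in for y_1..y_n
    groups = y_sorted.reshape(k, m)  # row i holds the local data of group i+1
    u = np.arange(1, m + 1)          # local coordinate u_j = j within a group
    return T, groups, u

T, groups, u = divide_into_groups([5.0, 1.0, 4.0, 2.0, 6.0, 3.0], k=2)
print(groups)  # two groups of m = 3 sorted values each
```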
Then, GcLearn projects the n data points onto a 2-dimensional smooth curve M (called a curve manifold) (Chen, 2001) in the space yOt (the coordinate t denotes T):
P: {y_1, y_2, …, y_n} → M = U_1 ∪ U_2 ∪ … ∪ U_k,
where local region U_i of M is described by a piece of quadratic polynomial curve C_i: y = ƒ_i(u) (u ∈ [1, m]) under the local coordinate system yOu (the coordinate u denotes the part of T that belongs to the local data). The piecewise curve y = ƒ_i(u) is fitted to the data of the ith group (m pairs {y_j, u_j}, where u_j = j and j = 1, 2, …, m) and is found by a least-squares fit (Mo & Liu, 2003; Wolfram, 2007). The united local regions should be connected so that their joints are continuous and smooth, but in real applications we skip this difficult step because its influence on the outcomes is slight.
The next step is to design a geometric regression method that constructs a regression model from curve manifolds; it is discussed using the example of linear regression between two variables. The least-squares fit method combined with integral computation over curve manifolds for solving a regression equation is called the geometric regression method. Let there be a linear regression equation for predictor variable X and response variable Y, and let the same coordinate system xOu be used for the local regions of curve manifold X_M and the same yOu for those of Y_M. Then any point of a local region of X_M is (x, u) and any point of Y_M is (y, u). […] The least-squares fit method is used to solve for the coefficients of […] This theorem tells how to solve for the unknown coefficients of a linear regression equation with curve manifolds.
The geometric regression of SVR follows a similar principle: the integral value of every local region, instead of the original data, is used as the input data for traditional SVR.
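The idea of regressing on integral values of local regions rather than on raw data can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `local_integrals` is hypothetical, the groups do not overlap (no central cutting), each integral is divided by the region length so that the intercept stays on the original scale, and ordinary least squares stands in for the final regressor (the same inputs could instead be passed to an SVR).

```python
import numpy as np

def local_integrals(values, m):
    """Fit a quadratic to each consecutive group of m points (u = 1..m)
    and return the mean value of each fitted curve over [1, m]."""
    values = np.asarray(values, dtype=float)
    u = np.arange(1, m + 1)
    out = []
    for start in range(0, values.size - m + 1, m):
        coeffs = np.polyfit(u, values[start:start + m], deg=2)  # least-squares fit
        antideriv = np.polyint(coeffs)
        integral = np.polyval(antideriv, m) - np.polyval(antideriv, 1)
        out.append(integral / (m - 1))  # normalize by region length (assumption)
    return np.array(out)

# Synthetic data with a known linear relation y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=60)

order = np.argsort(y)                    # value correlation T from the response
x_loc = local_integrals(x[order], m=6)   # one value per local region of X
y_loc = local_integrals(y[order], m=6)   # one value per local region of Y

slope, intercept = np.polyfit(x_loc, y_loc, deg=1)
print(slope, intercept)
```

Both variables are ordered under the same T before fitting, so each pair of local values refers to the same group of observations.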
The final step is to find an optimal regression model F according to the goodness of fit of the model to the curve manifolds. A geometrization parameter v (v = 1, 2, …, n) is defined as the unified length of every local region, so that k = n/v is the number of divided data groups. The length of every local curve is set to 3v, and the central 1/3 part of a local curve is taken as the local region of a curve manifold (called central cutting). Central cutting makes neighboring data groups overlap, but neighboring local regions do not overlap. This optimization problem can be solved by finding an optimal v (in our experience, v = 5 to 14 is a suitable search range) that minimizes the residual sum of squares of the model fit:
v_0 = arg min_v RSS(v), (10)
where RSS(v) denotes the residual sum of squares of the model fitted with parameter v. With the optimal parameter v_0, the optimal curve manifolds are constructed, and then the optimal regression model F is found with the optimal curve manifolds.
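Central cutting and the search over v can be sketched as follows. The names `central_cutting_windows` and `best_v` are hypothetical, the clipping of fitting windows at the two ends of the data is a boundary choice made for this sketch, and `rss` is assumed to be a user-supplied function that builds the curve manifolds for a given v and returns the residual sum of squares of the fitted model.

```python
def central_cutting_windows(n, v):
    """Return (fit_start, fit_stop, region_start, region_stop) index ranges.

    Each local region covers v consecutive positions of T; the curve for it
    is fitted on a window of length 3v centred on that region. Windows are
    clipped at the ends of the data. All ranges are half-open and 0-based.
    """
    windows = []
    for region_start in range(0, n, v):
        region_stop = min(region_start + v, n)
        fit_start = max(region_start - v, 0)
        fit_stop = min(region_stop + v, n)
        windows.append((fit_start, fit_stop, region_start, region_stop))
    return windows

def best_v(rss, candidates=range(5, 15)):
    """Pick the geometrization parameter with the smallest RSS."""
    return min(candidates, key=rss)

windows = central_cutting_windows(n=12, v=3)
for w in windows:
    print(w)  # fitting windows overlap, local regions do not
```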
When we make a prediction, the above work is done under a uniform value correlation T for both test and training data. The uniform T is established from prediction values of Y (rather than from the test and training data of Y themselves), which are derived from the regression model of traditional LR/SVR. Then, under T, all the data are projected onto curve manifolds, and the optimal regression model F is found with the parts of the curve manifolds that correspond to the training data. For example, the curve manifold X_M is constructed for both the test and training data of X according to T. Thus, the test datum x is also projected to x' on X_M (for simplicity, we say that x' is on X_M instead of saying that (x', u) is on X_M), and x' is then used as the input of model F, whose output (e.g., y' = φx' + φ_0) is the prediction.
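The construction of a uniform T for test and training data can be sketched as follows for a single predictor. The function `uniform_value_correlation` is hypothetical, and ordinary least squares via `np.polyfit` plays the role of the traditional LR model.

```python
import numpy as np

def uniform_value_correlation(x_train, y_train, x_test):
    """Order training and test data together by traditional LR predictions."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)  # traditional LR
    x_all = np.concatenate([x_train, x_test])
    y_hat = slope * x_all + intercept          # predictions for all data
    order = np.argsort(y_hat, kind="stable")   # uniform T over all data
    is_test = order >= len(x_train)            # marks test positions in T
    return order, is_test

order, is_test = uniform_value_correlation(
    x_train=[1.0, 2.0, 3.0, 4.0],
    y_train=[2.1, 3.9, 6.2, 7.8],
    x_test=[2.5],
)
print(order, is_test)
```

In this toy case the test datum lands between the second and third training points under T, so those two points are its neighbor training data.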
Experiments on artificial and real data were performed to evaluate the prediction performance of the GcLearn LR and ε-SVR (SVR with radial basis kernels) methods, based on 10-fold cross validation and mean squared errors. For the linear and nonlinear artificial data sets with Gaussian noise variances from 0.2 to 2.2, the experimental results show that GcLearn LR and ε-SVR reduce prediction errors by about 45% to 80% compared with traditional LR and ε-SVR. For benchmark real-world data sets (pyrim, servo, auto-price, cpu, and auto-mpg) from the UCI database (Blake & Merz, 1998), the experimental results show that GcLearn LR and ε-SVR reduce prediction errors by about 4% to 46% compared with traditional LR and ε-SVR.

PERFORMANCE ANALYSIS
To analyze the performance of the GcLearn LR and SVR methods compared with traditional LR and SVR, with prediction errors used for performance evaluation, we have Theorem 3.1 for the lowest performance and Theorem 3.2 for the better performance of the GcLearn LR and SVR methods. Proofs of both theorems are given in Appendices 7.1 and 7.2 respectively.
The performance of the (GcLearn) LR/SVR method is defined as the predictive ability of a regression model built by the (GcLearn) LR/SVR method. The predictive ability is assessed by prediction errors, i.e., the mean squared error between the values predicted by a regression model on test data and the true values of the test data. Thus, the performance analysis refers to the comparison of predictive ability between the GcLearn LR/SVR method and the traditional LR/SVR method.
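For reference, the prediction error used above can be written out explicitly. The symbols N (number of test data), \hat{y}_i (predicted value) and y_i (true value) are introduced here for illustration and are not notation from the paper.

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
```

A method has better performance in the sense of this section when its MSE on the same test data is smaller.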
Data Science Journal, Volume 6, 29 September 2007
It is known that a local region U_i of M corresponds to a piece of quadratic polynomial curve C_i: y = ƒ_i(u) (u ∈ [1, m]) in yOu, and that the local curve y = ƒ_i(u) is fitted to the ith data group with m pairs {y_j, u_j}. The projection of a data group onto a local region of M is called a local projection. Corresponding to the minimum parameter v = 1, the minimal local projection is the one in which every data group with 3v = 3 data points is fitted by a local curve C_i, whose central 1/3 part, with a length of 1, is taken as the local region U_i of M.
Under the minimal local projection, three cases arise. First, the projection points (y_j, u_j) on U_i/C_i are the same as the original data pairs {y_j, u_j} (j = 1, 2, 3) of the ith data group, because a quadratic curve in a plane is determined by three points, so the effect of the geometrical correlation disappears. Second, the integral value of the central part of curve C_i is approximately equal to the original datum y_2, so the geometric regression method is almost the same as the traditional regression method. Third, the following Theorem 3.1 states that the GcLearn LR and traditional LR methods give approximately the same regression equation.
Theorem 3.1 (Minimal geometric regression). Let there be a linear dependent relation for q 1-dimensional continuous variables. If the geometric regression is performed based on the minimal local projection, the resulting regression equation for the q variables is approximately the same as that found by the traditional linear regression method.
Therefore, the GcLearn LR and traditional LR methods have almost the same performance under the minimal local projection for the three reasons above, which indicates the lowest performance of the GcLearn LR and SVR methods. The conclusion is the same for GcLearn SVR under the minimal local projection, because the same reasons apply and because both GcLearn SVR and traditional SVR use the same SVR procedure to infer a regression model.
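The first two cases of the minimal local projection can be checked numerically. The sketch below uses hypothetical data and assumes that the central third of the local curve is the interval [1.5, 2.5] in the local coordinate u; under that assumption the integral over the central part equals y_2 plus 1/24 of the second difference y_1 - 2y_2 + y_3.

```python
import numpy as np

y = np.array([2.0, 3.5, 3.0])        # one data group with 3v = 3 points
u = np.array([1.0, 2.0, 3.0])        # local coordinates u_j = j

coeffs = np.polyfit(u, y, deg=2)     # quadratic through the three points
projected = np.polyval(coeffs, u)    # projection points on C_i

# Integral of the curve over the central third (length 1, assumed [1.5, 2.5]).
anti = np.polyint(coeffs)
central_integral = np.polyval(anti, 2.5) - np.polyval(anti, 1.5)

# Closed form under the same assumption.
closed_form = y[1] + (y[0] - 2 * y[1] + y[2]) / 24

print(projected, central_integral, closed_form)
```

The projected points coincide with the original data, and the central integral differs from y_2 only by a small curvature term, which is why the geometric regression nearly reduces to the traditional one when v = 1.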
The better performance of GcLearn under a larger local projection (v > 1) is discussed next. GcLearn uses inner correlations between test and training data. Both the test and the training data are projected onto the same curve manifolds, so the test and training data are correlated through the geometrical correlation, i.e., the curve manifolds. Under a larger local projection, the geometrical correlation takes effect: the test datum x of predictor variable X is projected to x' on X_M (for simplicity, we say that x' is on X_M instead of saying that (x', u) is on X_M), and x' is then used as the input of a regression model.
Let there be a good inner correlation T between test data and training data (i.e., test data are close to some training data under T). For GcLearn LR, it is expected that the prediction of the GcLearn LR model on a test datum lies on the Z_M of response variable Z (this is the first condition), which means that this prediction is close to its neighbor training data under T and therefore tends to lie near its true value. The second condition is that the traditional LR method gives a prediction for the same test datum that is far away from its neighbor training data under T, which means that this prediction is far away from its true value. When the two conditions are satisfied, we can conclude that the GcLearn LR method has better prediction performance than the traditional LR method.
The following Theorem 3.2 states that GcLearn gives its prediction z'_k on Z_M under a good inner correlation T even if the test datum x_k of X is far away from X_M, which indicates that GcLearn LR satisfies the first condition.
Therefore, by combining Theorem 3.2 with the second condition, we conclude that the GcLearn LR method has better prediction performance than the traditional LR method when good inner correlations are obtained and the predictions by the traditional LR method are far away from their neighbor training data under the correlation T.
Theorem 3.2 (Geometric projection). Let a linear dependent relation be given for q 1-dimensional continuous variables, each with n+1 data points and a good inner correlation T. Let curve manifolds X_M and Z_M be constructed under T for predictor variable X and response variable Z respectively. Let there be a test datum x_k and n training data points {x_i} (i = 1, …, n+1, i ≠ k, n > 3) for variable X. If x_k being far away from X_M makes the prediction z_k by traditional linear regression far away from Z_M, GcLearn linear regression will give its prediction z'_k on Z_M.
The conclusion is the same for GcLearn SVR, as GcLearn projects the data onto the curve manifolds of the training data and transforms the x_k deviating from X_M to the x'_k on X_M, so the prediction z'_k by […]

CONCLUSION
The theoretical analysis demonstrates that the GcLearn LR and SVR methods have better prediction performance than the traditional LR and SVR methods for prediction tasks when good inner correlations are obtained and the predictions by the traditional LR and SVR methods are bad, that is, far away from their neighbor training data under the inner correlations. This conclusion also indicates the conditions under which the GcLearn method is applicable.

Proof of Theorem 3.1
Proof. Let n groups of data D_i = {x_1^i, …, x_q^i} (i = 1, …, n) corresponding to q variables X_1, …, X_q be given and correspond to their inner correlation T = {1, 2, 3, …, n}. Using these data, we can find a linear regression equation of the q variables by minimizing the least-squares regression error (the traditional linear regression method):
f(X_1, …, X_q) = 0. (11)
Without loss of generality, let q = 2 and let the linear regression equation between response variable Y and variable X be
L_1: y = φx + φ_0. (12)
By the method of least-squares fit, the coefficients of regression equation L_1 may be found with n pairs of data, in the standard form φ = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)^2 and φ_0 = ȳ − φx̄, where ȳ denotes the mean of y_i and x̄ the mean of x_i. Then, we have the resulting regression equation L_2 […]
Unlike the traditional method above, which uses the statistics of the n pairs of data D_i, the geometric regression uses curve manifolds and the integrals of curves. Now consider the geometric regression method under the minimal local projection. Let k = n data divisions of variable X under T be ready under the minimal local projection. Then every three neighbors x_{i-1}, x_i and x_{i+1} are fitted by a local quadratic polynomial curve x = g_i(u) = au^2 + bu + c in the local coordinate system xOu, where the correlation coordinate u ∈ {u_0, u_0+1, u_0+2}. Then, the k local curves are joined to form the curve manifold X_M according to correlation T. Similarly, the local quadratic polynomial curves y = h_i(u) and the curve manifold Y_M are ready for variable Y.
Let L_3: Y = φX + φ_0 be the linear regression equation between variables X and Y (their curve manifolds are X_M and Y_M). Let every curve manifold comprise k local regions and the local region of […] and l = 1, …, k […] be described by k = n local curves. The means of the x values of X_M and the means of the y values of Y_M are set as follows: […] Then the coefficients of the regression equation may be found as per the theorem of Section 2 […]
Let the same integral region [u_0, u_0+2] be adopted to cover three neighbor data D_{i-1}, D_i and D_{i+1}. Owing to the central cutting of a local curve for overlapping data divisions, x_i = g_i(u_0+1) may be used as the approximation of any x on […] (21)
Thus, L_3 is approximately the same as L_2.