A METAHEURISTIC REGRESSION-BASED FEATURE SELECTION FOR PREDICTIVE ANALYTICS

Selecting an optimal feature subset from a high-dimensional dataset with a very large number of features is an NP-complete problem. Because conventional optimization techniques are unable to tackle large-scale feature selection problems, meta-heuristic algorithms are widely used. In this paper, we propose a particle swarm optimization technique that utilizes regression techniques for feature selection. We then use the selected features to classify the data. Classification accuracy is used as the criterion to evaluate classifier performance, and classification is accomplished through the use of k-nearest neighbour (KNN) and Bayesian techniques. Various high dimensional data sets are used to evaluate the usefulness of the proposed approach. Results show that our approach gives better results when compared with other conventional feature selection algorithms.


INTRODUCTION
The rise of advanced data gathering techniques in fields such as bioinformatics, sensor networks, and customer relationships has led to challenges in high dimensional data (Kriegel & Zimek, 2012). None of the large amounts of available data can be directly understood by analysers, researchers, or data scientists. Fortunately, computational technologies, data mining, and machine learning algorithms are improving to keep up with this increase in data volume. For example, one problem found in the field of bioinformatics is high dimensional datasets, that is, data sets having a very large number of features or attributes (Kriegel, 2009; Ding, 2003). Gene microarray datasets are an example of this type of problem. For each tissue sample, a gene microarray captures gene expression levels for tens of thousands of gene probes. In practice, however, only a small handful of these genes are actually relevant to answering a specific underlying biological question. High dimensionality, i.e., a large number of features, is a major problem in data mining fields and consumes a large amount of computation time, affecting the quality of training datasets as well as classification models. Because of "the curse of dimensionality" (Verleysen, 2005), all significant techniques for predictive and descriptive analysis become insignificant with these data volumes (Houle, 2010).
In this paper, we address the problem of selecting an optimized number of features from a high dimensional data set. The process of feature selection can be described as a search in a state space. A number of approaches are possible. A heuristic search, for instance, considers unselected features for evaluation at each of a number of iteration steps. A random search, on the other hand, generates random subsets within the search space. Several bio-inspired and genetic algorithms use this approach. Each of these methods has its limitations. Here, we propose a new approach combining particle swarm optimization (PSO) with regression techniques to improve feature selection, which is measured by the performance of a classifier.
A problem in medical analysis illustrates the importance of feature selection. One typical medical dataset consists of patient observations, each containing m clinical characteristics (features). This m-dimensional dataset is a union of two disjoint sets. One represents a "positive" group of patients having a specific medical condition or disease. The other is a "negative" group of patients who do not have that condition or disease. Medical diagnosis and prognosis have been shown to be improved by applying data classification and identifying significant features in such datasets in clinical settings (Hammer, 2006; Saastamoinen, 2006; Tsirogiannis, 2004). To improve medical diagnosis, data mining techniques can be used to identify a disease from its symptoms and make the decisions necessary to diagnose a patient. Similarly, data mining techniques may also be used in forecasting the probable outcome of a disease. Data mining here, however, is useful only if the selected features effectively identify a disease or correctly forecast a disease outcome.
As shown in Figure 1, the purpose of feature selection is to find relevant and important features in the original dataset that are more significant than previously recognized (original) data patterns. We perform feature selection to reduce the size of the dataset and improve the computational performance of analytical methods. Feature selection is also an exceptionally effective and valuable technique to improve classification accuracy by reducing the number of irrelevant and redundant features and identifying those that are most important. If a dataset has a large number of features, the dimensions of the working data will be large, and the dataset will contain noisy, irrelevant, and redundant data that degrade the predictive accuracy of the classifiers. Therefore, an efficient and robust feature selection method is sought that reduces noisy, irrelevant, and redundant data.
Conventional mathematical, statistical, and analytical methods are often not able to analyze the complex systems of biological medicine and other fields. For example, analyzing high dimensional data in biomedical fields can produce vagueness, ambiguity, partial truths, and approximation (Zimek, 2012). To overcome this problem, Particle Swarm Optimization (PSO)-based approaches have previously been used in the selection of an optimized number of features (Agrafiotis, 2002; Fan, 2010; Elbedwehy, 2012). This meta-heuristic feature selection technique can be used to eliminate noise and irrelevant and redundant data (Agrafiotis, 2002; Song, 2004), yet this technique is both challenging and less productive for high dimensional data. Soft Computing, which uses estimation, may be an alternative means of solving these problems. Some machine learning techniques are also able to tackle these datasets.
In this paper we describe a better technique for selecting significant features from high dimensional scientific datasets. The main focus of this work is:
1. To identify important features effectively and efficiently
2. To increase the classification accuracy of identified features
3. To deal with irrelevant and redundant features to obtain a good feature subset
4. To keep only those features that are obtained after a double filtration process
5. To evaluate the accuracy of our feature selection method by comparison with other common feature selection algorithms, using Naive Bayes and K-Nearest Neighbour classifiers
This paper proposes a Meta-Heuristic Regression Based Feature Selection approach for feature selection in high dimensional datasets. We used a regression model to establish the relationship between the number of features and classification accuracy by reducing the size of testing data and to verify whether a feature is selected. With the help of the regression model, this modified PSO approach can increase population diversity and improve global searching capability, thereby avoiding inaccurate convergence in the PSO mechanism. In this paper, we use the terms features, dimensions, and attributes interchangeably. We also use 'FS' as an abbreviation of feature selection.

[Figure 1: High Dimension Dataset; Relevant Feature Generation]

RELATED WORK
The process of feature selection is responsible for selecting a subset of features that describe the important aspects of a dataset. Feature selection can be considered as a search in a state space. One can perform a full search in which all the space is traversed; however, this approach is impractical for a large number of features.
At each iteration, a heuristic search considers the features not yet selected for evaluation. A random search generates random subsets within the search space that can be evaluated for importance. Several bio-inspired and genetic algorithms use this approach (Nakamura, 2012).
Feature selection methods can be classified into two main categories: filter approaches and wrapper approaches (Song, 2013). In filter based techniques, a filtering process is performed before the classification process; therefore, the selected features are independent of the classification algorithm used (Xue, 2012). A ranking value or weight value is computed for each feature, and those features whose ranking or weight values exceed a user defined threshold value are selected to represent the original data set. On the other hand, wrapper approaches make use of a learning process to select a subset of features by adding and removing features so as to maximize learning accuracy. Wrapper methods are usually more effective than filter methods. Kriegel (2009), Ding (2003), Agrafiotis (2002), and Elbedwehy (2012) have researched the feature selection problem. Genetic Algorithm (GA) and PSO are basic techniques that are meta-heuristic approaches (Bloomfield, 2010). Because PSO approaches converge more quickly and have lower computational complexity, we have chosen to use PSO in our proposed feature selection approach. Agrafiotis and Cedeno (2002) first applied PSO for feature selection. They devised structure-property and structure-activity correlation models for computer assisted drug design, a common technique in the pharmaceutical industry to correlate biological activity with compound properties by identifying key features. Kennedy (2001) used the influence of a neighbouring population as particle swarms move around a search space, in which a population of individuals settles stochastically toward previously successful regions. This method was initially proposed for probing multidimensional continuous datasets and was applied to feature selection by using the vector properties of the particles as probabilities. In their experimental analysis, the method compared favourably with simulated annealing techniques and identified an improved and more varied set of results, given the same amount of simulation time.
In the field of medicine, Melgani and Bazi (2008) used PSO in classifying ECG (electrocardiogram) beats and showed the advantage of the generalization capability with another classification algorithm, the Support Vector Machine (SVM) approach. In this approach, a classifier was optimized by tuning its discriminant function upstream by looking for the best subset of features that feed the classifier. In particular, sensitivity was tested for the SVM classifier using three different base classifiers: k-Nearest Neighbour, RBF, and NN.
The Adaptive Michigan PSO (AMPSO) proposed by Cervantes (2009) used a number of different PSO versions. Each particle in a swarm represents a single prototype, and the method is used in continuous classification problems. To overcome the risk of premature convergence, previous studies (Kennedy, 2001; Engelbrecht, 2007) have suggested changing traditional PSO operations to regroup swarms within a plausible subset of the original search space.
Nearest prototype methods (Cervantes, 2009) achieved reasonable results with various pattern based classification approaches. In this method, a number of prototypes were found that represented the input samples accurately. In these approaches, the classifier assigns classes based on the nearest neighbour. AMPSO is different from a simple PSO because each particle in a swarm represents a single prototype in the solution. In AMPSO, each particle acts as a local classifier and thus cannot converge to a single solution. Therefore the whole swarm is considered for the solution. Comparison with other classifier methods showed that AMPSO gives competitive results in all the problems, particularly where the k-NN classifier does not perform effectively.
In other variations of PSO, Cervantes (2007, 2009), Fan (2010), Elbedwehy (2012), and Tasgetiren (2004) proposed a new optimization framework for improving feature selection in medical data classification. This framework sought to identify the optimal group of features showing strong divisive power between two classes. They concluded that this method can be used as a quick decision-making tool in real clinical settings.
In the next subsections we describe the simple PSO method and the classification methods that we have used in our approach. The proposed approach is defined in Section 3.

Simple Particle Swarm Optimization (SPSO)
The PSO algorithm uses a population (called a swarm) of individual solutions (called particles) to find the best swarm solution iteratively. An initial solution is proposed for each particle (location and velocity) and then tested to see if a better overall solution (for the swarm) can be found according to some criteria. In PSO, each particle flies in the search space with a velocity adjusted by its own and its companions' history. In every iteration, each particle is updated by following two "best" values. The first one is the best solution (fitness) it has achieved so far. (The fitness value is also stored.) This value is called $p_{id}$ (pbest). Another "best" value that is tracked by the particle swarm optimizer is the best value obtained so far by any particle in the population. This best value is a global best and is called $p_{gd}$ (gbest). Each particle has an objective function value, which is decided by a fitness function. The velocity of each particle is updated as:
$$v_{id}(t) = v_{id}(t-1) + c_1 r_1 \left(p_{id} - x_{id}(t-1)\right) + c_2 r_2 \left(p_{gd} - x_{id}(t-1)\right)$$
where $i$ represents the $i$th particle and $d$ is the dimension of the solution space, $c_1$ denotes the cognitive learning factor, $c_2$ indicates the social learning factor, $r_1$ and $r_2$ are uniformly distributed random numbers in [0,1], $p_{id}$ and $p_{gd}$ stand for the position with the best fitness found so far for the $i$th particle and the best position in the neighbourhood, $v_{id}(t)$ and $v_{id}(t-1)$ are the velocities at time $t$ and time $t-1$, respectively, and $x_{id}(t)$ is the position of the $i$th particle at time $t$. Each particle then moves to a new potential solution depending on the following equation:
$$x_{id}(t) = x_{id}(t-1) + v_{id}(t)$$
Kennedy (1997) proposed a binary PSO in which a particle moves in a state space restricted to 0 and 1 in each dimension, in terms of the changes in probabilities that a particle (bit) will be in one state or the other:
$$x_{id}(t) = \begin{cases} 1, & \text{if } rand() < S(v_{id}(t)) \\ 0, & \text{otherwise} \end{cases} \qquad S(v) = \frac{1}{1 + e^{-v}}$$
The function $S(v)$ is a sigmoid limiting transformation and $rand()$ is a random number selected from a uniform distribution in [0.0, 1.0].
When applying PSO to the problem of feature selection, we use a binary digit to represent a feature. The bit values 0 and 1 represent non-selected and selected features, respectively. Each particle is coded as a binary alphabetic string. The PSO for the problem of feature selection in this study is called simple PSO (SPSO) (Wang, 2007). For example, the particle 101000 with six features means that the first and third features are selected.
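The binary update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the velocity clamp `v_max`, and the default `c1 = c2 = 2.0` (taken from the experimental settings reported later) are choices made here for the example.

```python
import math
import random


def sigmoid(v):
    """Sigmoid limiting transformation S(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(x, v, pbest, gbest, c1=2.0, c2=2.0, v_max=4.0, rng=random):
    """One binary-PSO update of a single particle.

    x, pbest and gbest are lists of 0/1 bits; v is a list of real velocities.
    v_max is an assumed clamp that keeps S(v) away from 0 and 1; it is not
    taken from the text. Returns the new (x, v).
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        # Velocity update: previous velocity plus cognitive and social pulls.
        vd = v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        vd = max(-v_max, min(v_max, vd))
        new_v.append(vd)
        # Binary position update: the bit is 1 with probability S(vd).
        new_x.append(1 if rng.random() < sigmoid(vd) else 0)
    return new_x, new_v


def selected_features(x):
    """Return the 1-based indices of the features whose bit is 1."""
    return [d + 1 for d, bit in enumerate(x) if bit == 1]
```

For the particle 101000 used as an example in the text, `selected_features([1, 0, 1, 0, 0, 0])` returns the first and third features.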

k-NN Classifier
The k-nearest neighbour (k-NN) technique was described by Fix and Hodges (Fix, 1951). It is a fundamental technique in data mining and machine learning and has been applied in many domains. This method classifies new cases based on similarity measures (e.g., distance functions). The output is a class membership showing an inclination toward one class over the others. A majority vote of its neighbours decides the class. K, the number of nearest neighbors that need to be considered, is a positive integer that can be assigned by the user or automatically by the program.
If the predicted class of a test example matches the expected class of the pattern, it is counted as a correctly predicted example. The fitness function is defined as the accuracy of classification, where accuracy is defined as the number of correctly predicted examples divided by the total number of examples.
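As an illustration of this fitness definition, the following Python sketch scores a binary feature mask by the accuracy of a k-NN classifier restricted to the selected features. It is a simplified sketch, not the code used in the paper: the function names, the squared Euclidean distance, and the convention that an empty mask scores 0 are assumptions made for the example.

```python
from collections import Counter


def knn_predict(train_X, train_y, query, k, mask):
    """Predict the class of `query` by majority vote among its k nearest
    training examples, measuring distance only on features with mask bit 1."""
    selected = [d for d, bit in enumerate(mask) if bit == 1]

    def dist2(a, b):
        # Squared Euclidean distance over the selected features (assumed metric).
        return sum((a[d] - b[d]) ** 2 for d in selected)

    nearest = sorted(range(len(train_X)), key=lambda i: dist2(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def accuracy_fitness(train_X, train_y, test_X, test_y, mask, k=5):
    """Fitness = correctly predicted examples / total examples."""
    if not any(mask):
        return 0.0  # assumed convention: an empty feature subset is worthless
    correct = sum(
        1
        for query, label in zip(test_X, test_y)
        if knn_predict(train_X, train_y, query, k, mask) == label
    )
    return correct / len(test_y)
```

A particle whose mask keeps an informative feature and drops a noisy one should score at least as well as the full feature set on data where the noisy feature misleads the distance computation.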

Naive Bayesian Classifier
A naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem and is especially well suited to cases where the input data are highly dimensional. The naive Bayes classifier assumes that the value of a particular attribute is independent of the presence or absence of any other attribute, given the class attribute. In spite of its simple approach, the Naive Bayes approach often outperforms more complicated classification methods (Langley, 1992). Naive Bayes classifiers make significant use of the assumption that all input features are conditionally independent, i.e., that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class label. Only a small amount of training data is required to classify correctly with Naive Bayes. However, the hypothesis of conditional independence does not hold in various real-world problems where relationships are present between the input features. The algorithm begins by choosing a training set from the full dataset, which is analyzed to produce a subset of relevant features. These results are then tested using a new test set, which is also a subset of the entire dataset.
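For reference, the conditional independence assumption can be written out explicitly. The notation below is chosen for this illustration and is not taken from the paper: $c$ is a class label, $x_1, \dots, x_m$ are the $m$ feature values of one example, and $\hat{c}$ is the predicted class.

```latex
P(c \mid x_1, \dots, x_m) \;\propto\; P(c) \prod_{j=1}^{m} P(x_j \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c} \; P(c) \prod_{j=1}^{m} P(x_j \mid c)
```

Because the product runs over individual features, each $P(x_j \mid c)$ can be estimated separately, which is why comparatively little training data is needed even when $m$ is large.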

PROPOSED APPROACH AND ALGORITHM
Then the training data and the test data are converted to a reduced new training set and a new test set by eliminating the features that have not been selected. A classification algorithm is trained (learned) from the converted training data. The trained learning algorithm is then applied to the converted test data to obtain the final testing classification performance.
The proposed method for feature selection makes extensive use of a regression model to select a subset of features. The mathematical model of the proposed method is based on a simple concept derived from the PSO algorithm that utilizes each and every particle to search out local space and find the mutual understanding of each particle. The flow diagram of the simple particle swarm optimization (SPSO) method is depicted in Figure 2. The concept can be described as follows: the classification accuracy $y_i$ is used as the dependent variable while the binary variables $x_{ij}$ (the bit of particle $i$ for feature $j$, written $x_{id}$ in Section 2.1) are treated as independent variables. Therefore, the regression model can be defined as:
$$y_i = \sigma + \sum_{j} \lambda_j x_{ij}$$
where $\sigma$ is the intercept, and the $\lambda$s are regression coefficients. The assumption is that if a feature's contribution to the accuracy is positive, then the value of $\lambda_j$ should be positive. Some features have a positive coefficient and would increase the accuracy but fail to be selected by the simple PSO, i.e., $\lambda_j > 0$ and $x_{ij} = 0$. Such features must be added to the selected list to check whether the resulting subsets increase the accuracy rate. On the other hand, if the PSO selects some features that have negative coefficients, i.e., $\lambda_j < 0$ and $x_{ij} = 1$, this can reduce the accuracy. Thus these features should be eliminated from the selected list. The proposed approach works with the help of a regression method, which gives more accurate results. The MHRFS method is described as follows:
MHRFS Algorithm:
1. Calculate accuracy $y_i$ for each particle, $i = 1, \dots, N$.
2. Find the coefficient $\lambda_j$ of each feature by meta-heuristic (PSO) model.
3. Let $x_i^{new} = x_i$.
4. If $\lambda_j > 0$ and $x_{ij} = 0$, then $x_{ij}^{new} = 1$.
5. If $\lambda_j < 0$ and $x_{ij} = 1$, then $x_{ij}^{new} = 0$.
6. If the accuracy value $y_i$ is less than $y_i^{new}$, then $x_{ij} = x_{ij}^{new}$ and the fitness value $y_i$ is set to the accuracy value $y_i^{new}$.
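The regression-guided correction described in this section can be sketched as follows. This is a hedged illustration, not the authors' code: the coefficients are estimated here by ordinary least squares with a small ridge term (an assumption added to keep the normal equations solvable), and `fit_linear`, `correct_mask`, `mhrfs_step`, and the `evaluate` callback are hypothetical names introduced for the example.

```python
def fit_linear(masks, accs, ridge=1e-6):
    """Fit acc_i ~ sigma + sum_j lambda_j * x_ij by least squares.

    Solves the normal equations with Gauss-Jordan elimination; `ridge` is an
    assumed small regulariser. Returns (sigma, lambdas).
    """
    rows = [[1.0] + [float(b) for b in m] for m in masks]  # intercept column first
    p = len(rows[0])
    aug = [
        [sum(r[a] * r[b] for r in rows) + (ridge if a == b else 0.0) for b in range(p)]
        + [sum(r[a] * y for r, y in zip(rows, accs))]
        for a in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * w for u, w in zip(aug[r], aug[col])]
    coef = [aug[a][p] / aug[a][a] for a in range(p)]
    return coef[0], coef[1:]


def correct_mask(mask, lambdas):
    """Add unselected features with lambda > 0; drop selected ones with lambda < 0."""
    new = list(mask)
    for j, lam in enumerate(lambdas):
        if lam > 0 and mask[j] == 0:
            new[j] = 1
        elif lam < 0 and mask[j] == 1:
            new[j] = 0
    return new


def mhrfs_step(masks, accs, evaluate):
    """Apply the correction to every particle and keep it only if accuracy improves.

    `evaluate(mask)` is a caller-supplied function returning classification accuracy.
    """
    _, lambdas = fit_linear(masks, accs)
    result = []
    for mask, acc in zip(masks, accs):
        candidate = correct_mask(mask, lambdas)
        candidate_acc = evaluate(candidate)
        if acc < candidate_acc:
            result.append((candidate, candidate_acc))
        else:
            result.append((list(mask), acc))
    return result
```

Because a corrected particle is accepted only when its accuracy is strictly higher, the step never lowers any particle's fitness; the cost is one extra classifier evaluation per particle.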

EXPERIMENTAL RESULTS
To investigate the effectiveness of the proposed approach, we used seven data sets. Classification accuracy is used as the evaluation criterion with the first nearest neighbor used to measure the accuracy. In addition, 10-fold cross validation and random sampling were utilized.

Data Sets
The seven data sets from the UCI repository (Bache, 2013) have sizes ranging from hundreds to thousands of data items and are described in Table 1. The seven data sets cover a wide variety of measurements and have been the subject of extensive studies for high dimensional systems, serving as a test bed for many PSO-based feature selection algorithms (Azevedo, 2007; Marinakis, 2008; Yang, 2008). To allow comparison with previous PSO based approaches, a number of input features were taken from the literature (summarized in Table 1).
In the experiments, all of the data in each data set were randomly divided into two sets: 70% as the training set and 30% as the test set. During the training process, each particle (individual) represented one feature subset.
The classification performance of a selected feature subset was evaluated by 10-fold cross-validation on the training set. Note that 10-fold cross-validation was performed as an inner loop in the training process to evaluate the classification performance of a single feature subset on the training set; it did not generate ten feature subsets. After the training process, the selected features were evaluated on the test set to obtain the testing classification error rate. All of the algorithms were wrapper approaches, i.e., they required a classification algorithm in the training process to evaluate the classification performance of the selected feature subset. Any classification algorithm can be used here, such as Naive Bayes, Decision Tree, or Support Vector Machine. Two of the simplest and most commonly used classification algorithms, K-nearest neighbour (KNN) and a Bayesian classifier (Langley, 1992; Chuang, 2011), were used in the experiments. We set K=5 in the KNN classifier to simplify the evaluation process, and implemented the process in the Java machine learning library (Java-ML) (Abeel, 2009).
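The evaluation protocol above (a 70/30 split, with 10-fold cross-validation used only as an inner loop on the training portion) can be sketched in Python as follows. The function names and the `score` callback are hypothetical, and the interleaved fold assignment is one simple choice among several; the experiments themselves were reported as using Java-ML.

```python
import random


def train_test_split(n_items, train_fraction=0.7, seed=0):
    """Randomly split item indices into a training part and a test part."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * n_items))
    return idx[:cut], idx[cut:]


def k_fold(indices, k=10):
    """Yield (fit_idx, validation_idx) pairs; fold sizes differ by at most one."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, folds[i]


def cv_fitness(train_idx, mask, score, k=10):
    """Mean validation score of ONE feature subset over the k inner folds.

    `score(fit_idx, val_idx, mask)` is a caller-supplied function that trains a
    classifier on fit_idx and returns its accuracy on val_idx.
    """
    scores = [score(fit, val, mask) for fit, val in k_fold(train_idx, k)]
    return sum(scores) / len(scores)
```

Note that `cv_fitness` returns a single number for a single mask, which matches the remark that the inner loop does not generate ten feature subsets; the held-out 30% is touched only once, after the search has finished.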
The proposed MHRFS, based on the PSO algorithm for the feature selection problem, was implemented in C and run on an Intel i7 2.6 GHz machine with 4 GB RAM. The MHRFS was evaluated against a conventional genetic algorithm (GA). For this purpose, a GA was implemented in C and testing was done on randomly distributed data. The GA was a conventional one with uniform crossover, simple inversion mutation, and tournament selection of size 2. In the experimental analysis, we defined some parameters for the conventional GA and the proposed approach. In the proposed MHRFS, the size of the population in the swarm was taken to be twice the number of features. The learning factors c1 and c2, the cognitive and social parameters respectively, were kept at 2. For the conventional GA and the proposed MHRFS, the size of the population was kept the same. The crossover and mutation rates were 0.70 and 0.10, respectively. On average, the GA and PSO techniques were executed for 1-50 iterations. Table 2 gives the accuracy rate for different iterations of the data sets presented in Table 1.
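For readers unfamiliar with the baseline, the GA operators named above can be sketched in Python. This is an illustrative reconstruction from the operator names and the stated rates (0.70 crossover, 0.10 mutation, tournament size 2), not the authors' C implementation; the function names and the way the two rates are applied per child are assumptions.

```python
import random


def uniform_crossover(a, b, rng):
    """Each bit of the child comes from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def inversion_mutation(chrom, rng):
    """Reverse the order of the bits between two random cut points."""
    i, j = sorted(rng.sample(range(len(chrom) + 1), 2))
    return chrom[:i] + chrom[i:j][::-1] + chrom[j:]


def tournament_select(population, fitness, rng, size=2):
    """Pick `size` distinct individuals at random and return the fittest."""
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: fitness[i])
    return population[best]


def next_generation(population, fitness, rng, p_cross=0.70, p_mut=0.10):
    """Build a new population of the same size from the current one."""
    children = []
    while len(children) < len(population):
        p1 = tournament_select(population, fitness, rng)
        p2 = tournament_select(population, fitness, rng)
        child = uniform_crossover(p1, p2, rng) if rng.random() < p_cross else list(p1)
        if rng.random() < p_mut:
            child = inversion_mutation(child, rng)
        children.append(child)
    return children
```

Inversion only reorders bits, so it changes which features are selected without changing how many; crossover is what mixes subset sizes between parents.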

Comments on Selected Features
Using our proposed algorithm, we achieve a high classification rate for the combination of a small number of features. However, although the results show consistent classification accuracy with any increment in the subset of features, the time consumption increases rapidly. In some cases, as more features are included, the classification rate tends to slow down. For example, the Sonar dataset described in Table 1 produced an accuracy rate of 59.16% in 18 iterations whereas increasing the number of iterations to 32 gave a 61.23% accuracy rate. The proposed algorithm runs iteratively for selecting feature subsets. Although some of the features may be common each time, distinct features have been selected by the proposed algorithm. It should be noted that the governing features can be estimated within the feature subset by calculating the optimal number of iterations that also gives high classification accuracy. To calculate the optimal number of iterations, we select feature subsets through our algorithm until it produces constant accuracy. As the number of iterations increases, the classification rate becomes stable, while smaller data sets show stable accuracy from the beginning. Some of the high dimensional dataset characteristics are summarised in Table 1. On average, our proposed approach performed well in selecting representative features, as described in Table 2, for the data sets mentioned in Table 1.
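The idea of running until the accuracy becomes constant can be made concrete with a small stopping rule. The sketch below is only one possible reading: the tolerance `tol` and the number of unchanged iterations required (`patience`) are assumptions introduced here, since the text does not state how constancy was judged.

```python
def iterations_until_stable(accuracies, tol=0.0, patience=3):
    """Return the 1-based iteration at which accuracy starts a plateau.

    A plateau is `patience` consecutive iteration-to-iteration changes that are
    all within `tol`. Returns None if no such plateau occurs.
    """
    run = 0
    for t in range(1, len(accuracies)):
        if abs(accuracies[t] - accuracies[t - 1]) <= tol:
            run += 1
            if run >= patience:
                return t - patience + 1
        else:
            run = 0
    return None
```

A larger `patience` guards against stopping on a short flat stretch before a later improvement, at the cost of extra iterations.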

Evaluation based on classification accuracy
Using feature selection and constraint optimization should increase the classification rate performance as well as decrease the response time. The proposed algorithm selects very few important features and support vectors and should reduce size and time of execution and also improve classification accuracy. We compared MHRFS with SPSO, RBPSO, and BRPSO (Chen, 2013) and checked these algorithms for different iterations. Our algorithm outperforms the others some of the time, but it is difficult to say that one method achieves better accuracy for all the datasets and with every iteration. Two classifiers (Bayesian and k-NN) were used to measure the accuracy of the feature selection method. We found that our algorithm gives good results with both classifiers.

CONCLUSION
Despite much research on PSO-based feature selection in the field of machine learning, there is still a shortage of high quality analytical techniques for high dimensional datasets. It is unclear how to construct a better feature selection algorithm for a specific parameter setting and classifier. In this paper, we evaluated our MHRFS technique against other well known feature selection techniques. For evaluation, we used two classifiers: k-NN and Naive Bayes. For testing purposes, we used three microarray and three non-biological but high dimensional datasets. We found that optimization of our feature selection algorithm sometimes increases the accuracy of the prediction in a comparatively reduced time span and shows good accuracy in most cases.
The proposed approach could be used as a pre-processing tool to facilitate the optimization of feature selection methods as it can be used to increase classification accuracy.

Figure 2. Flow diagram of SPSO (Simple Particle Swarm Optimization) for feature selection

A new algorithm, the Meta-Heuristic Regression Based Feature Selection (MHRFS), is proposed to investigate and improve the performance of PSO for feature selection. An overview of a PSO based feature selection algorithm has been given above. The basic PSO based algorithm (SPSO) is described as the baseline to test the performance of the newly proposed algorithm. A new fitness function, new initialisation strategies, and new pbest and gbest updating mechanisms are then proposed to improve the performance of PSO for feature selection. The terms pbest and gbest are defined in Section 2.1. The framework of the training and testing process of a PSO based feature selection technique is shown in Figure 2.

Figure 2. The Framework of PSO based feature selection methods

Table 1. Data sets and their characteristics

Table 2. The number of selected features by the proposed MHRFS method, including the classification rate for the original data set (applied before feature selection). The classification accuracy with the feature selection technique is presented in Table 3.
Table 2 lists the prediction accuracy rate for different iterations in a 10-fold cross validation. We compared our algorithm (MHRFS) to other well known feature selection algorithms: Simple PSO (SPSO; Wang, 2007), Regression Based PSO (RBPSO; Chen, 2013), and Backward Regression Based PSO (BRPSO; Chen, 2013).

Table 3. Classification accuracy comparison between different PSO-based feature selection approaches