Shapelet Classification Algorithm Based on Efficient Subsequence Matching

Shapelet classification algorithms are an accurate method for classifying time series data. However, existing shapelet classification is slow because it requires a large number of costly distance computations. This paper therefore introduces the piecewise aggregate approximation (PAA) representation and an efficient subsequence matching algorithm into shapelet classification, and proposes a shapelet transformation classification algorithm based on efficient subsequence matching. The proposed algorithm first applies the PAA representation for dimension reduction and then uses the subsequence matching algorithm to simplify the distance computations in the classification process. We experimented on 13 public time series datasets taken from UCI and UCR, classified them with both the original and the new algorithm, and compared the efficiency and accuracy of the two methods. The experimental results show that the efficient subsequence matching algorithm can be combined with the shapelet classification algorithm; the new algorithm maintains relatively high classification accuracy while simplifying the computation and improving classification efficiency.


Introduction
As a type of high-dimensional massive data, time series are common in fields such as meteorology, finance, geology, medicine, electronic information, and network security. They are also a major research subject in data mining (Esling and Agon 2012). Time series research includes similarity search (Rakthanmanon et al. 2012), clustering (Aghabozorgi and Wah 2014), classification (Petitjean et al. 2015), pattern recognition (Begum and Keogh 2014), and prediction (Aljumeily and Hussain 2015). Among these, time series classification (TSC) has become a hot topic because it underlies many of the other tasks. Time series classification learns, from training sets with known class labels, the features that distinguish different time series, and then automatically assigns class labels to unlabeled time series.
Initially, researchers used the nearest neighbor algorithm for time series classification (Ding et al. 2008; Batista et al. 2011; Deng et al. 2013; Alonso et al. 2005; Jeong et al. 2011; Buza 2011). Although the nearest neighbor algorithm is simple and involves few parameters, it needs to store and search the entire dataset during classification, which results in relatively high time and space complexity. Researchers also hoped to achieve high classification accuracy and to derive implicit information about the data; the nearest neighbor algorithm could not provide this. Additionally, these methods often produced unsatisfactory results when time series were very similar, because noise could obscure the subtle differences between them. Therefore, the nearest neighbor algorithm was not effective at classifying time series that had only subtle differences.
Researchers have therefore been looking for a classification algorithm that handles such time series better. Ye and Keogh (2009) first introduced shapelet algorithms to classify time series that differ only in small local regions. Shapelet algorithms use local time series fragments for classification, which reduces the influence of noise and leads to better accuracy and robustness.

Shapelet classification also produces results with higher explanatory power, which clearly show class differences and help researchers better understand the data. Since then, shapelet classification algorithms have been widely used in various fields involving time series studies (Hartmann 2010; Xing et al. 2011; Shajina et al. 2012). Compared with existing classification methods, shapelet time series classification algorithms were more accurate, but the shapelet extraction process was slow, which made it prohibitive for very large datasets. Therefore, shapelet classification research has mostly focused on accelerating the extraction process. Ye and Keogh (2011), Mueen (2011), He (2012), Rakthanmanon (2013), and other researchers proposed improved algorithms that expedited the process. Lines and Bagnall (2012) comprehensively analyzed the pros and cons of several quality metrics used during the extraction process. However, these improvements did not fundamentally address the problem of how best to use shapelets to solve time series classification. Bagnall (2013) and other researchers demonstrated the importance of an integrated approach that separates data transformation from the classification algorithm. Lines, Davis (2012), and other researchers proposed the concept of shapelet transformation, which removed the restriction that shapelet classification must use decision trees. They used the distances between the original time series and the shapelets to convert the data into a new dataset, and then applied a generic classifier to it.
This article introduces the PAA time series representation and an efficient subsequence matching method into the shapelet classification algorithm, and proposes an improved shapelet transformation classification algorithm. The proposed algorithm preprocesses the original data with the PAA representation to reduce the data dimension, and then uses an efficient subsequence matching method to simplify the subsequence distance calculation during the extraction and transformation steps of the shapelet classification algorithm, which reduces computing complexity and improves efficiency. We make the following contributions: (1) we propose a shapelet transformation classification algorithm based on efficient subsequence matching; (2) we study how preprocessing the original time series with the PAA representation affects shapelet classification; (3) we carry out experiments on real datasets and validate that the proposed method is feasible and efficient; (4) we analyze the results obtained when a variety of common classifiers are applied to the shapelet-transformed data. This paper is organized as follows. Section 2 briefly provides the necessary definitions. Section 3 describes the proposed shapelet transformation classification algorithm based on efficient subsequence matching. Section 4 describes our experiments on public datasets, shows the experimental results, and presents our analysis and discussion of the results. Finally, Section 5 summarizes the paper.

Definitions and notation
The key terms are as follows.
Time series: A time series is a sequence of chronologically ordered real values obtained at regular intervals, $T = t_1, t_2, \ldots, t_m$, in which each $t_i$ is a real number and $m$ is the length of $T$.
Time series subsequence: A time series subsequence is a contiguous fragment of a complete series, $S = t_i, t_{i+1}, \ldots, t_{i+l-1}$, in which $l$ is the length of $S$ ($l < m$) and $i$ is the starting position of the subsequence.
Time series classification: A time series collection of size $n$ is $Q = \{T_1, T_2, \ldots, T_n\}$, in which each $T_i$ consists of $m$ real-valued attributes and a class label $c_i$, that is, $T_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,m}, c_i \rangle$. The task of time series classification is to assign a class label to each time series $T_i$.
Time series Euclidean distance: The Euclidean distance between two time series $S'$ and $T'$ of the same length $l$ is taken here as the sum of the squared differences of corresponding points, i.e., $dist(S', T') = \sum_{i=1}^{l} (s'_i - t'_i)^2$.

Shapelet transformation classification algorithm based on efficient subsequence matching
The shapelet transformation method is much more accurate than traditional classification algorithms. However, the high computational complexity of the optimal shapelet extraction process makes it very time consuming. Therefore, an efficient subsequence matching algorithm was introduced into the shapelet transformation method. The efficient subsequence matching algorithm screens coarsely first and finely second: it uses rough estimates to eliminate unnecessary calculations and obtain a set of possible matching subsequences, and then uses the DTW distance calculation method to accurately determine the final matching subsequence and its distance. Applying an efficient subsequence matching algorithm during the optimal shapelet extraction process can significantly reduce the complexity of the series distance calculation and ultimately improve classification efficiency.

PAA time series representation
PAA representation was applied to high-dimensional time series to achieve efficient storage and simplified computation. PAA representation is a general approximate representation method, which was proposed by Keogh (2011). It is useful for dimension reduction of time series, it offers relatively good indexing speed and flexibility, and it also removes a small amount of noise. As shown in Figure 1, PAA representation divides a time series into segments of the same fixed length and takes the average of each segment to approximately represent that segment and establish an index. PAA representation is determined by the compression ratio v of the time series (i.e., the segment length): the larger the v, the greater the dimension reduction and the more information is lost; conversely, the smaller the v, the smaller the dimension reduction and the higher the quality of the approximation. Therefore, when applying PAA representation, it is important to balance dimension reduction against approximation quality.
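The segment averaging described above can be illustrated with a minimal Python sketch. The function name `paa` and the handling of a trailing segment shorter than v (it is averaged over the points it actually contains) are assumptions made for illustration; the text does not say how an incomplete final segment is treated.

```python
from statistics import fmean


def paa(series, v):
    """Piecewise aggregate approximation with segment length v.

    Assumption: a trailing segment shorter than v is averaged over the
    points it actually contains.
    """
    if v < 1:
        raise ValueError("segment length v must be at least 1")
    # Cut the series into consecutive blocks of v points and average each block.
    return [fmean(series[i:i + v]) for i in range(0, len(series), v)]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(paa(series, 3))  # segments [1, 2, 3], [4, 5, 6], [7] -> [2.0, 5.0, 7.0]
```

With v = 1 the series is returned unchanged, and larger values of v shorten the output roughly by a factor of v, which is the trade-off between dimension reduction and information loss discussed above.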

Efficient subsequence matching algorithm
The most basic and most decisive part of time series data mining tasks is calculating the distance between time series and matching them based on their similarity. For a large number of high-dimensional, non-aligned time series, the commonly used distance calculations are computationally expensive, which makes them very time consuming even when the simple Euclidean distance is used. Vineetha Bettaiah et al. (2014) proposed an efficient time series subsequence matching method to solve this problem. The method ignores small fluctuations within the time series and identifies the crests and troughs that largely determine the overall shape of the time series. It treats local maximum and minimum points as the main breakpoints, segments the time series, performs a rough match before the actual distance computation to obtain possible matching series segments, and then computes the accurate distance value.
Algorithm 1: Efficient_subsequence_matching (T_1, T_2)
  (p_1, p_2, p_3, …, p_N) = Finding_Breakpoints (T_1);
  (q_1, q_2, q_3, …, q_M) = Finding_Breakpoints (T_2);
  A = Relational_Matrix (p_1, p_2, p_3, …, p_N);
  B = Relational_Matrix (q_1, q_2, q_3, …, q_M);
  C = Matching_Matrix (A, B);
  Matching_List = Matching_Breakpoints (C);
  return (Matching_List);
The algorithm first divides the time series into monotonically non-decreasing segments and monotonically non-increasing segments. It then treats the endpoints of each segment as local maxima or minima of the time series, and calculates the increment (or decrement) between adjacent extrema. It computes the average of these increments and decrements, selects the points whose absolute change exceeds the average as key breakpoints, and indexes each key breakpoint by its position in the time series and its value. It then checks the time series between adjacent local minimum points to ensure that nothing has been omitted, and obtains the final set of key segments. Figure 2 shows a time series partitioned by its key breakpoints and endpoints.
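As a rough illustration of the Finding_Breakpoints step, the following Python sketch locates the turning points of a series and keeps those adjacent to an above-average swing. It is not the implementation of Bettaiah et al.: the function names, the rule that an interior extremum is kept when either neighbouring swing exceeds the mean swing, and the decision to always keep both endpoints are assumptions made for illustration.

```python
def local_extrema(series):
    """Indices of both endpoints and of every point where the direction flips."""
    idx = [0]
    direction = 0  # +1 rising, -1 falling, 0 not yet known
    for i in range(1, len(series)):
        step = series[i] - series[i - 1]
        if step == 0:
            continue  # flat steps do not change the direction
        new_direction = 1 if step > 0 else -1
        if direction != 0 and new_direction != direction:
            idx.append(i - 1)  # the previous point was a crest or a trough
        direction = new_direction
    if len(series) > 1:
        idx.append(len(series) - 1)
    return idx


def key_breakpoints(series):
    """Keep extrema next to a swing larger than the mean swing (assumed rule)."""
    ext = local_extrema(series)
    if len(ext) < 3:
        return ext
    # Absolute increment or decrement between consecutive extrema.
    swings = [abs(series[b] - series[a]) for a, b in zip(ext, ext[1:])]
    mean_swing = sum(swings) / len(swings)
    keep = [ext[0]]
    for k in range(1, len(ext) - 1):
        if max(swings[k - 1], swings[k]) > mean_swing:
            keep.append(ext[k])
    keep.append(ext[-1])
    return keep


series = [0.0, 1.0, 0.8, 1.1, 0.0, 3.0, 0.0]
print(local_extrema(series))    # every turning point plus the endpoints
print(key_breakpoints(series))  # small ripples near the start are dropped
```

In this example the small fluctuations around the values 0.8 to 1.1 are discarded, while the trough and the large crest that shape the series are kept as key breakpoints.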
Create a set {p_1, p_2, p_3, …, p_N} of the key breakpoints extracted from time series T_1, and construct an N×N logical matrix A from this set. Each element a_ij of A is a vector from p_i to p_j that describes the relationship between p_i and p_j. Similarly, construct the M×M logical matrix B from the key breakpoints {q_1, q_2, q_3, …, q_M} of T_2. If the relationship between p_i and p_j within T_1 is similar to the relationship between q_l and q_k within T_2, then the logical vectors a_ij and b_lk are approximately the same. In this case, the segment from p_i to p_j may match the segment from q_l to q_k, with point p_i corresponding to q_l and point p_j corresponding to q_k.
Iterate through the vectors in matrices A and B to construct a matching matrix C, and compute the matching score of each pair of breakpoints in C. If element c_il of C is large, points p_i and q_l are most likely a match; if the value of c_jk is 0, points p_j and q_k are unlikely to be a match. The algorithm provides only a rough estimate and may produce false positives. It therefore requires a verification step after the matching process to remove false matches. It then determines the final matching points according to these values, calculates the accurate distance for each, and takes the minimum as the distance of the time series subsequence.

Shapelet transformation classification algorithm based on efficient subsequence matching
Shapelet transformation classification algorithms extract local characteristics of the time series, ignore data without obvious features, and classify using the distinguishing parts in place of the whole series. Shapelet transformation algorithms have greatly improved efficiency and accuracy, but the computational complexity of the shapelet extraction process is still high. For a dataset Q with n time series of length m, the number of candidate shapelets is $O(nm^2)$, and the complexity of computing the distances between one candidate and Q is also $O(nm^2)$; thus, the complexity of the entire shapelet extraction algorithm reaches $O(n^2 m^4)$. Therefore, shortening the time series or simplifying the distance calculation can effectively improve the efficiency of shapelet extraction. Accordingly, the PAA time series representation and an efficient subsequence matching algorithm were introduced to address these two factors and improve the efficiency of shapelet time series classification.
Since the original time series is long and its classification features may be reflected in only some segments, applying a common classifier directly produces results only slightly better than random guessing, which is of no practical value. Therefore, features are first extracted from a training set (shapelet extraction) to find the fragments of one class of time series that differ most from those of the other classes. When dealing with a new dataset, the shapelets are used to transform the original time series, and a common classifier is then built for classification. As shown in Figure 3, the marked part is a fragment of the series with strong distinguishing features, i.e., the optimal shapelet.

Standardization and dimension reduction of the original series
The experimental data may be on different scales, so it is necessary to standardize the series to ensure that matching is performed on the same scale and gives the best matching results. Then, the PAA representation described in Section 3.1 is used to reduce the dimension of the original data within an acceptable simplification range. Representing $T = t_1, t_2, \ldots, t_m$ with the PAA representation with segment length $v$, we get $\bar{T} = \bar{t}_1, \bar{t}_2, \ldots, \bar{t}_{m/v}$ with $\bar{t}_j = \frac{1}{v} \sum_{i=v(j-1)+1}^{vj} t_i$, wherein the segment length $v$ is the compression ratio. The PAA representation approximates the time series well and effectively reduces the dimension of the original time series.
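The standardization step can be sketched as follows. The text does not state which standardization is used; this sketch assumes z-normalization (zero mean, unit standard deviation), a common choice in shapelet work, and the function name `z_normalize` and the treatment of constant series are likewise assumptions.

```python
from statistics import fmean, pstdev


def z_normalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation.

    Assumption: a constant series, whose standard deviation is zero,
    is mapped to all zeros instead of raising an error.
    """
    mu = fmean(series)
    sigma = pstdev(series, mu)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


raw = [2.0, 4.0, 6.0]
print(z_normalize(raw))  # symmetric around zero; the middle value maps to 0.0
```

After this step, series recorded on different scales become comparable, so the PAA reduction and the subsequent distance calculations operate on values of the same magnitude.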

Shapelet candidate selection
Generally, the algorithm slides a window of each length in a specified range over the original time series to obtain all shapelet candidates. For a dataset containing n time series, $Q = \{T_1, T_2, \ldots, T_n\}$, the candidate set of shapelets is the union of the candidate sets of each series. Setting the shapelet length as $l$, we can obtain $(m - l) + 1$ shapelet candidates within a time series of length $m$. Let $W_{i,l}$ denote the set of standardized subsequences of length $l$ obtained from series $T_i$; then the set of all subsequences of length $l$ in dataset Q is: $W_l = W_{1,l} \cup W_{2,l} \cup \cdots \cup W_{n,l}$.
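The sliding-window enumeration of candidates can be sketched in Python as follows. The function names and the `(series_id, start, values)` tuple layout are hypothetical, and the sketch omits the standardization of each subsequence for brevity.

```python
def candidates(series, l):
    """All contiguous subsequences of length l: (m - l) + 1 of them."""
    return [series[i:i + l] for i in range(len(series) - l + 1)]


def all_candidates(dataset, min_len, max_len):
    """Union of the candidates of every series for every length in the range.

    Each candidate is returned as (series_id, start, values) so that its
    origin can be traced later, for example when removing overlaps.
    """
    out = []
    for series_id, series in enumerate(dataset):
        for l in range(min_len, min(max_len, len(series)) + 1):
            for start in range(len(series) - l + 1):
                out.append((series_id, start, series[start:start + l]))
    return out


dataset = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
print(len(candidates(dataset[0], 3)))       # (5 - 3) + 1 = 3
print(len(all_candidates(dataset, 3, 4)))   # (3 + 2) candidates per series = 10
```

The count grows quadratically with the series length when all lengths are considered, which is why shortening the series with PAA reduces the candidate set so sharply.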

Efficient series matching algorithms to extract the optimal shapelets
Because of the high computational requirements, the time series distance calculation generally uses the simple Euclidean distance metric. From Section 2, we take the minimum distance between S and all subsequences of length $l$ in $T_i$ as the distance between time series $T_i$ and shapelet S of length $l$, i.e., $dist(S, T_i) = \min_{W} dist(S, W)$, where $W$ ranges over all subsequences of length $l$ in $T_i$.
The task of shapelet extraction is to determine the most discriminative shapelets, so absolute accuracy of the subsequence distance is not required. We can therefore calculate the distance from shapelet S to every series in dataset Q with the efficient subsequence matching algorithm, which gives the distance set $D_S = \langle D_{S,1}, D_{S,2}, \ldots, D_{S,n} \rangle$ with $D_{S,i} = dist(S, T_i)$.
We need to assess shapelet quality to obtain the best shapelets for classification. The most common measures are information gain, the Kruskal-Wallis test, the F statistical test, and the Mood median test. We use the classification quality of each shapelet as an indicator to sort all shapelets and select the first $k'$ shapelets as the preliminary result.
The preliminary shapelets need further processing so that they represent the class characteristics of the time series more accurately and comprehensively. First, shapelets extracted from the same time series can overlap, which results in redundant computation. Thus, we filter the shapelets with an overlap threshold e to remove those that overlap too much with others. Second, to further reduce the number of shapelets, simplify the calculation, and increase the dissimilarity among shapelets, we cluster the shapelets into k clusters and select one shapelet from each cluster, so that the time series features are represented more comprehensively.
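The two steps described above, the shapelet-to-series distance and the removal of overlapping shapelets, can be sketched in Python as follows. This is an exact brute-force baseline, not the efficient matching algorithm itself. The function names and the `(series_id, start, length)` tuple layout are hypothetical, and the filter assumes the strictest rule (any overlap within the same series is rejected) rather than a tunable threshold e.

```python
def squared_euclidean(a, b):
    """Sum of squared differences of two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shapelet_distance(shapelet, series):
    """Minimum distance between the shapelet and any window of the series."""
    l = len(shapelet)
    if l > len(series):
        raise ValueError("shapelet is longer than the series")
    return min(
        squared_euclidean(shapelet, series[i:i + l])
        for i in range(len(series) - l + 1)
    )


def remove_overlapping(ranked):
    """Greedy filter over (series_id, start, length) tuples sorted best-first.

    Assumed rule: a candidate is dropped if it shares any position with an
    already kept candidate taken from the same series.
    """
    kept = []
    for sid, start, length in ranked:
        clash = any(
            sid == k_sid and start < k_start + k_len and k_start < start + length
            for k_sid, k_start, k_len in kept
        )
        if not clash:
            kept.append((sid, start, length))
    return kept


print(shapelet_distance([1, 2, 3], [0, 1, 2, 3, 4]))  # exact match exists -> 0
print(remove_overlapping([(0, 2, 4), (0, 4, 3), (1, 2, 4), (0, 6, 2)]))
```

The brute-force distance visits every window, which is the cost that the rough-then-fine matching strategy is intended to avoid; the sketch is useful mainly as a reference against which a faster matcher can be checked.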

Shapelet transformation of the original series
After the above steps, we obtained the final k shapelets. Then, the shapelets were used to transform the original series. Shapelet transformation converts the shapelet classification problem into a general classification problem, so that the solution is no longer restricted to a decision tree and can use a variety of common classifiers.
Shapelet transformation is achieved by calculating subsequence distances. For dataset Q, we calculated the distances from $T_i$ to the k shapelets, $D_{i,1}, D_{i,2}, \ldots, D_{i,k}$, where $D_{i,j} = dist(S_j, T_i)$. We created $P_i = \langle D_{i,1}, D_{i,2}, \ldots, D_{i,k} \rangle$ as a new instance, and collected $P_1, P_2, \ldots, P_n$ into a new dataset P, i.e., we transformed the dataset. In the new dataset P, the instance $P_i$ represents the original time series $T_i$, and each attribute (column) of the instance is associated with one shapelet. We used a common classifier to classify the new dataset P to determine the class of the original series.
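The transformation can be sketched in a few lines of Python. The function names are hypothetical, and the distance used here is the exact brute-force minimum over all windows with squared Euclidean distance; in the proposed algorithm this call would be replaced by the efficient subsequence matching step.

```python
def shapelet_distance(shapelet, series):
    """Minimum squared Euclidean distance over all windows of the series."""
    l = len(shapelet)
    return min(
        sum((x - y) ** 2 for x, y in zip(shapelet, series[i:i + l]))
        for i in range(len(series) - l + 1)
    )


def shapelet_transform(dataset, shapelets):
    """Row i holds the distances from series T_i to each of the k shapelets."""
    return [[shapelet_distance(s, t) for s in shapelets] for t in dataset]


dataset = [[0, 1, 2, 3], [3, 2, 1, 0]]  # one rising and one falling series
shapelets = [[1, 2], [2, 1]]            # a rising and a falling fragment
print(shapelet_transform(dataset, shapelets))  # [[0, 2], [2, 0]]
```

Each row of the result is an ordinary fixed-length feature vector, so the class labels of the original series can be attached to the rows and any general-purpose classifier can be trained on them.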

Computational Experiments
The experiments were conducted in a Java environment integrated with the Weka platform. The computer configuration was as follows: Windows 7, 8 GB of memory, Intel (R) Core (TM) i7-3770 CPU @ 3.40 GHz.
The experiments were designed to verify the feasibility of integrating the PAA representation and the efficient subsequence matching method into the shapelet transformation classification algorithm. The experiments consisted of the following steps:
1. To select appropriate parameters for the PAA representation, we applied two different time series classification methods: direct classification and the shapelet classification method based on the PAA representation. We completed ten-fold cross validation on the classification of the whole dataset with the Naive Bayes classifier and analyzed the run time and classification accuracy.
2. We applied conventional shapelet extraction based on the PAA representation, with and without efficient sequence matching, to the whole dataset, and compared the computational cost.
3. We completed train-test classification with SVM, logistic regression, C4.5 decision trees, random forests, and other general classification algorithms to verify the accuracy of the improved algorithm.

Test Data
Part of the experimental data consisted of five datasets from the UCR Time Series Database: ECGFiveDays, GunPoint, DiatomSizeReduction, Ham, and Herring. The rest came from the UCI series library shared by Professor Keogh's experiment team at the University of California, and comprised a total of 8 datasets of X-ray image contour series of human finger bones at different ages (infant, youth, juvenile). As shown in Table 1, these 13 public datasets were divided into training and test sets in the experiments. The experimental data was considered general and representative because the datasets cover a variety of series counts, series lengths, and numbers of classes.

Quality Evaluation of Shapelets Extraction
In early work, information gain was used as the indicator of shapelet extraction quality (Ye and Keogh 2011; Mueen and Keogh 2011). Information gain (IG) is an asymmetric measure of the difference between two probability distributions. In classification, information gain is calculated in terms of data properties, and can be used to measure how much information each property carries. As described in Section 3.3, based on the sorted distance set $D_S$, the quality of candidate series S can be evaluated by calculating the maximum information gain over every possible split point (sp). In contrast to information gain, KW, F-stat, and MM do not require $D_S$ to be explicitly split, and can significantly reduce the time overhead. Jon Hills et al. (2014) demonstrated that on most time series classification datasets, F-stat performed better than the other indicators in both classification accuracy and time consumption when used for shapelet quality evaluation. They suggested, "The F-stat should be the default choice for shapelet quality." The F statistic is used for testing hypotheses on the difference of means in a dataset consisting of samples from C classes. The value of the test statistic indicates the ratio of the variability between groups to the variability within groups. The greater the value, the greater the difference between groups and the smaller the difference within a group. High-quality shapelets have small distances to members of their own class and large distances to members of other classes. Therefore, shapelets with good classification quality generate larger F-stat values. The distances in $D_S = \langle D_{S,1}, D_{S,2}, \ldots, D_{S,n} \rangle$ are grouped by class, so that $D_i$ contains all distances between the candidate shapelet S and the time series in class $i$.
Then, the F-stat for quality evaluation of shapelet S is: $F = \frac{\sum_{i=1}^{C} (\bar{D}_i - \bar{D})^2 / (C - 1)}{\sum_{i=1}^{C} \sum_{d \in D_i} (d - \bar{D}_i)^2 / (n - C)}$, where $C$ is the number of classes, $n$ is the number of time series, $\bar{D}$ is the overall mean of $D_S$, and $\bar{D}_i$ is the average distance from the shapelet to all time series in class $i$.
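The F-stat quality measure can be sketched in Python as follows. The sketch assumes the usual form of the statistic in the shapelet literature: the sum of squared deviations of the class means from the overall mean, divided by C - 1, over the sum of squared deviations of each distance from its class mean, divided by n - C. The function name `f_stat` and the choice to return infinity when the within-class term is zero are assumptions made for illustration.

```python
from collections import defaultdict


def f_stat(distances, labels):
    """F statistic of shapelet-to-series distances grouped by class label."""
    groups = defaultdict(list)
    for d, c in zip(distances, labels):
        groups[c].append(d)
    n, num_classes = len(distances), len(groups)
    if num_classes < 2 or n <= num_classes:
        raise ValueError("need at least two classes and more series than classes")
    overall_mean = sum(distances) / n
    class_means = {c: sum(g) / len(g) for c, g in groups.items()}
    # Between-class variability: spread of the class means around the overall mean.
    between = sum((m - overall_mean) ** 2 for m in class_means.values()) / (num_classes - 1)
    # Within-class variability: spread of each distance around its own class mean.
    within = sum(
        (d - class_means[c]) ** 2 for c, g in groups.items() for d in g
    ) / (n - num_classes)
    if within == 0:
        return float("inf")  # assumption: perfectly tight classes score highest
    return between / within


distances = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(f_stat(distances, labels))  # class means 2 and 8, overall mean 5 -> 18.0
```

A shapelet whose distances separate the classes cleanly, as in this example, receives a large value, while a shapelet whose distances are mixed across classes receives a value near zero.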

PAA compression ratio selection
The compression ratio of the PAA representation directly affects both the degree of reduction and the integrity of the information in the time series. The features of the time series need to be preserved as much as possible for classification. Therefore, both the degree of simplification and the accuracy should be considered when selecting the compression ratio. The following experiments were conducted to select an appropriate compression ratio. Experiment 1: We selected 100 shapelets with a length of 5-30 and a compression ratio of 1-5 in the PAA representation. We analyzed classification accuracy based on the ROC curve and the area under it (AUC). Figure 4 shows the results of a representative experiment on the DP_Middle dataset. Figure 4(a) shows the ROC curve obtained by direct classification without shapelet extraction. Figure 4(b-f) shows the classification results after applying shapelet extraction and the PAA representation. Table 2 shows the detailed AUC values and the corresponding run times.
The AUC value in Figure 4(a) was about 0.62, which was only slightly better than random guessing. This is because the segments that carry identifying characteristics are only a small part of the entire time series, and in direct classification it is difficult to pick them out among other influencing factors such as noise. As a result, the time series cannot be accurately classified. The AUC values in Figure 4(b-f) were all clearly higher, which shows that shapelet extraction can significantly improve time series classification accuracy. They gradually decreased from 0.89 to 0.77 while the run times decreased from 72.5 hours to 1.4 hours, because a larger compression ratio brings greater dimension reduction and greater information loss. In other words, as v increases, run time decreases and classification accuracy gradually decreases. After analysis and comparison, v = 3 gave a favorable balance between run time and accuracy, so the following experiments were carried out with v = 3.

Shapelet classification algorithm based on efficient sequence matching
The following experiments were designed to validate that the new algorithm, based on the PAA representation and the efficient subsequence matching method, is feasible for optimizing shapelet extraction and significantly reduces its computational cost.
Experiment 2: We selected 100 shapelets with a length of 5-30 and a compression ratio of 3 in PAA representation. We applied conventional shapelets extraction, shapelets extraction combined with PAA representation, and shapelets extraction based on both PAA representation and efficient sequence matching, and recorded the run times. Table 3 shows the results.
As Table 3 shows, for all of the time series datasets involved in the experiment, using the PAA representation and efficient subsequence matching in shapelet extraction significantly improved computational efficiency. The shapelet extraction process on the ECGFiveDays dataset was accelerated 21.3 times, and on the remaining datasets by about 28-32 times. Fixed overheads such as preparation time are unavoidable; because the ECGFiveDays dataset is small, these overheads took up a larger share of its run time and lowered its measured speedup. For the remaining, larger datasets, this overhead was negligible.
Experiment 3: We selected 100 shapelets with lengths of 5-30 and a compression ratio of 3 in the PAA representation to complete the standard "train-test" classification. First, we applied optimal shapelet extraction to the training datasets; then, we used the shapelets to convert the training datasets, and used SVM, logistic regression, C4.5 decision trees, random forests, and other general classification algorithms to classify the converted datasets. The classification accuracy is shown in Table 4.
These classification algorithms achieved good accuracy on the converted datasets. The AUC values were generally 0.7 or more, and the best classification algorithm for each dataset reached an AUC of 0.85 or more on every dataset except Ham. The accuracy on the Ham dataset was relatively low because its series are highly similar. Figure 5 compares the accuracy of the different classification algorithms on the different datasets: SVM and random forest performed better on the smaller time series datasets. As dataset size increased, the logistic regression algorithm surpassed the other algorithms and achieved the highest accuracy, while the SVM classifier still performed well. Overall, the accuracies of the C4.5 decision tree and the KNN classification algorithm were relatively low, while the SVM classifier produced the best classification results. In summary, combining the PAA representation with the efficient sequence matching algorithm improves the efficiency of the shapelet transformation classification algorithm and reduces its run time. The improved algorithm also adapts well to different classifiers: it maintained high classification accuracy with various classifiers, among which SVM, logistic regression, and random forests performed relatively better.

Conclusions
In this paper, we proposed an improved shapelet transformation classification algorithm that integrates the PAA representation with an efficient sequence matching algorithm. The improved algorithm effectively addressed the time consumption of the optimal shapelet extraction process, greatly improved computational efficiency, and classified high-dimensional time series efficiently and accurately. We performed experiments on 13 experimental datasets. The results showed that the improved shapelet classification algorithm is generally feasible and achieves good classification results on time series of different types and sizes. Future work will examine ways to further improve subsequence matching speed, seek dimension reduction methods that perform better than PAA, and analyze the adaptability of various classifiers to shapelet classification.

Funding Information
We are grateful to Tony from the University of East Anglia, who provided helpful shapelet code and data. This work was financially supported by the National Youth Science Foundation of China (No. 61503272) and the Scientific and Technological Project of Shanxi Province of China (No. 201603D221037-2).

Competing Interests
The authors have no competing interests to declare.