Shapelet classification algorithms are an accurate classification method for time series data. Existing shapelet classifying processes are relatively inefficient and slow due to the large amount of necessary complex distance computations. This paper therefore introduces piecewise aggregate approximation(PAA) representation and an efficient subsequence matching algorithm for shapelet classification algorithms; the paper also proposes shapelet transformation classification algorithm based on efficient series matching. First, the proposed algorithm took the PAA representation for appropriate dimension reduction, and then used a subsequence matching algorithm to simplify the data classification process. The research experimented on 14 public time series datasets taken from UCI and UCR, used the original and new algorithm for classification, and compared the efficiency and accuracy of the two methods. Experimental results showed that the efficient subsequence matching algorithm could be combined with the shapelet classification algorithm; the new algorithm could ensure relatively high classification accuracy, effectively simplified the algorithm calculation process, and improved classification efficiency.

As a type of high-dimensional massive data, time series are common in fields such as meteorology, finance, geology, medicine, electronic information, and network security. They are also a major research subject in data mining (

Initially, the research staff used the nearest neighbor algorithm to process time series classifications (

Researchers have been working to solve the above problem with a new classification algorithm that better solves time series classification problems. Ye, Keogh (

This article introduces PAA time series representation and an efficient subsequence matching method in the shapelet classification algorithm, and proposes an improved shapelet conversion classification algorithm. The proposed algorithm preprocesses the original data with a PAA time series representation to reduce data dimensions, and then uses highly efficient subsequence matching methods to simplify the subsequence distance calculation during the extraction and conversion processes of the shapelet classification algorithm to reduce computing complexity and improve efficiency. We made the following contributions: (1) We proposed a shapelet conversion classification algorithm based on highly efficient subsequence matching; (2) We studied the impact of PAA representation to process the original time series on shapelet classification; (3) We carried out experiments on real datasets and validated that the proposed method is feasible and efficient; (4) We analyzed the results using a variety of common classifiers to convert shapelet classification data.

This paper is organized as follows. Section 2 briefly provides necessary definitions. Section 3 describes the proposed shapelet conversion classification algorithm based on highly efficient subsequence matching. Section 4 includes our experiment on a public dataset, shows the experimental results, and presents our analysis and discussion of the results. Finally, Section 5 summarizes the paper.

The key terms are as follows:

_{1}, _{2},…, _{m}, in which _{i}

_{1}, _{2},…, _{n}_{i}

The task of time series classification is to classify the time series of _{i}

_{0} and _{0} that are the same length is the sum of corresponding square dot difference, i.e.,

The shapelet transformation method is much more accurate than traditional classification algorithms. However, the high computational complexity of the optimal shapelet extraction process is very time consuming. Therefore, the efficient subsequence matching algorithm was introduced to the shapelet transformation method. The efficient subsequence matching algorithm applies the strategy of roughly screening first, then finely screening second, which eliminates unnecessary calculations based on rough estimates to obtain a set of possible matching subsequence. Then, it uses the DTW distance calculation method to accurately calculate the final matching subsequence and the distance. Applying an efficient subsequence matching algorithm during the optimal shapelet extraction process can significantly reduce series distance calculation complexity and ultimately improve algorithm classification efficiency.

PAA representation was applied to high-dimensional time series to achieve efficient storage and simplified computation. PAA representation is a general approximation representation method, which was proposed by Keogh (2011). It is useful for dimension reduction of time series, it has relatively good indexing speed and flexibility, and it also slightly de-noises. As shown in Figure

PAA representation of time series.

PAA representation is determined by the time series’ compression ratio

The most basic but deterministic part of time series data mining tasks is calculating the distance between the time series and matching based on their similarities. The commonly used methods for calculating the distance for a large number of high-dimensional, non-aligned time series are very computationally complex, which means that they are very time consuming despite simple Euclidean distance. Vineetha Bettaiah et al. (

(_{1}, _{2}, _{3}, …, _{N}_{1});

(_{1}, _{2}, _{3}, …, _{M}_{2});

_{1}, _{2}, _{3}, …, _{N}

_{1}, _{2}, _{3}, …, _{M}

Matching_List = Matching_Breakpoints (

return (Matching_List);

The algorithm first divides the time series into monotonous non-decreasing segments and monotonous non-increasing segments. It then treats each endpoint segment as the local minimum or minimum value of each time series, and calculates based on the increment (decrement) after the maximum value. It calculates the average increment or decrement value of the corresponding maximum value, selects the points with absolute values above the average as key breakpoints, and then creates indexes with its corresponding series number in the time series and point value. It then checks and gets the time series between the adjacent local minimum value points to ensure no omissions exist, and gets the final set of key segments. As shown in Figure

Subsequence matching section.

Create a set {_{1}, _{2}, _{3},…, _{N}_{1}, and construct a N*N logical matrix _{ij}_{i}_{j}_{i}_{j}_{1}, _{2}, _{3},…, _{M}_{i}_{j}_{1} is similar to the relationship between _{l}_{k}_{2}, then the logical vector _{ij}_{lk}_{i}_{j}_{l}_{k}_{i}_{l}, p_{j}_{k}

Iterate through vectors in matrices _{il}_{i}_{l}_{jk}_{j}_{k}

Shapelet conversion classification algorithms extract the local time series characteristics, ignore data without obvious features, and replace overall data with distinguishing parts to classify. Shapelet conversion algorithms have greatly improved efficiency and accuracy, but the computational complexity of the shapelet extraction process is still high. For a dataset ^{2}), and the computation complexity for the distance of each shapelet and ^{2}), thus, the complexity of the entire shapelet extraction algorithm reaches ^{2}^{4}). Therefore, shortening the time series length or simplifying the calculation distance can effectively improve the shapelet extraction algorithm efficiency. So, the PAA time series representation and an efficient subsequence matching algorithm were correspondingly introduced to improve shapelet time series classification efficiency.

Since the original time series is too long and its classification features may only be reflected in some segments, using a common classifier will produce results only slightly better than random guessing, which provides no practical value. Therefore, features are extracted in a training set, namely shapelets extraction, to extract a class of time series that is most different from other fragment types. When dealing with the new dataset, the shapelets are used to transform the original time series, and then build a common classifier for classification. As shown in Figure

Time series shapelet.

Scaling may be different in the experimental data, so it is necessary to standardize to ensure that matching is performed in the same dimension to achieve the best matching results. Then, use the PAA representation mentioned in section 3.1 to perform dimension reduction to the original data within an acceptable simplification range. To represent _{1}, _{2},…, _{m}

Generally, the algorithm iterates original time series with a specified range with a sliding window algorithm to obtain all shapelet candidates. For a time series containing _{1}, _{2},…, _{n}_{i,l}

Then, all candidate shapelets set within

Set

Due to high computation requirements, the time series distance calculation generally uses a simple Euclidean distance metric. From Section 2, we know that we can take the minimum distance of _{i}_{i}

Shapelet extraction tasks determine the most distinguished shapelets. Thus, absolute subsequence distance accuracy is not required. We can calculate the distance of shapelet

We need to assess shapelet quality to obtain the best classification shapelets. The most common methods are information gain, the Kruskal-Wallis test, the F statistical test, and the Mood median test. We use the classification quality of each shapelet as an indicator to sort all shapelets and select the first _{0} shapelets as the preliminary results.

We need to process the preliminary shapelets to make shapelets more accurately and comprehensively represent the time series class characteristics. First, there could be overlapping shapelets when they are extracted from the same time series, resulting in redundant computation.

Thus, we need to filter the series with an overlapping exponent

After the above steps, we obtained the final

Shapelet transformation is achieved by calculating the subsequence distance. For dataset _{i}_{i}_{,1}, _{i}_{,2},…, _{i,k}_{i,k}_{k}, T_{i}_{i}_{i}_{,1}, _{i}_{,2}, …, _{i,k}_{1}, _{2},…, _{n}_{i}_{i}

_{i} in

_{i} = PAA (_{i}, v

_{i,l}_{i}, l

_{i,l}

Matching_List = Efficient_sunseries_matching (

_{s}

_{s}_{s}

shapelets.add (_{s}

shapelets = Taking_First__{0}(Reorder (shapelets, _{s}

shapelets = Filter_Selfsimilar (shapelets);

k_shapelets = Cluster (shapelets,

Classification_Result = General classification (

return (Classification_Result);

The experiments were conducted in the Java environment integrating with the Weka platform. The computer’s configurations were as follows: Windows 7, 8G memory, Intel (R) Core (TM) i7-3770 CPU @ 3.40 GHz.

The experiments were designed to verify the feasibility of integrating the PAA representation and efficient subsequence matching method into the shapelets conversion classification algorithm. The experiments consisted of the following steps:

To select the appropriate parameters of PAA Representation, we applied two different time series classification methods, including direct classification and the shapelet classification method based on PAA Representation. We completed ten-fold cross validation on the classification of the whole dataset with the Naive Bayes classifier and analyzed the runtime and classification accuracy.

We applied conventional shapelet extraction based on PAA Representation with and without efficient sequence matching to process the whole dataset respectively, and compare the computation complexity.

We completed train-test classification with SVM, logistic regression, C4.5 decision trees, random forests, and other general classification algorithms to verify the improved algorithm’s accuracy.

Part of the experimental data consisted of five datasets from the UCR Time Series Database including ECGFiveDays, GunPoint, DiatomSizeReduction, Ham, and Herring. The rest comes from UCI series library shared by Professor Keogh’s experiment team at the University of California, which included a total of 8 datasets of the X-ray image contour series of human finger bones at different ages (infant, youth, juvenile). As shown in Table

Test data.

Datasets | Partition | Instances(train/test) | Length | Number(classes) |
---|---|---|---|---|

ECGFiveDays | Train/Test | 23/861 | 136 | 2 |

GunPoint | Train/Test | 50/150 | 150 | 2 |

DiatomSizeReduction | Train/Test | 16/306 | 345 | 4 |

Ham | Train/Test | 109/105 | 431 | 2 |

Herring | Train/Test | 64/64 | 512 | 2 |

DP_Little | Train/Test | 400/645 | 250 | 3 |

DP_Middle | Train/Test | 400/645 | 250 | 3 |

DP_Thumb | Train/Test | 400/645 | 250 | 3 |

MP_Little | Train/Test | 400/645 | 250 | 3 |

MP_Middle | Train/Test | 400/645 | 250 | 3 |

PP_Little | Train/Test | 400/645 | 250 | 3 |

PP_Middle | Train/Test | 400/645 | 250 | 3 |

PP_Thumb | Train/Test | 400/645 | 250 | 3 |

In the early stage, information gain was characterized as the indicator of shapelets extraction quality (_{s}

Relative information gain using KW, F-stat, and MM does not need clearly segmented _{s}

The F statistic is used for testing hypotheses on the mean difference of the dataset consisting of C class samples. The statistical value of the hypothesis test indicated the difference proportion within and between groups. The greater the statistical value, the greater the difference between groups and the smaller the difference within a group. High-quality shapelets have smaller distances to inner class members, and have larger distances to members outside the class. Therefore, shapelets with a good classification quality will generate greater F-stat values. For _{s}_{s}_{,1} _{s}_{,2},…,_{s,n}_{i}

_{i}

The PAA representation compression ratio directly affects the reduction degree and the time series information integrity. The time series features need to be reserved as much as possible in classification. Therefore, both the simplification and accuracy degree should be considered in the compression ratio selection. The following experiments were conducted to select the appropriate compression ratio.

Figure

The ROC curve under different compression ratio.

Computing time and the value of AUC.

Value of v | Computing time (s) | The value of AUC |
---|---|---|

– | – | 0.6152 |

v = 1 | 261097 | 0.8949 |

v = 2 | 55889 | 0.8558 |

v = 3 | 20790 | 0.8418 |

v = 4 | 9970 | 0.8106 |

v = 5 | 5193 | 0.7671 |

The AUC value in Figure

Therefore, shapelet extraction can significantly improve time series classification accuracy. As

The following experiments were designed to validate the feasibility of the new algorithm based on PAA representation and the efficient subsequence matching method on shapelet extraction optimization and significant computational complexity reduction.

Comparison of computing time (s) between the improved and original algorithm.

Datasets | Traditional shapelet | Shapelet extract with PAA | Shapelet extract with PAA and efficient subsequence matching | Upgrade multiples of computing speed |
---|---|---|---|---|

ECGFiveDays | 32 | 3.6 | 1.5 | |

GunPoint | 195 | 16.4 | 6.7 | 29.1 |

DiatomSizeReduction | 1334 | 128 | 46.4 | 28.75 |

Ham | 6211 | 577 | 204 | 30.44 |

Herring | 4873 | 365 | 151 | |

DP_Little | 37541 | 3057 | 1287 | 29.17 |

DP_Middle | 38378 | 3106 | 1324 | 28.98 |

DP_Thumb | 38332 | 3096 | 1318 | 29.08 |

MP_Little | 38454 | 3122 | 1357 | 28.34 |

MP_Middle | 37661 | 3084 | 1306 | 28.84 |

PP_Little | 38339 | 3155 | 1388 | |

PP_Middle | 37854 | 3088 | 1315 | 28.79 |

PP_Thumb | 38287 | 3135 | 1373 | 27.89 |

From Table

General classifier Accuracy value using improved algorithm.

Datasets | C4.5 Decision Tree | Logistic Regression | SVM | Random Forests | KNN | Naïve Bayesian |
---|---|---|---|---|---|---|

ECGFiveDays | 0.9334 | 0.9413 | 0.9512 | 0.9566 | ||

GunPoint | 0.9323 | 0.9411 | 0.9025 | 0.9364 | ||

DiatomSizeReduction | 0.8324 | 0.8847 | 0.8522 | 0.8913 | ||

Ham | 0.7987 | 0.8214 | 0.8327 | 0.8185 | ||

Herring | 0.8668 | 0.8843 | 0.8992 | 0.9058 | ||

DP_Little | 0.7445 | 0.8336 | 0.7525 | 0.8425 | ||

DP_Middle | 0.7300 | 0.8377 | 0.7356 | 0.8418 | ||

DP_Thumb | 0.7364 | 0.8324 | 0.7412 | 0.8455 | ||

MP_Little | 0.7544 | 0.8367 | 0.7664 | 0.8441 | ||

MP_Middle | 0.7468 | 0.8654 | 0.8552 | 0.7630 | ||

PP_Little | 0.7568 | 0.8651 | 0.7811 | 0.8667 | ||

PP_Middle | 0.7633 | 0.8787 | 0.8600 | 0.7798 | ||

PP_Thumb | 0.7618 | 0.8631 | 0.7744 | 0.8725 |

These classification algorithms showed good performance in converted dataset classification accuracy. The AUC values were generally 0.7 or more. The optimal classification algorithm can even make the AUC values be 0.85 or more on datasets except Ham. The accuracy of the Ham dataset was relatively low due to high data similarity. As shown in Figure

Accuracy comparison with different classifiers.

As discussed above, combined with the PAA representation and efficient sequence matching algorithm, the efficiency of shapelets conversion classification algorithm can be improved, and run time can be reduced. The improved shapelets conversion classification algorithm had better adaptability. It kept high classification accuracy with various classifiers, in which SVM, logistic regression, and random forests integrating with efficient sequence matching have relatively better performance.

In this paper, we proposed improved shapelet conversion classification algorithm, which integrated PAA representation with efficient sequence matching algorithms. The improved algorithm effectively solved time consumption problems in the optimal shapelet extraction process, greatly improved computational efficiency, and efficiently and accurately classified the high-dimensional time series e. We performed experiments on 13 experimental datasets. The results showed that the improved shapelets classification algorithm had general feasibility in achieving better classification results in different time series types and magnitudes. Future work would examine ways to further improve subsequence-matching speed, seek better methods for dimension reduction instead of PAA notation, and analyze the adaptability of various classifiers on shapelets classifications.

We are grateful to Tony from the University of East Anglia, who provided helpful shapelet code and data. This work was financially supported by the National Youth Science Foundation of China (No.61503272), Scientific and technological project of Shanxi province of China (No.201603D221037-2).

The authors have no competing interests to declare.