DISCOVERY OF TELECONNECTIONS USING DATA MINING TECHNOLOGIES IN GLOBAL CLIMATE DATASETS

In this paper, we apply data mining technologies to a 100-year global land precipitation dataset and a 100-year Sea Surface Temperature (SST) dataset. We discover a number of interesting teleconnections, including well-known patterns and patterns that are, to the best of our knowledge, previously unknown, such as teleconnections between abnormally low temperature events in the North Atlantic and floods in Northern Bolivia, and between abnormally low temperatures off the Venezuelan Coast and floods in Northern Algeria and Tunisia. In particular, we use a high-dimensional clustering method and a method that mines episode association rules in event sequences. The former is used to cluster the original time series datasets into coarser spatial units, and the latter is used to discover teleconnection patterns among the event sequences generated by the clustering method. To verify our method, we also run experiments on the SOI and a 100-year global land precipitation dataset and find many well-known teleconnections, such as those between low-SOI events and drought events in Eastern Australia, South Africa, and North Brazil, and between low-SOI events and flood events in the middle and lower reaches of the Yangtze River. We also run exploratory experiments to help domain scientists discover new knowledge.


INTRODUCTION
In recent years, the development of information technology has caused the amount of data to grow explosively. In particular, earth science data has been accumulating rapidly with the development of modern satellites, remote sensing technologies, and other data acquisition systems. Traditional methods for analyzing earth science data no longer suffice. The main statistical methods, such as RPCA (Rotated Principal Component Analysis) and SVD (Singular Value Decomposition), have been used to discover teleconnection patterns. However, they do not scale with the growth of the data. A data mining method that discovers episode association rules in a long event sequence can be applied to discovering relationships among events. This paper introduces an association data mining method that discovers teleconnection patterns, and reports experiments on real global earth science data that find many well-known and previously unknown patterns. For example, an abnormally low sea surface temperature (SST) in the Eastern Pacific or an abnormally high SST in the Western Pacific coincides with abnormally high precipitation in Shanxi, and an abnormally low SST in the Northern Pacific coincides with abnormally low precipitation in Finland. The experimental results demonstrate the feasibility and efficiency of this method.

DATA MINING METHOD
In this section, we describe the data mining method used to discover teleconnections in global climate datasets. Figure 1 illustrates the steps.

Preprocessing
It is necessary to preprocess earth science data because data from different grids is not directly comparable. Preprocessing earth science data means removing periodicity and handling data with different time and space intervals.
The first preprocessing method is standardization, which gives all variables the same weight (Han & Kamber, 2001). We standardize the variables in every grid cell.
The second preprocessing method is removing periodicity. Some patterns in earth science data are well known and may be important. For example, seasonal and yearly variations are very common; if we do not preprocess the data, data mining may return seasonal patterns instead of other interesting patterns, because the latter may be less obvious than the former. Therefore, we use a monthly z-score (Tan, et al., 2001) and a moving average to remove periodicity.
The third preprocessing method is transforming data of different granularities. Because of different applications and different data collection methods, dissimilar space and time intervals are common in earth science datasets. Therefore, we must transform the data into uniform spatial and temporal intervals before we analyze them.
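The periodicity-removal step can be illustrated with a minimal sketch. It assumes a monthly series for one grid cell that starts in January; the function names and the 12-month window are illustrative assumptions, not values taken from this paper.

```python
from statistics import mean, pstdev

def monthly_zscore(series):
    """Standardize each value against its own calendar month.

    Assumes `series` holds monthly values starting in January, so index
    i belongs to calendar month i % 12. Every January is compared only
    with other Januaries, which removes the seasonal cycle.
    """
    z = [0.0] * len(series)
    for month in range(12):
        idx = range(month, len(series), 12)
        values = [series[i] for i in idx]
        if not values:
            continue  # series shorter than one year
        mu, sigma = mean(values), pstdev(values)
        for i in idx:
            # A constant calendar month carries no anomaly signal.
            z[i] = 0.0 if sigma == 0 else (series[i] - mu) / sigma
    return z

def moving_average(series, window=12):
    """Simple moving average; element i averages series[i : i + window].

    The window length is an assumed parameter.
    """
    return [mean(series[i:i + window])
            for i in range(len(series) - window + 1)]
```

Applying `monthly_zscore` first and `moving_average` second yields a smoothed anomaly series in which a purely seasonal signal is flattened to zero.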

Spatial clustering
Before extracting extreme events, we need to cluster the earth science data, for two reasons. First, because the amount of data is huge, the number of events is very large, which makes analysis difficult. After clustering, we can focus only on events of the same type. Second, spatial autocorrelation distorts the results because it produces many similar association rules, most of which are not interesting.
In this paper, we use the SNN algorithm (Ertoz, et al., 2001) to cluster earth science data. This algorithm uses a new definition of similarity. It first finds the nearest neighbors of each data point and then redefines the similarity between a pair of points as the number of nearest neighbors the two points share. Using this definition of similarity, the SNN algorithm identifies core points and then builds clusters around them. SNN can handle data sets of different sizes, shapes, and densities, as well as those with high dimensionality.
Because the run-time complexity of the SNN algorithm is O(n^2), the computing cost becomes huge when the data sets are large. We focus on this cost and analyze the composition of the nearest-neighbor lists, which confirms the spatial autocorrelation of the data. We can take advantage of this spatial autocorrelation to reduce the time complexity.
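The shared-nearest-neighbor similarity described above can be sketched as follows. This is a brute-force illustration under stated assumptions: Euclidean distance between the series, a mutual-neighbor condition before counting shared neighbors, and the parameter names `k`, `eps`, and `min_pts` are all choices made for the example rather than settings reported here. Cluster construction around the core points is omitted.

```python
import math

def k_nearest(points, k):
    """Return, for each point, the index set of its k nearest neighbors.

    Brute force, O(n^2) distance evaluations; Euclidean distance is an
    assumption made for this sketch.
    """
    neighbors = []
    for i, p in enumerate(points):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        neighbors.append(set(order[:k]))
    return neighbors

def snn_similarity(neighbors, i, j):
    """Number of nearest neighbors shared by points i and j.

    Assumption: the similarity is zero unless i and j appear in each
    other's nearest-neighbor lists.
    """
    if i not in neighbors[j] or j not in neighbors[i]:
        return 0
    return len(neighbors[i] & neighbors[j])

def core_points(neighbors, eps, min_pts):
    """Points having at least min_pts neighbors with similarity >= eps."""
    core = []
    for i in range(len(neighbors)):
        density = sum(1 for j in neighbors[i]
                      if snn_similarity(neighbors, i, j) >= eps)
        if density >= min_pts:
            core.append(i)
    return core
```

For example, with two well-separated groups of four points each and `k=3`, points in the same group share two neighbors, while points in different groups have similarity zero.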

Extracting an extreme event episode
After clustering the earth science data, we need to extract space-time extreme events and build the episodes. Earth scientists judge whether an extreme event has occurred by considering the following factors: its rarity, in other words, the frequency of the event; the magnitude of its numerical value; the spatial-temporal range it influences; its difference from the average value; and its influence on society.
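As a rough illustration of event extraction, the sketch below labels a month as an extreme event when a cluster's standardized anomaly exceeds a fixed threshold. It covers only the magnitude and deviation-from-average factors listed above; the threshold of 1.5, the `HIGH`/`LOW` labels, and the cluster identifiers are hypothetical.

```python
def extract_events(anomalies, cluster_id, threshold=1.5):
    """Turn one cluster's anomaly series into (time, event_type) pairs.

    `anomalies` is assumed to be a standardized (z-score) monthly series;
    `threshold` is an illustrative cutoff, not a value from the paper.
    """
    events = []
    for t, value in enumerate(anomalies):
        if value >= threshold:
            events.append((t, f"{cluster_id}:HIGH"))
        elif value <= -threshold:
            events.append((t, f"{cluster_id}:LOW"))
    return events

def build_sequence(series_by_cluster, threshold=1.5):
    """Merge the events of all clusters into one time-ordered sequence."""
    sequence = []
    for cluster_id, anomalies in series_by_cluster.items():
        sequence.extend(extract_events(anomalies, cluster_id, threshold))
    return sorted(sequence)
```

The resulting time-ordered list of `(time, event_type)` pairs is the kind of event sequence that episode mining takes as input.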


Episode association mining
Sequences of events are a common form of data that contains important knowledge. Episodes are patterns in event sequences, in other words, combinations of events with a partially specified order. The algorithm we use (Mannila & Toivonen, 1996) is based on minimal occurrences of episodes. First, we find the frequent simple episodes of size k; then we form candidate episodes of size k+1. Next, we check the candidate episodes for frequency, and we repeat these steps until all frequent episodes are found. Using this algorithm, we can form association rules and compute their confidence and support.
Considering the characteristics of the earth science domain, we made two improvements. First, it takes some time for one phenomenon to influence another. Therefore, we take time delay into consideration.
Second, because of spatial autocorrelation, we found many useless or uninteresting rules, which often swamped the interesting ones. To avoid this, we added spatial restrictions so that our algorithm can focus on interesting rules.
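To make the notion of a minimal occurrence concrete, the sketch below handles only the smallest non-trivial case: a serial episode of two event types, a followed by b. An interval is minimal when no shorter interval inside it also contains the episode. The function names, the `max_width` bound, and the simplified confidence (the share of a events that start a minimal occurrence) are assumptions for illustration and do not reproduce the full level-wise algorithm.

```python
from bisect import bisect_right

def times_of(sequence, event_type):
    """Sorted distinct times at which event_type occurs."""
    return sorted({t for t, e in sequence if e == event_type})

def minimal_occurrences(sequence, a, b, max_width):
    """Minimal occurrences (ta, tb) of the serial episode a -> b.

    (ta, tb) is minimal when tb is the first b after ta and no other a
    lies strictly between them. Assumes a != b.
    """
    ta_list, tb_list = times_of(sequence, a), times_of(sequence, b)
    result = []
    for ta in ta_list:
        k = bisect_right(tb_list, ta)       # first b strictly after ta
        if k == len(tb_list):
            continue
        tb = tb_list[k]
        nxt = bisect_right(ta_list, ta)     # next a after ta, if any
        if nxt < len(ta_list) and ta_list[nxt] < tb:
            continue                        # a shorter interval exists
        if tb - ta <= max_width:
            result.append((ta, tb))
    return result

def rule_stats(sequence, a, b, max_width):
    """Support and a simplified confidence for the rule a => b."""
    n_a = len(times_of(sequence, a))
    support = len(minimal_occurrences(sequence, a, b, max_width))
    confidence = support / n_a if n_a else 0.0
    return support, confidence
```

In a full implementation, candidate episodes of size k+1 would be built only from frequent episodes of size k, and their minimal occurrences derived from those of their sub-episodes.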
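The two domain-specific refinements can be sketched as follows. The exact form of the spatial restriction is not specified here, so the sketch assumes one plausible choice: only cluster pairs whose centroids are at least `min_distance_km` apart are considered. The lag range, the centroid dictionary, and all function names are likewise hypothetical, and event times are assumed to be integer month indices.

```python
import math

def great_circle_km(p, q):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def lagged_confidence(cause_times, effect_times, min_lag, max_lag):
    """Share of cause events followed by an effect min_lag..max_lag later."""
    effects = set(effect_times)
    hits = sum(1 for t in cause_times
               if any(t + lag in effects
                      for lag in range(min_lag, max_lag + 1)))
    return hits / len(cause_times) if cause_times else 0.0

def candidate_pairs(centroids, min_distance_km):
    """Cluster pairs far enough apart to count as possible teleconnections.

    Assumed restriction: nearby clusters are skipped because rules
    between them mostly reflect spatial autocorrelation.
    """
    ids = sorted(centroids)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if great_circle_km(centroids[a], centroids[b]) >= min_distance_km]
```

Filtering the cluster pairs before mining also reduces the number of candidate rules that have to be counted.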

EXPERIMENTAL RESULTS
In this section, we apply our episode association rule mining method to earth science data. The data used in this paper comprises CRU (Global Earth Science Data) TS 2.10, SOI (Southern Oscillation Index), and NINO3.4 (Sea Surface Temperature of El Niño Zone).
We ran two kinds of experiments. One tests the correctness and feasibility of our methods, and the other looks for interesting patterns within the data.

Land rainfall and the SOI
This experiment looks for areas of abnormal rainfall associated with an abnormal SOI. We discovered the rules below:

Land rainfall and NINO3.4
This experiment is similar to the first one and found similar rules. The table and figure below have the same meaning as those above. As the four experiments above illustrate, our results are similar to those known from domain knowledge. In other words, our method is reasonable and effective. While the association rules appear correct, the mechanism behind

Figure 1. Data mining steps

Rule 1: Rainfall was abnormally low when the El Niño phenomenon occurred (Red);
Rule 2: Rainfall was abnormally high when the El Niño phenomenon occurred (Blue);
Rule 3: Rainfall was abnormally low when the La Niña phenomenon occurred (Yellow);
Rule 4: Rainfall was abnormally high when the La Niña phenomenon occurred (Green).
The table below shows the confidence and support of the rules we found, and the figure shows the results. In this figure, the different areas are shown in different colors. The circled regions are those already known in earth science.

Figure 1. Areas with abnormal rainfall which relate to an abnormal SOI

Land temperature and the SOI
This experiment tries to find an association between abnormal land temperature and the SOI. We discovered the results below:
Rule 1: Temperature was abnormally high when the El Niño phenomenon occurred (Red);
Rule 2: Temperature was abnormally low when the El Niño phenomenon occurred (Yellow);
Rule 3: Temperature was abnormally low when the La Niña phenomenon occurred (Blue).

Figure 2. Areas with abnormal temperature which relate to an abnormal SOI

The table above shows the rules we found, and the figure shows the results. The circled areas are confirmed by domain knowledge. In addition, the phenomenon of Rule 2 appears only in Yunnan Province, China.

Figure 3. Areas with abnormal rainfall which relate to an abnormal NINO3.4

Figure 4. Areas with abnormal temperatures which relate to an abnormal NINO3.4

Table 1. Association rules between SOI and global events of abnormal rainfall

Table 2. Association rules between SOI and global events of abnormal temperature

Table 3. Association rules between NINO3.4 and global events of abnormal rainfall

Table 4. Association rules between NINO3.4 and global events of abnormal temperatures