AN ASSOCIATION RULE MINING ALGORITHM BASED ON A BOOLEAN MATRIX

Association rule mining is a very important research topic in the field of data mining. Discovering frequent itemsets is the key process in association rule mining. Traditional association rule algorithms adopt an iterative method to discover frequent itemsets, which requires very large calculations and a complicated transaction process. Because of this, a new association rule algorithm called ABBM is proposed in this paper. This new algorithm adopts a Boolean vector “relational calculus” method to discover frequent itemsets. Experimental results show that this algorithm can quickly discover frequent itemsets and effectively mine potential association rules.


INTRODUCTION
Data mining is the key step in the knowledge discovery process, and association rule mining is a very important research topic in the data mining field (Agrawal, Imielinski, & Swami, 1993). The original problem addressed by association rule mining was to find a correlation among sales of different products from the analysis of a large set of supermarket data. At present, research work on association rules is motivated by an extensive range of application areas, such as banking, manufacturing, health care, and telecommunications. The discovery of association rules is typically done in two steps: the discovery of frequent itemsets and the generation of association rules. The second step is rather straightforward, and the first step dominates the processing time, so we explicitly focus this paper on the first step.
A number of efficient association rule mining algorithms have been proposed in the last few years. Among these, the Apriori algorithm (Agrawal & Srikant, 1994) has been very influential. Since its inception, many scholars have improved and optimized the Apriori algorithm and have presented new Apriori-like algorithms, including Klemetinen, Mannila, & Ronkainen (1994), Park, Chen, & Yu (1995), Toivonen (1996), Kotásek & Zendulka (2000), and Han, Pei, & Yin (2000). The Apriori-like algorithms adopt an iterative method to discover frequent itemsets. The algorithm starts from frequent 1-itemsets and continues until all maximum frequent itemsets are discovered. The Apriori-like algorithms consist of two major procedures: the join procedure and the prune procedure. The join procedure combines two frequent k-itemsets, which have the same (k-1)-prefix, to generate a (k+1)-itemset as a new preliminary candidate. Following the join procedure, the prune procedure is used to remove from the preliminary candidate set all itemsets that have a k-subset that is not a frequent itemset. A huge amount of calculation and a complicated transaction process are required during the two procedures. Therefore, the mining efficiency of the Apriori-like algorithms is very unsatisfactory when the transaction database is very large.
In this paper, a new algorithm called ABBM is proposed. This algorithm transforms a transaction database into a Boolean matrix stored in bits, and it uses the Boolean vector "relational calculus" method to discover frequent itemsets. We use the fast and simple "and calculus" in the Boolean matrix to replace the calculations and complicated transactions that deal with large numbers of itemsets. Experimental results show that this algorithm is more effective than the Apriori-like algorithms.
Data Science Journal, Volume 6, Supplement, 9 September 2007

AN ALGORITHM BASED ON BOOLEAN MATRIX (ABBM)
In this section, we propose a new association rule mining algorithm. The section is organized as follows: the relevant definitions and propositions, the details of the ABBM algorithm, and a sample execution of the ABBM algorithm.

Definition and proposition
Definition 1: Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of literals, called items. Let D be an attribute and Dom(D) be the domain of D. A transaction database is a database containing transactions in the form of (d, E), where $d \in Dom(D)$ and $E \subseteq I$.
Definition 2: Let D be a transaction database, m be the number of transactions in D, and minsup be the minimum support of D. The minimum support number minsupth is defined below: $minsupth = \lceil m \times minsup \rceil$
Definition 3: The Boolean matrix is a matrix with element values of '1' or '0.'
Definition 4: When the Boolean 'and calculus' is carried out on any k column vectors of the Boolean matrix, the number of '1's in the operation result is called the k-support of those k column vectors.
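Definitions 2 and 4 can be illustrated with a minimal sketch. It assumes the Boolean matrix is held as a list of 0/1 rows and that minsupth is the product of m and minsup rounded up to an integer. The function names `min_support_number` and `k_support` and the small matrix are hypothetical and are not taken from the paper.

```python
import math

def min_support_number(m, minsup):
    """Assumed form of minsupth: m * minsup rounded up to an integer count."""
    # Round first to absorb floating-point noise (e.g. 0.1 * 3 evaluates to 0.30000000000000004).
    return math.ceil(round(m * minsup, 9))

def k_support(matrix, columns):
    """AND the chosen columns row by row and count the '1's in the result."""
    return sum(all(row[j] == 1 for j in columns) for row in matrix)

# Hypothetical Boolean matrix with 4 transactions (rows) and 3 items (columns).
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
]

print(min_support_number(5, 0.4))  # 2
print(k_support(A, [0, 1]))        # 2: only the first two rows have '1' in both columns
```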
Proposition 1: If the number of '1's in a row vector $A_i$ is smaller than k, $A_i$ need not take part in the calculation of the k-supports.
Rationale: According to the principle of the Boolean 'and calculus,' the result is '1' only when the values of all vector elements are '1.' If the number of '1's in a row vector $A_i$ is smaller than k, any k elements of $A_i$ include at least one '0,' so the row contributes '0' to every 'and' result over k columns and cannot change any k-support.
Proposition 2: Let itemset X be a k-itemset, and let $|L_{k-1}(j)|$ denote the number of occurrences of item j in all frequent (k-1)-itemsets of the frequent set $L_{k-1}$. If X contains an item j for which $|L_{k-1}(j)|$ is smaller than k-1, itemset X is not a frequent itemset (Xu & Zhang, 2003).
Proposition 3: Let $|L_k|$ denote the number of k-itemsets in the frequent set $L_k$. If $|L_k|$ is smaller than k+1, the maximum length of the frequent itemsets is k.
Rationale: A frequent (k+1)-itemset X has k+1 frequent k-subsets. If the number of frequent k-itemsets in the frequent set $L_k$ is smaller than k+1, there are no frequent (k+1)-itemsets in the mined transaction database.
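The counting argument behind Proposition 3 can be checked with a short sketch. The helper names `k_subsets` and `can_stop` and the item labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def k_subsets(itemset, k):
    """All k-item subsets of an itemset."""
    return [frozenset(c) for c in combinations(sorted(itemset), k)]

def can_stop(frequent_k, k):
    """Proposition 3: fewer than k+1 frequent k-itemsets rules out any frequent (k+1)-itemset."""
    return len(frequent_k) < k + 1

# A 4-itemset has exactly 4 subsets of size 3, all of which would have to be frequent.
X = {"a", "b", "c", "d"}
print(len(k_subsets(X, 3)))  # 4

# With only two frequent 3-itemsets, no frequent 4-itemset can exist.
L3 = [frozenset("abc"), frozenset("abd")]
print(can_stop(L3, 3))       # True
```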

Algorithm Details
In this section, we present the ABBM algorithm step by step. In general, the ABBM algorithm consists of four phases as follows:
1. Transforming the transaction database into the Boolean matrix
2. Generating the set of frequent 1-itemsets $L_1$
3. Pruning the Boolean matrix
4. Generating the set of frequent k-itemsets $L_k$ (k>1)

The detailed method, phase by phase, is presented below.

Transforming the transaction database into the Boolean matrix
The mined transaction database is D, with D having m transactions and n items. Let $T = \{T_1, T_2, \ldots, T_m\}$ be the set of transactions and $I = \{I_1, I_2, \ldots, I_n\}$ be the set of items. We set up a Boolean matrix $A_{m \times n}$, which has m rows and n columns. Scanning the transaction database D, if item $I_j$ is in transaction $T_i$, where $1 \le j \le n$ and $1 \le i \le m$, the element value of $A_{ij}$ is '1'; otherwise the value of $A_{ij}$ is '0.'
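The transformation can be sketched as follows, assuming each transaction is given as a set of item labels. The function name `to_boolean_matrix` and the three sample transactions are hypothetical.

```python
def to_boolean_matrix(transactions, items):
    """Return an m x n list of 0/1 rows: entry [i][j] is 1 when items[j] occurs in transactions[i]."""
    return [[1 if item in t else 0 for item in items] for t in transactions]

# Hypothetical data: 3 transactions over 3 items.
items = ["I1", "I2", "I3"]
transactions = [{"I1", "I2"}, {"I2", "I3"}, {"I1", "I2", "I3"}]

A = to_boolean_matrix(transactions, items)
for row in A:
    print(row)
# [1, 1, 0]
# [0, 1, 1]
# [1, 1, 1]
```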

Generating the set of frequent 1-itemsets $L_1$
The Boolean matrix $A_{m \times n}$ is scanned and the support numbers of all items are computed. The support number $I_j.supth$ of item $I_j$ is the number of '1's in the jth column of the Boolean matrix $A_{m \times n}$. If $I_j.supth$ is smaller than the minimum support number minsupth, itemset $\{I_j\}$ is not a frequent 1-itemset and the jth column is deleted from $A_{m \times n}$. Otherwise itemset $\{I_j\}$ is a frequent 1-itemset and is added to the set of frequent 1-itemsets $L_1$.
The sum of the element values of each row is recomputed, and according to Proposition 1, the rows whose sum of element values is smaller than 2 are deleted from this matrix.
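A minimal sketch of this phase is given below, assuming the matrix is a list of 0/1 rows. The function name `frequent_1_itemsets`, its return shape, and the sample matrix are hypothetical.

```python
def frequent_1_itemsets(matrix, items, minsupth):
    """Return (L1, kept item labels, reduced matrix) after column and row deletion."""
    # Support number of each item = number of '1's in its column.
    support = [sum(row[j] for row in matrix) for j in range(len(items))]
    keep = [j for j, s in enumerate(support) if s >= minsupth]
    L1 = [frozenset([items[j]]) for j in keep]
    # Delete infrequent columns, then rows with fewer than two '1's (Proposition 1 with k = 2).
    reduced = [[row[j] for j in keep] for row in matrix]
    reduced = [row for row in reduced if sum(row) >= 2]
    return L1, [items[j] for j in keep], reduced

# Hypothetical 4 x 4 matrix; item I4 occurs only once and the last row holds a single item.
items = ["I1", "I2", "I3", "I4"]
A = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]

L1, kept, reduced = frequent_1_itemsets(A, items, minsupth=2)
print(kept)     # ['I1', 'I2', 'I3']
print(reduced)  # [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
```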

Pruning the Boolean matrix
Pruning the Boolean matrix means deleting some rows and columns from it. First, the columns of the Boolean matrix are pruned according to Proposition 2, as follows: Let I′ be the set of all items in the frequent set $L_{k-1}$, where k>2. Compute $|L_{k-1}(j)|$ for every j∈I′, and delete the column corresponding to item j if $|L_{k-1}(j)|$ is smaller than k-1. Second, recompute the sum of the element values in each row of the Boolean matrix. According to Proposition 1, the rows of the Boolean matrix whose sum of element values is smaller than k are deleted from this matrix.
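The two pruning steps can be sketched as below, assuming the frequent (k-1)-itemsets are available as a list of sets. The function name `prune_matrix` and the sample data are hypothetical. With a minimum support number of 2, the frequent 2-itemsets of this sample matrix are exactly the four listed.

```python
from collections import Counter

def prune_matrix(matrix, items, prev_frequent, k):
    """Prune columns by Proposition 2, then rows by Proposition 1, before computing k-supports."""
    # |L_{k-1}(j)|: number of frequent (k-1)-itemsets that contain item j.
    counts = Counter(item for itemset in prev_frequent for item in itemset)
    keep = [j for j, item in enumerate(items) if counts[item] >= k - 1]
    reduced = [[row[j] for j in keep] for row in matrix]
    # Rows with fewer than k '1's cannot contribute to any k-support.
    reduced = [row for row in reduced if sum(row) >= k]
    return [items[j] for j in keep], reduced

items = ["a", "b", "c", "d"]
A = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
L2 = [frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("cd")]

kept, reduced = prune_matrix(A, items, L2, k=3)
print(kept)     # ['a', 'b', 'c']  (item d appears in only one frequent 2-itemset)
print(reduced)  # [[1, 1, 1], [1, 1, 1]]
```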

Generating the set of frequent k-itemsets $L_k$
Frequent k-itemsets are discovered only by the 'and' relational calculus, which is carried out on combinations of k column vectors. If the Boolean matrix $A_{p \times q}$ has q columns, where $2 < q \le n$ and $minsupth \le p \le m$, $C_q^k$ combinations of k vectors will be produced. The 'and' relational calculus is applied to each combination of k vectors. If the sum of element values in the 'and' calculation result is not smaller than the minimum support number minsupth, the k-itemset corresponding to this combination of k vectors is a frequent k-itemset and is added to the set of frequent k-itemsets $L_k$.
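This phase can be sketched by enumerating every combination of k columns, assuming the pruned matrix is a list of 0/1 rows. The function name `frequent_k_itemsets` and the sample matrix are hypothetical.

```python
from itertools import combinations

def frequent_k_itemsets(matrix, items, k, minsupth):
    """AND every combination of k columns and keep those whose k-support reaches minsupth."""
    result = []
    for cols in combinations(range(len(items)), k):
        # k-support: number of rows that hold '1' in all k chosen columns.
        support = sum(all(row[j] for j in cols) for row in matrix)
        if support >= minsupth:
            result.append((frozenset(items[j] for j in cols), support))
    return result

# Hypothetical pruned matrix with 3 rows and 3 columns.
items = ["I1", "I2", "I3"]
A = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]

for itemset, support in frequent_k_itemsets(A, items, k=2, minsupth=2):
    print(sorted(itemset), support)
# ['I1', 'I2'] 2
# ['I2', 'I3'] 2
```

Enumerating all combinations grows quickly with the number of columns, which is why the column and row pruning before each level matters.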
A detailed description of the ABBM algorithm is given in Figure 1.
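The four phases can be combined roughly as follows. This is a minimal sketch under stated assumptions, not the authors' pseudocode from Figure 1. It assumes each column is packed into a Python integer used as a bit vector (bit i is set when transaction i contains the item), so the 'and' of k columns is a bitwise `&` and the k-support is a count of set bits. The function name `abbm`, the return shape, and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations

def abbm(transactions, items, minsupth):
    """Return {k: [(itemset, support), ...]} for all frequent itemsets (illustrative sketch)."""
    # Phase 1: a single scan builds one bit-packed column per item.
    cols = {it: 0 for it in items}
    for i, t in enumerate(transactions):
        for it in t:
            if it in cols:
                cols[it] |= 1 << i

    # Phase 2: frequent 1-itemsets; infrequent columns are dropped.
    cols = {it: v for it, v in cols.items() if bin(v).count("1") >= minsupth}
    frequent = {1: [(frozenset([it]), bin(v).count("1")) for it, v in cols.items()]}

    k = 2
    # Proposition 3: level k is only possible with at least k frequent (k-1)-itemsets.
    while len(frequent[k - 1]) >= k:
        if k > 2:
            # Phase 3, columns (Proposition 2): drop items in fewer than k-1 frequent (k-1)-itemsets.
            counts = Counter(it for s, _ in frequent[k - 1] for it in s)
            cols = {it: v for it, v in cols.items() if counts[it] >= k - 1}
        # Phase 3, rows (Proposition 1): clear the bits of rows with fewer than k '1's.
        live = 0
        for i in range(len(transactions)):
            if sum((v >> i) & 1 for v in cols.values()) >= k:
                live |= 1 << i
        cols = {it: v & live for it, v in cols.items()}
        # Phase 4: AND each combination of k columns and count the '1's.
        level = []
        for combo in combinations(sorted(cols), k):
            acc = live
            for it in combo:
                acc &= cols[it]
            support = bin(acc).count("1")
            if support >= minsupth:
                level.append((frozenset(combo), support))
        if not level:
            break
        frequent[k] = level
        k += 1
    return frequent

# Hypothetical database of 5 transactions over items a-d, with minsupth = 2.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"c", "d"}, {"c", "d"}, {"a", "d"}]
result = abbm(db, ["a", "b", "c", "d"], minsupth=2)
print(sorted(result))  # [1, 2, 3]
print(result[3])       # [(frozenset({'a', 'b', 'c'}), 2)] (set element order may vary)
```

On this sample the loop stops after level 3, because a single frequent 3-itemset is fewer than the four that Proposition 3 requires for a frequent 4-itemset.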

Example
This section describes a sample execution of the ABBM algorithm. The transaction data of the transaction database D are given in Table 1; the minimum support is 0.4; n=5 is the number of items, and m=5 is the number of transactions. Therefore, the minimum support number minsupth=2.
The transaction database D is transformed into the Boolean matrix $A_{5 \times 5}$.
Table 1

We compute the sum of the element values of each column in the Boolean matrix $A_{5 \times 5}$ to obtain the set of frequent 1-itemsets $L_1$. The fourth column of the Boolean matrix $A_{5 \times 5}$ is deleted because the support number of item I4 is smaller than the minimum support number 2. We then compute the sum of the element values of each row in the Boolean matrix and delete all rows where the sum of the element values is smaller than 2. Finally, the Boolean matrix $A_{4 \times 4}$ is generated.
The operation of 3-supports is executed for all columns of the Boolean matrix $A_{3 \times 4}$ to obtain the set of frequent 3-itemsets $L_3$. According to Proposition 3, the ABBM algorithm is terminated because there are only two frequent 3-itemsets in $L_3$, which is fewer than k+1 = 4.

EXPERIMENT
In order to appraise the performance of the ABBM algorithm, we conducted an experiment using the Apriori algorithm and the ABBM algorithm. The algorithms were implemented in Visual C++ 6.0 and tested on a Windows XP Professional platform. The test database T20I4D100K was generated synthetically by an algorithm designed by the IBM Quest project. The synthetic data generation procedure can be found in detail in Agrawal & Srikant (1994), whose parameter settings we followed: the number of items N is set to 1000; |D| is the number of transactions; |T| is the average size of transactions, and |I| is the average size of the maximum frequent itemsets.
Figure 4 presents the experimental results for different minimum support values. The results show that the performance of the ABBM algorithm is much better than that of the Apriori algorithm. Moreover, the smaller the minimum support, the greater the performance advantage of the ABBM algorithm. This is because the smaller the minimum support, the more candidate itemsets the Apriori algorithm has to determine, and the more time the Apriori algorithm's join and pruning processes take to execute. However, the ABBM algorithm does not produce candidate itemsets, and it spends less time calculating k-supports because the Boolean matrix has been pruned.

CONCLUSION
In this paper, an association rule mining algorithm based on the Boolean matrix (ABBM) is proposed. The main features of this algorithm are that it only scans the transaction database once, it does not produce candidate itemsets, and it adopts the Boolean vector "relational calculus" to discover frequent itemsets. In addition, it stores all transaction data in bits, so it needs less memory space and can be applied to mining large databases.