The discovery of frequent itemsets is one of the most important topics in data mining. Frequent itemset discovery techniques help generate qualitative knowledge that gives business insight and supports decision makers. In the Big Data era, a customizable algorithm that can handle big datasets in a reasonable time has become a necessity. In this paper we propose a new algorithm for frequent itemset discovery that can work in a distributed manner with big datasets. Our approach is based on the original Buddy Prima algorithm and on calculating the Greatest Common Divisor (GCD) between the itemsets that exist in the transaction database. The proposed algorithm introduces a new method to parallelize frequent itemset mining without generating candidate itemsets, and it avoids any communication overhead between the participating nodes. It also exploits the parallelism capabilities of the hardware when run on a single node. The proposed approach can be implemented using the MapReduce technique or Spark. It was successfully applied to transaction databases of different sizes and compared with two well-known algorithms, FP-Growth and Parallel Apriori, at different support levels. The experiments showed that the proposed algorithm achieves a major time improvement over both algorithms, especially with datasets having a huge number of items.

Frequent itemset discovery is one of the most important techniques in data mining.

Frequent itemset mining (FIM) methodologies are popular and widely used by researchers for finding frequent patterns among variables in large datasets.

There are many data mining techniques, such as association rule mining, correlation analysis, sequential pattern analysis, classification, and clustering, that try to find interesting patterns in datasets. The problem of mining frequent itemsets arose first as a sub-problem of mining association rules.

This paper is organized as follows. Section 2 formulates the problem statement of FIM. Section 3 briefly introduces a literature overview and related work. Section 4 presents our proposed approach. Section 5 discusses our experimental results. Finally, Section 6 summarizes our conclusions and future work directions.

Frequent itemset (or pattern) mining is well recognized in the data mining field because of its broad applications in market basket analysis, medical diagnosis, protein sequences, census data, and the CRM of the credit card business.

In this section, a brief overview of the well-known FIM algorithms is presented, including the AIS algorithm, Apriori, FP-Growth, and algorithms that depend on prime number representation.

The Apriori algorithm first finds the set of frequent 1-itemsets L_{1}, which is used to find the frequent 2-itemsets L_{2}, and so on. Each iteration generates the candidate set C_{k} from L_{k–1} in two steps:

1) The join step: to find L_{k}, a set of candidate k-itemsets is generated by joining L_{k–1} with itself. For example, if L_{k–1} contains the items (A, B, C), then the resulting set will contain (AB, AC, BC). This set of candidate itemsets is denoted C_{k}.

2) The prune step: C_{k} is a superset of L_{k}, containing both frequent and non-frequent k-itemsets. A scan of the database determines the frequency of each candidate in C_{k}; candidates that satisfy the minimum support are added to L_{k} for the next iteration, and the others are pruned.
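The join and prune steps can be sketched in Java as follows (an illustrative sketch, not a reference implementation; itemsets are modeled as sorted lists of item names):

```java
import java.util.*;

class AprioriStep {
    // Join step: merge two frequent (k-1)-itemsets that share their first k-2 items.
    static Set<List<String>> join(Set<List<String>> lPrev) {
        Set<List<String>> candidates = new HashSet<>();
        for (List<String> a : lPrev)
            for (List<String> b : lPrev)
                if (a.subList(0, a.size() - 1).equals(b.subList(0, b.size() - 1))
                        && a.get(a.size() - 1).compareTo(b.get(b.size() - 1)) < 0) {
                    List<String> c = new ArrayList<>(a);
                    c.add(b.get(b.size() - 1));
                    candidates.add(c);
                }
        return candidates;
    }

    // Prune step: one database scan keeps the candidates meeting the minimum support.
    static Set<List<String>> prune(Set<List<String>> ck, List<Set<String>> db, int minSupport) {
        Set<List<String>> lk = new HashSet<>();
        for (List<String> c : ck) {
            int count = 0;
            for (Set<String> t : db) if (t.containsAll(c)) count++;
            if (count >= minSupport) lk.add(c);
        }
        return lk;
    }
}
```

For L_{1} = {A, B, C} the join produces the candidates AB, AC, and BC, matching the example above.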

The members of each set C_{k}^{′} are in the form <TID, {X_{k}}>, where each X_{k} is a potentially frequent k-itemset present in the transaction with identifier TID.

C_{1} will then be in the form (X_{1}, support count).

C_{1} = {({1}, 2), ({2}, 3), ({3}, 3), ({5}, 3)}

The frequent itemsets in C_{1} are used to generate the itemsets of C_{2}, and the support count for each itemset generated in C_{2} is calculated using another set called C_{2}^{′}.

C_{2} (itemsets) = {{1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}}

C_{2}^{′} is derived from the original database, taking into consideration only items whose support is equal to or greater than the support threshold.

C_{2}^{′} = {(100, {1 3}), (200, {2 3} {2 5} {3 5}), (300, {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}), (400, {2 5}), …}

To get the frequency of the C_{2} elements, the search is done in C_{2}^{′}.

C_{2} = {({1,2}, 1), ({1,3}, 2), ({1,5}, 1), ({2,3}, 2), ({2,5}, 3), ({3,5}, 2)}

This method may lead to fewer database scans.


In this section, a brief overview of algorithms based on prime number representation is introduced.

In this paper, we apply a new approach to parallelize the Buddy Prima algorithm and optimize it to work more efficiently with Big Data while exploiting the hardware parallelism capabilities. The new algorithm is called POBPA.

Data preparation:

Items, and accordingly transactions, are represented in the same manner as in the Buddy Prima algorithm illustrated in the previous section. POBPA analyzes the transaction dataset and extracts the single items with frequency equal to or higher than the support count specified by the user. It also prepares an ignore list of the infrequent items and their combinations. After removing these infrequent items from the original dataset, it calculates the prime multiplications and the string representation.

Frequent Itemset Deduction (Greatest Common Divisor (GCD) Calculation):

By calculating the GCD between two transactions, we can obtain the common items between them, and by counting the most frequently repeated GCDs we obtain the frequent itemsets in the transaction dataset.
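As a concrete illustration (with a hypothetical prime assignment, not one taken from the paper): if items A, B, C, D map to the primes 2, 3, 5, 7, then transaction {A, B, C} is encoded as 30 and {A, B, D} as 42, and their GCD, 6 = 2 × 3, factors exactly into the primes of the shared items A and B:

```java
class GcdCommonItems {
    // Euclidean GCD by repeated remainder.
    static long gcd(long a, long b) {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }

    public static void main(String[] args) {
        long t1 = 2L * 3 * 5;  // {A, B, C} -> 30
        long t2 = 2L * 3 * 7;  // {A, B, D} -> 42
        System.out.println(gcd(t1, t2));  // 6 = 2 * 3, i.e. the shared itemset {A, B}
    }
}
```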

POBPA is not an architecture-dependent algorithm; it can be applied on different architectures such as distributed environments, multi-core processors, or GPUs. In this study, POBPA exploits the multithreading feature of Java to take advantage of multicore processors. POBPA works in a single-instruction, multiple-data (SIMD) fashion, with all nodes participating equally in the mining process to avoid any communication overhead between nodes. POBPA creates lightweight processes, as the overhead of switching between threads is low. The Java Virtual Machine (JVM) spawns threads when the algorithm runs; each thread works on part of the data, and the JVM takes responsibility for orchestrating the threads and consolidating the results.
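A minimal Java sketch of this threading scheme (illustrative names and structure, not the authors' code): each worker thread tallies the non-trivial GCDs (values greater than 1) of the transaction pairs in its own partition into a private map, and the driver merges the per-thread maps once all futures complete.

```java
import java.util.*;
import java.util.concurrent.*;

class ParallelSubsets {
    static long gcd(long a, long b) { while (b != 0) { long t = a % b; a = b; b = t; } return a; }

    // Each task processes one partition independently; no inter-thread communication.
    static Map<Long, Integer> countGcds(List<Long> partition) {
        Map<Long, Integer> freq = new HashMap<>();
        for (int i = 0; i < partition.size(); i++)
            for (int j = i + 1; j < partition.size(); j++) {
                long g = gcd(partition.get(i), partition.get(j));
                if (g > 1) freq.merge(g, 1, Integer::sum);
            }
        return freq;
    }

    static Map<Long, Integer> mine(List<List<Long>> partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Map<Long, Integer>>> futures = new ArrayList<>();
            for (List<Long> p : partitions) futures.add(pool.submit(() -> countGcds(p)));
            Map<Long, Integer> total = new HashMap<>();
            for (Future<Map<Long, Integer>> f : futures)       // consolidate per-thread results
                f.get().forEach((g, c) -> total.merge(g, c, Integer::sum));
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```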

POBPA uses a horizontal representation for transaction datasets, as shown in the table below.

Dataset Horizontal Representation.

| TID | ITEMS |
| --- | --- |
| T1 | I1, I2, I3, I4, I5, I6 |
| T2 | I1, I2, I4, I7 |
| T3 | I1, I2, I4, I5, I6 |
| T4 | I1, I2 |
| T5 | I3 |
| T6 | I2 |
| T7 | I7 |

The first step of the data preparation phase is counting the frequency of each individual item in the dataset. Once the item frequencies are determined, the algorithm prepares the ignore list, which includes the set of items not to be considered in further analysis. Ignoring items depends on the support count specified by the user: the support count is the minimum frequency for items according to the user's mining requirements.

Ignore List for Support Count 3.

| Ignore List |
| --- |
| I3 |
| I5 |
| I6 |
| I7 |
| I3 I5 |
| I3 I6 |
| I3 I7 |
| I5 I6 |
| I5 I7 |
| I6 I7 |
| I3 I5 I6 |
| I3 I5 I7 |
| I3 I6 I7 |
| I5 I6 I7 |
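The ignore-list construction can be sketched as follows (illustrative code, not the authors' implementation): count each item's frequency, collect the items below the support count, and enumerate their non-empty combinations (the table above lists those up to size three).

```java
import java.util.*;

class IgnoreListBuilder {
    // Items below the support count, preserving first-seen order.
    static List<String> infrequentItems(List<List<String>> db, int supportCount) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (List<String> t : db) for (String i : t) freq.merge(i, 1, Integer::sum);
        List<String> rare = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            if (e.getValue() < supportCount) rare.add(e.getKey());
        return rare;
    }

    // All non-empty combinations of the infrequent items, via bitmask enumeration.
    static List<List<String>> combinations(List<String> rare) {
        List<List<String>> out = new ArrayList<>();
        for (int mask = 1; mask < (1 << rare.size()); mask++) {
            List<String> combo = new ArrayList<>();
            for (int b = 0; b < rare.size(); b++)
                if ((mask & (1 << b)) != 0) combo.add(rare.get(b));
            out.add(combo);
        }
        return out;
    }
}
```

On the example dataset with support count 3, the infrequent single items are I3, I5, I6, and I7.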

Our algorithm assigns a prime number to each item in the dataset except those in the ignore list. Primes are assigned with respect to item frequency: the highest-frequency item is represented by the smallest prime number, which is 2, and so on. This inverse proportional relationship between item frequency and assigned prime number results in smaller prime multiplication values in the subsequent steps, and hence less storage and lower computational complexity.
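A sketch of this frequency-based prime assignment (illustrative code; ties between equal frequencies are broken arbitrarily). In the running example the frequencies of the frequent items are I2: 5, I1: 4, I4: 3, so I2 receives 2, I1 receives 3, and I4 receives 5.

```java
import java.util.*;

class PrimeAssigner {
    // Returns the first n primes (2, 3, 5, ...) by trial division.
    static List<Integer> firstPrimes(int n) {
        List<Integer> primes = new ArrayList<>();
        for (int cand = 2; primes.size() < n; cand++) {
            boolean prime = true;
            for (int p : primes) { if (p * p > cand) break; if (cand % p == 0) { prime = false; break; } }
            if (prime) primes.add(cand);
        }
        return primes;
    }

    // Highest-frequency item gets the smallest prime.
    static Map<String, Integer> assign(Map<String, Integer> freq) {
        List<String> items = new ArrayList<>(freq.keySet());
        items.sort((a, b) -> freq.get(b) - freq.get(a));
        List<Integer> primes = firstPrimes(items.size());
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (int i = 0; i < items.size(); i++) assignment.put(items.get(i), primes.get(i));
        return assignment;
    }
}
```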

After this, each transaction matching one of the combinations in the ignore list is removed from the transaction database. The remaining transactions are kept in two representations: a string-ordered representation, in which the primes replacing the items appear in ascending order in one string, and a prime multiplication representation, as shown in the table below.

Transactions Representations.

| Transaction ID | Items (after ignore list removal) | Item Repetition and Prime Assignment | Repetition | String ordered representation | Prime Multiplication |
| --- | --- | --- | --- | --- | --- |
| T1, T2, T3 | I1, I2, I4 | R(I2) = 5, P(I2) = 2; R(I1) = 4, P(I1) = 3; R(I4) = 3, P(I4) = 5 | 3 | 2 3 5 | 30 |
| T4 | I1, I2 | P(I2) = 2, P(I1) = 3 | 1 | 2 3 | 6 |
| T5 | I3 | removed (matches ignore list) | | | |
| T6 | I2 | P(I2) = 2 | 1 | 2 | 2 |
| T7 | I7 | removed (matches ignore list) | | | |

The next step is partitioning the transaction dataset based on the string representation: one subset includes the transactions starting with 2, another the transactions starting with 3, and so on. This partitioning makes it possible to parallelize the subsequent operations and also helps reduce the computations. The following example illustrates the partitioning process. Suppose that, after the raw transactions were processed as per the previous steps, the string representations were as illustrated in the table below.

Partitioning Process.

| Transaction ID | String Representation | Partition |
| --- | --- | --- |
| T1 | 2 3 7 | 2 |
| T2 | 3 7 11 | 3 |
| T3 | 11 19 23 | 11 |
| T4 | 2 23 29 | 2 |
| T5 | 3 31 37 | 3 |
| T6 | 11 53 59 | 11 |
| T7 | 29 31 37 | 29 |
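The partitioning step above amounts to grouping transactions by the first (smallest) prime in their string representation; a minimal sketch, using arrays of primes as the string representations:

```java
import java.util.*;

class Partitioner {
    // Group string representations by their leading prime.
    static Map<Integer, List<int[]>> partition(List<int[]> transactions) {
        Map<Integer, List<int[]>> parts = new TreeMap<>();
        for (int[] t : transactions)
            parts.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t);
        return parts;
    }
}
```

On the example table this yields four partitions, led by the primes 2, 3, 11, and 29.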

As illustrated in the table above, each transaction is assigned to the partition identified by the first prime in its string representation: T1 and T4 belong to the partition of 2, T2 and T5 to the partition of 3, T3 and T6 to the partition of 11, and T7 to the partition of 29.

The proposed method depends on calculating the GCD between the prime multiplications within each subset resulting from the partitioning, and on calculating cross GCDs among subsets. Before starting the computation of itemset frequencies, a pool of subsets is created, as illustrated in the previous section, to give the distributed nodes, ALUs, or processor cores the ability to handle subsets in parallel.

For each subset, we calculate GCDs using the Euclidean method. The Euclidean algorithm is an efficient method for computing the greatest common divisor (GCD) of two numbers. It is based on the principle that the greatest common divisor of two numbers does not change if the larger number is replaced by the difference between it and the smaller number. For example, 21 is the GCD of 252 and 105 (252 = 21 × 12 and 105 = 21 × 5), and the same number 21 is also the GCD of 105 and 147 (147 = 252 – 105). Since this replacement reduces the larger of the two numbers, repeating the process gives successively smaller pairs of numbers until one of them reaches zero; when that occurs, the other (non-zero) number is the GCD of the original two numbers (wikipedia, 2016).
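The Euclidean method described above, in its usual remainder form:

```java
class Euclid {
    // Repeatedly replace the pair (a, b) by (b, a mod b) until b reaches zero.
    static long gcd(long a, long b) {
        while (b != 0) {
            long r = a % b;
            a = b;
            b = r;
        }
        return a;
    }
}
```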

The transactions in each subset are ordered lexicographically. The algorithm calculates the GCD between each transaction in the subset and the following transactions until the exit criterion is reached. The exit criterion depends on the string representation: since the string contains the transaction's items ordered ascendingly by prime value, if, while calculating GCDs for a certain transaction, the last item of its string representation is less than the second item of the other transaction's representation, then no GCD can be found between these two transactions, nor between this transaction and any subsequent transaction in the subset. Each GCD value and its frequency are placed in a hash table, and the frequency is incremented whenever the same GCD value is encountered again in a subsequent transaction.

As an example, suppose we have the following subset, which starts with 2: ((2, 3, 5) [30], (2, 3, 5, 7) [210], (2, 3, 7) [42], (2, 3, 11) [66], (2, 13) [26], (2, 13, 17) [442], (2, 13, 17, 19) [8398]). Each transaction is shown in its string-ordered representation followed by its prime multiplication in square brackets. Note that any two transactions in the same subset have, by definition, a common item; however, during GCD frequency calculation only GCDs of two or more common items (two or more prime numbers) are considered, since the frequencies of single items were already calculated in the preparation phase, as illustrated in Section 4.1. Calculation starts with transaction (2, 3, 5), computing the GCD between it and the rest of the transactions in the same subset using the prime multiplication representations; the resulting GCDs and their frequencies are stored in a hash table. For transaction (2, 3, 5), the exit criterion is met when we reach transaction (2, 13): the second item of (2, 13) is greater than the last item of (2, 3, 5), so no GCD other than 2 can be found between them, and the lexicographic order guarantees that no such GCD exists between (2, 3, 5) and any transaction after (2, 13) either. After calculating the GCDs between all transactions in the subset, we check GCDs between transactions across different subsets. For example, for transaction (2, 3, 5), removing 2 gives a new transaction (3, 5); if (3, 5) already exists in the subset of 3, we increase its frequency by one. This is repeated for the other suffixes, for example with (5, 7) in the subset starting with 5. The result of this phase is one table containing all GCDs and their frequencies.
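The within-subset computation above can be sketched as follows (an illustrative sketch, not the authors' implementation): transactions arrive lexicographically ordered, the inner loop stops at the exit criterion, and only GCDs composed of at least two primes are tallied.

```java
import java.util.*;

class SubsetGcdMiner {
    static long gcd(long a, long b) { while (b != 0) { long t = a % b; a = b; b = t; } return a; }

    // True when g is a product of at least two primes (g > 1 and not itself prime).
    static boolean hasTwoFactors(long g) {
        if (g < 4) return false;
        for (long d = 2; d * d <= g; d++) if (g % d == 0) return true;
        return false;
    }

    // reps: lexicographically ordered string representations; products: matching prime multiplications.
    static Map<Long, Integer> mineSubset(int[][] reps, long[] products) {
        Map<Long, Integer> freq = new HashMap<>();
        for (int i = 0; i < reps.length; i++) {
            int last = reps[i][reps[i].length - 1];
            for (int j = i + 1; j < reps.length; j++) {
                // Exit criterion: no GCD beyond the leading prime is possible from here on.
                if (reps[j].length > 1 && last < reps[j][1]) break;
                long g = gcd(products[i], products[j]);
                if (hasTwoFactors(g)) freq.merge(g, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

Running this on the example subset tallies, among others, the GCD 6 four times and the GCD 26 twice.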

We experimented with datasets from different applications, obtained from the UCI repository of machine learning databases.

Datasets Used for the Experiment.

| Dataset Name | Number of Transactions | Number of Items |
| --- | --- | --- |
| Mushroom | 8124 | 90 |
| letRecog | 20000 | 106 |
| Adult | 48842 | 97 |
| Retail | 88146 | 16469 |

We compared the execution time of the proposed algorithm with FP-Growth and Parallel Apriori. The performance metric in the experiments is the total execution time, and the results show that POBPA achieves a tremendous improvement, especially with datasets that have a huge number of items. We ran our experiments on the previously illustrated datasets using minimum support thresholds from 30% to 70%.

The table below shows the test results for the Retail dataset.

Test Results with Retail Dataset.

Time Consumed in Seconds

| Support Count | FP-Growth | Parallel Apriori | POBPA |
| --- | --- | --- | --- |
| 30% | 28 | 37 | <0.05 |
| 40% | 26 | 34 | <0.05 |
| 50% | 22.1 | 32.9 | <0.05 |
| 60% | 22.04 | 32.3 | <0.05 |
| 70% | 21.19 | 27 | <0.05 |

The next table shows the test results for the Mushroom dataset.

Test Results with Mushroom Dataset.

Time Consumed in Seconds

| Support Count | FP-Growth | Parallel Apriori | POBPA |
| --- | --- | --- | --- |
| 30% | 59 | 13 | 0.88 |
| 40% | 48 | 3.9 | 0.64 |
| 50% | 32 | 1.9 | 0.15 |
| 60% | 30 | 1.5 | 0.07 |
| 70% | 28 | 1.4 | <0.05 |

Test Results with Letter Recognition.

Time Consumed in Seconds

| Support Count | FP-Growth | Parallel Apriori | POBPA |
| --- | --- | --- | --- |
| 30% | 273 | 38 | 24 |
| 40% | 200 | 23 | 14 |
| 50% | 179 | 18 | 3 |
| 60% | 150 | 18 | 1.2 |
| 70% | 110 | 15 | 0.05 |

Test Results with Adult dataset.

Time Consumed in Seconds

| Support Count | FP-Growth | Parallel Apriori | POBPA |
| --- | --- | --- | --- |
| 30% | 400 | 24 | 1.8 |
| 40% | 328 | 10 | 1.5 |
| 50% | 300 | 8 | 0.82 |
| 60% | 213 | 6 | 0.75 |
| 70% | 200 | 4 | 0.5 |

In the tables above, POBPA outperforms both FP-Growth and Parallel Apriori at every support threshold, with the largest margin on the Retail dataset, which has by far the largest number of items.

In this paper we have introduced a new technique for FIM. Our approach, POBPA, outperforms other approaches found in the literature. POBPA is not an architecture-dependent algorithm; it can be applied on different architectures such as distributed environments and multi-core processors. It aims to parallelize the Buddy Prima algorithm and to optimize it to work more efficiently with Big Data while exploiting the hardware parallelism capabilities. It depends on GCD calculation between the different transactions in the transaction database. It assigns prime numbers in inverse proportion to item frequency, which guarantees smaller prime multiplication values and hence less storage and lower computational complexity. POBPA achieves excellent results when comparing its execution time with FP-Growth and Parallel Apriori, especially for datasets with a large number of items.

Future work directions include applying POBPA on a distributed architecture with datasets containing millions of records. It could be implemented using the Hadoop Distributed File System (HDFS) and the MapReduce technique: Hadoop's distributed file system maps data locations across a cluster, and MapReduce programs then run on the same servers, which allows faster processing of data, including large volumes of unstructured data; Hadoop MapReduce can process terabytes of data in minutes. POBPA could also be applied to stream analytics, for example to identify the response of social media to a certain topic.

The authors have no competing interests to declare.