DISCOVERING IMPERCEPTIBLE ASSOCIATIONS BASED ON INTERESTINGNESS: A UTILITY-ORIENTED DATA MINING APPROACH

This article proposes an innovative utility sentient approach for the mining of interesting association patterns from transaction databases. First, frequent patterns are discovered from the transaction database using the FP-Growth algorithm. From the frequent patterns mined, this approach extracts novel interesting association patterns with emphasis on significance, utility, and the subjective interests of the users. The experimental results show the efficiency of this approach in mining utility-oriented and interesting association rules. A comparative analysis is also presented to illustrate the effectiveness of our approach.


INTRODUCTION
Data mining is a noteworthy field of computer science that has attracted great research and development interest from a wide variety of people. The principal motivation behind the emergence of data mining stems from decision support problems that have affected a large number of business organizations (Tsur, 1990; Wang et al., 1994). Data mining, also termed Knowledge Discovery in Databases (KDD), has been defined as "The nontrivial extraction of hidden, novel, and potentially useful information from data" (Frawley et al., 1992). Data mining makes use of machine learning and various statistical and visualization techniques to determine and represent knowledge in an easily interpretable form (Soundararajan et al., 2005). Data mining tasks can be generally classified into descriptive mining and predictive mining. The mined information is articulated as a model of the semantic structure of the dataset, and this model facilitates the prediction or classification of the collected data (Cunningham & Holmes, 1999). Recently, the incorporation of utility constraints into data mining tasks has emerged as an area of intensive research.
Utility-based data mining (Weiss et al., 2006; Yeh et al., 2007) is an extensive area that focuses on all aspects of economic utility in data mining and is aimed at incorporating utility into both predictive and descriptive data mining tasks. With computers and e-commerce gaining fast and widespread adoption, a record number of transactional databases have become available. The chief focus of data mining on transactional databases is the mining of association rules that determine the correlation among items in transaction records. Data mining researchers are chiefly concerned with qualitative aspects of attributes (e.g., significance and utility) rather than quantitative ones (e.g., number of appearances in a database), as qualitative properties are essential for fully exploiting the attributes present in the dataset. Within the data mining community, the discovery of interesting association rules has long been recognized as a way of improving the business utility of an enterprise. Business improvement demands the extraction of interesting association patterns that are both statistically and semantically significant to the business utility.
In recent times, incorporating utility constraints in itemset and association rule mining has gained enormous popularity (Yeh et al., 2007; Podpecan et al., 2007). We have presented a survey and comparative study of the significant research based on utility for itemset and association rule mining in Shankar and Purusothaman (2009). On the basis of the survey conducted, we have presented a novel utility sentient approach for mining interesting association rules in our earlier work (Shankar & Purusothaman, 2009).
In this article, we present our previous work, a novel utility-based approach for the mining of high utility and interesting association patterns from transactional databases, with detailed experimental results and a comparative analysis with Muyeba et al. (2008) and Khan et al. (2008) to emphasize the performance of our research. The primary objective of our approach is to identify novel and interesting association rules from the historical buying patterns of customers, so as to increase the business of an enterprise. This approach makes use of the FP-Growth algorithm for mining frequent patterns. Subsequently, the mined frequent patterns are subjected to the computation of the interestingness measure and the utility weight (significance of an item). The process results in a class of novel, utility-oriented, and interesting association patterns. We compare the performance of our approach to two existing weighted association rule mining approaches, Weighted ARM (Muyeba et al., 2008) and Weighted Utility ARM (Khan et al., 2008), to illustrate the efficacy of our approach. An experimental and comparative analysis of this approach shows that it identifies interesting patterns that are both statistically and semantically important in improving business utility.
The remainder of the paper is organized as follows: section 2 provides a brief review of the research related to our work. The approach proposed for mining utility sentient and novel interesting association patterns is presented in section 3. Section 4 presents the experimental results of our approach with a comparative analysis. Section 5 sums up the paper.

REVIEW OF RELATED RESEARCH
A handful of studies are available in the literature for mining frequent patterns and association rules based on weight and utility. A brief review of the significant research is presented here. Agrawal et al. (1993) have proposed a proficient algorithm that generates all important association rules from the database. Buffer management, novel estimation, and pruning techniques are the options added to the proposed algorithm. The problem of mining association rules from huge relational tables containing both quantitative and categorical attributes has been dealt with by Srikant et al. (1996). Moreover, they have established a measure of partial completeness, which quantifies the information lost due to partitioning. Aggarwal et al. (1998) have conducted an extensive survey of the research available for association rule mining. They have also discussed numerous variations of the association rule problem proposed in the literature and their practical applications. A method of mining weighted association rules has been proposed by Cai et al. (1998). They have offered two distinct definitions for weighted support, with normalization and without normalization, and have also proposed a novel algorithm on the basis of the support bounds.
A two-fold approach for association rule mining has been presented by Wang et al. (2000). First, frequent itemsets were generated, and then the maximum weighted association rules were derived using an "ordered" shrinkage approach. Coenen and Leng (2001) have given a method for generating frequent itemsets that minimizes the task by using efficient restructuring of data accompanied by a partial computation of the totals required. Dong and Tjortjis (2003) have discussed an enhanced memory efficient data structure of a quantitative approach to mine association rules from data. They have combined the best features of the three algorithms (the Quantitative Approach, DHP, and Apriori) in their approach. Bodon (2003) has presented an implementation procedure for the Apriori algorithm that exceeded all other available implementations.
An efficient algorithm for association rule mining was presented by Wang and Tjortjis (2004). The enormous time requirement incurred for large itemset generation was substantially reduced by performing a single scan of the database and then employing logical operations to achieve large itemset generation. Verma et al. (2005) have presented an innovative algorithm for discovering association rules on time dependent data utilizing efficient T-tree and P-tree data structures. The algorithm achieves considerable benefit in terms of time and memory even after integrating the time dimension. Yuan and Huang (2005) have proposed the Matrix Algorithm for proficient generation of large frequent candidate sets. First, the algorithm generates a matrix containing values 1 or 0 by conducting a single pass over the concrete database. The resulting matrix is employed in the generation of frequent candidate sets. Ultimately, from the frequent candidate sets, association rules are mined. Palshikar et al. (2007) presented the concept of heavy itemsets, which compactly represent an exponential number of rules. They have given a proficient theoretical characterization of a heavy itemset and a proficient greedy algorithm for generating a collection of disjoint heavy itemsets in a given transactional database.
An innovative utility-frequent mining model for the discovery of all itemsets that can generate a user specified utility in transactions has been presented by Yeh et al. (2007). They have proposed two algorithms for efficiently mining utility-frequent itemsets: 1) a bottom-up two-phase algorithm (BU-UFM) and 2) a top-down two-phase algorithm (TD-UFM). Podpecan et al. (2007) have presented an innovative and effective algorithm, FUFM (Fast Utility-Frequent Mining), that discovers all utility-frequent itemsets within the given utility and support constraint thresholds.

SUGGESTED APPROACH FOR MINING INTERESTING ASSOCIATION PATTERNS
Our approach for interesting association pattern mining is presented in this section. The primary aim of this research is to devise an efficient approach that can mine novel interesting association patterns from a transaction database of an enterprise so as to augment business development. Generally, a transaction database presents a good depiction of the buying patterns of a business's customers. Our approach intends to discover novel interesting association patterns that may or may not be frequent in the database but are likely to improve the enterprise's business. This approach incorporates three factors in addition to frequency for association pattern mining: 1) significance weight, 2) utility, and 3) subjective interestingness to the users. These factors were selected for the following reasons. Item significance: each item in a transaction has a distinct level of significance based on its importance. Item utility: each item has a subjective utility, such as profit in dollars or some other utility factor, that influences business development.
Interestingness is the measure of the quantity of 'interest' that a pattern evokes upon inspection. Interestingness is considered to be a significant factor in data mining because it influences the extraction of novel and valuable (interesting) patterns and also plays a vital role in determining the efficacy of an enterprise. In general, the measures of interestingness can be categorized into objective and subjective measures. The objective measures of interestingness are commonly represented via statistical or mathematical criteria, while subjective measures deal with more realistic criteria such as efficiency or applicability.

Figure 1. Block diagram of our approach
Data Science Journal, Volume 9, 24 February 2010
The above figure depicts the block diagram of the proposed approach. The sequential steps involved in the proposed approach are as follows:
1. Computation of the significance weight for all items based on profit.
2. Determination of frequent itemsets using the FP-Growth algorithm.
3. Determination of the frequent patterns with utility weight greater than a threshold value.
4. Discovery of novel interesting association patterns from the selected patterns on the basis of the interestingness measure.
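The input to the proposed approach is a transaction database that contains buying patterns of customers from basket data. The definitions involved in the process of association rule mining from a transaction database $D$ are as follows. Let $X$ and $Y$ be itemsets, and let $T$ denote a transaction in $D$. Two customary measures that serve as the basis for association rule mining are support and confidence.
Support of an itemset is defined as the measure of the number of transactions $T$ containing the itemsets $X$ and $Y$ with respect to TDI, taken here as the total number of transactions in $D$:

$$\mathrm{Support}(X \Rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{\mathrm{TDI}}$$

Confidence of an association rule $X \Rightarrow Y$ is defined as the ratio of the number of transactions that contain $X \cup Y$ to the number of transactions that contain $X$:

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{|\{T \in D : X \cup Y \subseteq T\}|}{|\{T \in D : X \subseteq T\}|}$$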
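The two measures can be illustrated with a short Python sketch. The function names and the four-transaction basket below are hypothetical and are not taken from the paper's tables; support is computed relative to the total number of transactions, which is an assumed reading of the definition.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Transactions containing X and Y together, divided by those containing X."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if xy <= set(t))
    return n_xy / n_x if n_x else 0.0


# Hypothetical basket data, used only for illustration.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
]

sup_ab = support(transactions, {"A", "B"})          # 2 of 4 transactions
conf_a_b = confidence(transactions, {"A"}, {"B"})   # 2 of the 3 containing A
```

In this toy basket the rule A ⇒ B has support 0.5 and confidence 2/3.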

Significance weight calculation
The first step of the proposed approach is the computation of the significance weight for all items based on profit. The significance weight of each item is calculated from the item profits, where n = number of items and P = profit of an item.
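The text names only the quantities n and P for this step. A minimal sketch, assuming that the significance weight of an item is its share of the total profit over all n items (this normalization is an assumption, not a formula confirmed by the source), could look as follows. The function name and the profit figures are hypothetical.

```python
def significance_weights(profits):
    """Assumed form: SW_i = P_i / (P_1 + ... + P_n); not confirmed by the source."""
    total = sum(profits.values())
    return {item: p / total for item, p in profits.items()}


# Hypothetical per-item profits, used only for illustration.
profits = {"A": 60.0, "B": 10.0, "C": 30.0}
sw = significance_weights(profits)
```

Under this assumption the weights sum to one, so an item's weight can be read directly as its relative contribution to profit.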
Subsequently, the data items in the transaction database are represented via a matrix $M$ for pattern extraction. The patterns found in the transactions are then extracted as a set by making use of the above matrix, where $I_\Delta$ = item index of all elements in $T_{ij}$. The FP-Growth algorithm is then employed to discover all the frequent patterns from the set of extracted transaction patterns $T_n$.
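The matrix representation can be sketched as below, assuming a binary transaction-by-item matrix (the source does not state the encoding of the matrix, so this is an assumption). The helper names are hypothetical.

```python
def to_matrix(transactions, items):
    """Binary matrix: entry [i][j] is 1 when transaction i contains items[j]."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


def extract_patterns(matrix, items):
    """Recover each transaction's pattern from the indices of its non-zero entries."""
    return [[items[j] for j, v in enumerate(row) if v] for row in matrix]


items = ["A", "B", "C"]
transactions = [{"A", "C"}, {"B"}, {"A", "B", "C"}]  # hypothetical data

M = to_matrix(transactions, items)
patterns = extract_patterns(M, items)
```

The extracted patterns list the items of each transaction in the fixed column order, which is the form passed to the frequent pattern miner.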

FP-Growth algorithm
The FP-Growth algorithm is one contemporary approach for mining frequent itemsets (Han et al., 2000). The FP-Growth algorithm makes use of a prefix tree representation of the given transaction database (called an FP-tree). The FP-tree is used to lessen the amount of memory needed for storing transactions. The major steps involved in the FP-Growth algorithm are:
1) Construction of a memory structure called the FP-tree.
2) Repeated application of the actual FP-Growth procedure to the constructed FP-tree.
3) Discovery of all frequent itemsets by analyzing projections (conditional FP-trees) of the tree in a depth-first manner pertaining to the frequent prefixes mined thus far.
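The three steps can be sketched in compact form. The code below is an illustrative Python rendering of FP-Growth, not the authors' Java implementation; the class and function names, the absolute `min_count` threshold, and the sample transactions are all assumptions made for the example.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a count, a parent link, and child nodes."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_count):
    """Step 1: build an FP-tree from (items, count) pairs."""
    counts = defaultdict(int)
    for items, c in weighted_transactions:
        for item in set(items):
            counts[item] += c
    frequent = {i: n for i, n in counts.items() if n >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> tree nodes that carry this item
    for items, c in weighted_transactions:
        # Keep frequent items, ordered by descending count (ties by name).
        path = sorted((i for i in set(items) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += c
    return header, frequent


def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Steps 2 and 3: yield (itemset, count) pairs, recursing depth-first."""
    header, frequent = build_tree(weighted_transactions, min_count)
    for item, count in frequent.items():
        pattern = suffix | {item}
        yield pattern, count
        # Conditional pattern base: the prefix path above each node of `item`.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Mine the projection (conditional FP-tree) for the extended suffix.
        yield from fp_growth(base, min_count, pattern)


def frequent_itemsets(transactions, min_count):
    """Convenience wrapper returning {frozenset: count}."""
    return dict(fp_growth([(list(t), 1) for t in transactions], min_count))


# Hypothetical transactions, used only for illustration.
result = frequent_itemsets(
    [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}], min_count=2)
```

With a minimum count of 2, this toy database yields the three single items (count 3 each) and the three pairs (count 2 each), while {A, B, C} occurs only once and is discarded.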

Frequent pattern selection based on utility weight
The next step in frequent pattern mining is the selection of utility weighted frequent patterns. The set of frequent patterns mined is subjected to utility weight computation. The utility weight of a frequent pattern is computed using equation (5), where $W_u$ = utility weight, $f$ = frequency of pattern, and $SW_i$ = significance weight of the $i$-th item in the current pattern.
Then, based on a predefined threshold $\alpha$, a set of patterns with higher utility weight is selected:

$$S_p = \{\, p : W_u(p) > \alpha \,\}$$

where $S_p$ = set of selected patterns, $\alpha$ = predefined threshold value, and $W_u$ = utility weight.
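The selection step can be illustrated as follows. The text defines the utility weight in terms of the pattern frequency f and the significance weights of its items; the sketch assumes the product of f and the summed significance weights, which is one plausible reading rather than a formula confirmed by the source. All names and values are hypothetical.

```python
def utility_weight(pattern, frequency, sw):
    """Assumed form of W_u: frequency times the sum of the items' significance weights."""
    return frequency * sum(sw[item] for item in pattern)


def select_patterns(frequent, sw, alpha):
    """Keep the patterns whose utility weight exceeds the threshold alpha."""
    weights = {p: utility_weight(p, f, sw) for p, f in frequent.items()}
    return {p: w for p, w in weights.items() if w > alpha}


# Hypothetical significance weights and pattern frequencies.
sw = {"A": 0.6, "B": 0.1, "C": 0.3}
frequent = {
    frozenset({"A"}): 3,
    frozenset({"B"}): 3,
    frozenset({"A", "C"}): 2,
}
selected = select_patterns(frequent, sw, alpha=1.5)
```

Here the equally frequent single items A and B are separated by their significance: A is kept and B is dropped, which is the effect the utility weight is meant to have.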

Interesting association pattern mining
Following the utility weight computation, the next step is the mining of novel interesting association patterns from the selected frequent patterns. This sub-section describes the algorithm employed for the mining of these patterns. The subjective interestingness to users is used in the algorithm for the mining of interesting association patterns. The algorithm takes as input: 1) the frequent patterns selected on the basis of utility weight and 2) all the frequent patterns mined by the FP-Growth algorithm. The algorithm aims at mining novel interesting association patterns that are present in transactions and may or may not be frequent but are likely to increase business. The steps involved in the proposed algorithm for mining interesting association patterns are:
1) Extracting all other frequent patterns containing each discrete item in the selected frequent pattern.
2) Obtaining the set difference of the selected itemset (pattern) from the extracted itemset.
3) Pairing every discrete item in the resulting patterns (set difference) with its reference item.
Thus, for every selected pattern, we obtain a set of patterns, each with two items. The pseudo code for the above operation is given below.
4) Grouping all patterns from the set of patterns created for a selected pattern with an identical second item, the first item being the reference.
5) Computing a single consolidated weight for all the pattern sets obtained in step 4 by adding the individual weights of patterns present. The weight of a pattern is the sum of all of its items' weights.

Assumptions
Item weight is the product of the frequency and significance weight of the item. Frequency of an item is defined as the occurrence count of the item in transactions. The frequency of item $X$ in pattern $Y \rightarrow X$ is calculated using equation (7), where $X_{tc}$ = the count of item $X$ in the current transaction and $n$ = the total number of transactions.
6) Computing the interestingness weight of a pattern group, where $I_w$ = interestingness weight and $m$ = number of patterns in the group.
7) Combining the second item from each pattern group with its corresponding selected pattern and assigning the interestingness weight of that group to the newly formed pattern.
8) Sorting the newly formed patterns on the basis of their interestingness weight. On occurrence of more than one pattern with an identical last item, only the pattern with the highest interestingness weight is used in the subsequent process. The patterns with higher interestingness weight are selected based on a threshold value $\psi$.
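Steps 1 to 8 can be sketched end to end as shown below. This is an illustrative reading, not the authors' algorithm. The function name and sample data are hypothetical, item weights are passed in precomputed, repeated pairs are counted once per occurrence, the step 6 weight is assumed to be the consolidated weight divided by m, and patterns are kept when their weight is at least ψ.

```python
from collections import defaultdict


def interesting_patterns(selected, frequent, item_weight, psi):
    """Sketch of steps 1-8; patterns are frozensets, item_weight maps item -> weight."""
    best = {}  # last item -> (interestingness weight, selected pattern)
    for sel in selected:
        groups = defaultdict(list)  # step 4: pairs grouped by their second item
        for ref in sel:
            for pat in frequent:
                # Step 1: other frequent patterns containing the reference item.
                if pat != sel and ref in pat:
                    # Steps 2-3: set difference, each item paired with the reference.
                    for other in pat - sel:
                        groups[other].append((ref, other))
        for other, group in groups.items():
            # Step 5: a pair's weight is the sum of its two item weights.
            consolidated = sum(item_weight[a] + item_weight[b] for a, b in group)
            # Step 6 (assumed form): consolidated weight averaged over the m pairs.
            i_w = consolidated / len(group)
            # Step 8 (part): for an identical last item keep the highest weight.
            if other not in best or i_w > best[other][0]:
                best[other] = (i_w, sel)
    # Step 7: attach the second item to its selected pattern; step 8: sort and filter.
    ranked = sorted(((sel | {last}, w) for last, (w, sel) in best.items()),
                    key=lambda pw: pw[1], reverse=True)
    return [(p, w) for p, w in ranked if w >= psi]


# Hypothetical inputs, used only for illustration.
selected = [frozenset({"A", "B"})]
frequent = [frozenset({"A"}), frozenset({"A", "B"}), frozenset({"A", "C"}),
            frozenset({"B", "C"}), frozenset({"B", "D"})]
item_weight = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 0.5}

novel = interesting_patterns(selected, frequent, item_weight, psi=2.0)
```

In this example the selected pattern {A, B} is extended with C, because C co-occurs with both A and B in other frequent patterns, while the extension with D falls below the threshold. The pattern {A, B, C} itself need not be frequent, which matches the stated aim of finding patterns that may or may not be frequent.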

EXPERIMENTAL RESULTS AND ANALYSIS
The results and analysis obtained from experimentation on the proposed approach are presented in this section. The proposed approach is programmed in Java. The profit and computed significance weight for individual items are depicted in Table 1. A sample transaction database is shown in Table 2. Tables 1 and 2 illustrate the results of the algorithms described in sections 3.2, 3.3, and 3.4. The frequent patterns discovered by the FP-Growth algorithm with their corresponding frequencies are tabulated in Table 3. Table 4 shows the utility weight of the generated frequent patterns. Table 5 contains the set of patterns with utility weight greater than 2.0. The novel patterns discovered with their interestingness weight are tabulated in Table 6. Table 8 shows the final set of novel interesting patterns selected from Table 7 with a minimum threshold of 10.0. A discussion of the results obtained from experimentation with our approach is presented here. In general, standard association rule mining algorithms produce an enormous number of patterns, and users are expected to shortlist or select the patterns that are interesting to their own businesses. However, from the results, it is clear that our approach, in contrast to traditional association rule mining algorithms, generates only a small number of interesting association patterns that are both statistically and semantically important for business development.
Most of the algorithms for weighted and utility association rule mining existing in the literature employ weight and utility measures with support thresholds in frequent pattern mining. These algorithms are likely to miss the itemsets that are of less utility but are of high frequency. This in turn affects business utility because, in most cases, frequency plays a vital part in business development, including sales backup and more.
On the whole, different businesses necessitate different levels of interestingness and utility. Therefore, an approach that outputs patterns specific to each criterion is highly valuable to users from different types of businesses. In order to facilitate this specificity, we have incorporated multiple levels of refinement (frequency, significance weight, utility, and interestingness weight), each useful for different types of business. Finally, the last step of our approach, interestingness, is likely to generate novel association patterns by correlating the utility weight and frequent patterns. These novel patterns may or may not be frequent but could be very interesting to users for business development.

Comparative Analysis
The analysis of the results obtained from the comparison of the proposed approach with standard ARM (Agrawal et al., 1993), weighted ARM (Muyeba et al., 2008), and weighted utility ARM (Khan et al., 2008) is presented in this sub-section. This sub-section provides the comparative results of our approach and a brief description validating the significance of this approach when compared to the existing works of Muyeba et al. (2008) and Khan et al. (2008). Table 9 depicts the different patterns resulting from standard ARM (Agrawal et al., 1993), weighted ARM (Muyeba et al., 2008), and weighted utility ARM (Khan et al., 2008), each with 30% weight, and the proposed approach (high utility and interesting patterns). From Table 9, it is clear that there is a steady decrease in the number of entries in each column of the table. The reduced number of patterns shows that the incorporation of distinct measures (weight, utility, etc.) into association rule mining brings about a progressive refinement of the quality of the discovered patterns. Khan et al. (2008) have shown that weighted utility ARM (WUARM) is better than weighted ARM because WUARM considers all possible patterns and then uses the items' weight and utilities for refinement. The patterns with a highlighted background in column 6 are the novel interesting patterns discovered using the efficient approach presented. An investigation of the patterns discovered by standard ARM, weighted ARM, WUARM, and our approach shows that our approach neglects some patterns that are significant but not valuable and brings about optimality between weight and utility measures.
For instance, pattern B (0.51), which qualifies in WUARM, is not selected by our approach because, even though the pattern is significant, it is not valuable (i.e., profitable). Also, the high utility patterns of our approach can be seen as an optimal combination of the patterns discovered in weighted ARM and WUARM. Furthermore, the high utility patterns discovered in historical buying patterns certainly signify the importance of the items in the growth of the enterprise. In addition to utility, we have incorporated the subjective interestingness of the users for pattern extraction. The set of novel interesting patterns mined based on the user's subjective interestingness depicts the buying patterns of the future and can serve as an effective means to increase business development.

CONCLUSION
Association rules have been utilized extensively to determine customer buying patterns from market basket data. In recent times, researchers have been greatly interested in incorporating utility considerations into association rule mining. Recently, the data mining community has turned to the mining of interesting association rules to facilitate business development by increasing the utility of an enterprise. The above scenario emphasizes the need to discover interesting and utility sentient association patterns that are both statistically and semantically important to business development. In this paper, we have presented a novel approach for mining high utility and interesting association patterns from transaction databases. This approach is aimed at mining association patterns that facilitate improvement of business utility. It focuses on the utility, significance, and interestingness of individual items for the mining of novel association patterns. The mined interesting association patterns are used to offer valuable suggestions to an enterprise for intensifying its business utility. The experimental results and analysis demonstrate the effectiveness of our approach in mining utility sentient and interesting association patterns.

Table 1. Profit and significance weight of items

Table 2. Customer transactions database

Table 4. Utility weight of frequent patterns

Table 6. Interestingness weight of newly formed patterns

Table 7. Patterns selected from newly formed patterns

Table 8. Novel interesting patterns