A ROUGH SET APPROACH FOR CUSTOMER SEGMENTATION

Customer segmentation is a process that divides a business’s total customers into groups according to their diversity of purchasing behavior and characteristics. The data mining clustering technique can be used to accomplish this customer segmentation. This technique clusters the customers in such a way that the customers in one group behave similarly when compared to the customers in other groups. The customer related data are categorical in nature. However, the clustering algorithms for categorical data are few and are unable to handle uncertainty. Rough set theory (RST) is a mathematical approach that handles uncertainty and is capable of discovering knowledge from a database. This paper proposes a new clustering technique called MADO (Minimum Average Dissimilarity between Objects) for categorical data based on elements of RST. The proposed algorithm is compared with other RST based clustering algorithms, such as MMR (Min-Min Roughness), MMeR (Min Mean Roughness), SDR (Standard Deviation Roughness), SSDR (Standard deviation of Standard Deviation Roughness), and MADE (Maximal Attributes DEpendency). The results show that for the real customer data considered, the MADO algorithm achieves clusters with higher cohesion, lower coupling, and less computational complexity when compared to the above mentioned algorithms. The proposed algorithm has also been tested on a synthetic data set to prove that it is also suitable for high dimensional data.


INTRODUCTION
Customer Relationship Management (CRM) is a business methodology used to build a relationship with long term profitable customers by analyzing customer needs and behaviors. It is an important technology in every business because business is customer centric. CRM helps business leaders gain insight into customer behavior and life time value to increase profit by acting according to the customer characteristics. Customer segmentation plays an important role in CRM. It divides customers into groups according to their purchasing behavior, allowing business leaders to design and establish different strategies for each group of customers and thus maximize their value (Ling & Yen, 2001). Recency (R), Frequency (F), and Monetary (M) are the attributes chosen for describing the purchasing characteristics of customers. The RFM model works very effectively in customer segmentation (Wu & Lin, 2005). R indicates the time interval between the present and the previous transaction date of a customer. F indicates the number of transactions that the customer has made in a particular interval of time. M indicates the total value of the customer's transactions (Cheng & Chen, 2009). In this paper, a modified RFM model, called RFMP, is introduced. This model also considers the customers' payment details. P indicates the average time interval between the payment date and the purchase date. The customers' payment details are an important attribute because any two customers with the same R, F, and M values but a different P value cannot be treated equally by the enterprise. The RFMP model ensures that the customer segmentation is done objectively. The values of the R, F, M, and P attributes are continuous. These continuous values are normalized to the categorical values very low, low, middle, high, and very high for effective analysis. The data available for customer segmentation are therefore categorical. The data mining clustering technique is widely used to accomplish customer segmentation (Ngai, Xiu, & Chau, 2009).
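As an illustration of the RFMP attributes described above, the following minimal sketch derives R, F, M, and P for each customer from a list of transactions. The record layout (party id, purchase date, amount, payment date), the function name rfmp, and the use of days as the time unit are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from datetime import date


def rfmp(transactions, today):
    """Compute R, F, M, P per customer.

    transactions: iterable of (party_id, purchase_date, amount, payment_date).
    This tuple layout is a hypothetical one chosen for the sketch.
    """
    by_customer = defaultdict(list)
    for party_id, purchased, amount, paid in transactions:
        by_customer[party_id].append((purchased, amount, paid))

    result = {}
    for party_id, rows in by_customer.items():
        # R: days between the most recent purchase and the present
        r = (today - max(purchased for purchased, _, _ in rows)).days
        # F: number of transactions
        f = len(rows)
        # M: total purchase amount
        m = sum(amount for _, amount, _ in rows)
        # P: average number of days between purchase and payment
        p = sum((paid - purchased).days for purchased, _, paid in rows) / len(rows)
        result[party_id] = {"R": r, "F": f, "M": m, "P": p}
    return result


# Hypothetical transactions for two customers
transactions = [
    ("A", date(2023, 1, 1), 100.0, date(2023, 1, 11)),
    ("A", date(2023, 2, 1), 50.0, date(2023, 2, 6)),
    ("B", date(2023, 1, 15), 200.0, date(2023, 1, 15)),
]
values = rfmp(transactions, today=date(2023, 3, 1))
```

Two customers with identical R, F, and M values can still differ in P, which is the distinction the RFMP model is meant to capture.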
The clustering technique is used to segment customers in such a way that the customers in one group behave similarly when compared to the customers in other groups based on their transaction details. The traditional approaches for clustering, such as partitioning and hierarchical algorithms, deal with numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Distance functions, such as Manhattan, Euclidean, and Minkowski, are used for allocating a data point to the appropriate cluster. However, the customer data available for clustering are categorical, so the above procedures are not feasible. The computation of similarity or dissimilarity is essential for categorical data (Chen, Chuang, & Chen, 2008).
Simple matching, co-occurrence, probabilistic and distance hierarchy are the approaches for computing similarity or dissimilarity measures (Cao, Liang, Li, Bai, & Dang, 2011).
Simple matching is a common approach in which the comparison of two categorical values yields either zero or one. The k-Modes, fuzzy k-Modes, and k-prototype algorithms are based on simple matching. These algorithms produce clusters with weak intrasimilarity and have stability problems (Cao, Liang, Li, Bai, & Dang, 2011). ROCK (Robust clustering using links) and CACTUS (Clustering categorical data using summaries) are algorithms based on the co-occurrence approach. ROCK uses the concept of a link to measure the similarity between categorical patterns. Here a link is defined as the number of common neighbors between two patterns (Guha, Rastogi, & Shim, 2000). The limitations of ROCK are that it is sensitive to the threshold value and sometimes does not produce the required number of clusters (Parmar, Wu, & Blackhurst, 2007). CACTUS calculates the frequency of two values appearing in the patterns together. It finds clusters in subsets of all attributes and thus performs subspace clustering. It generalizes the cluster definition of numerical data so that it is suitable for categorical data (Ganti, Gehrke, & Ramakrishnan, 1999). Probabilistic approaches use conditional probability estimation to define relations between clusters. COBWEB, AUTOCLASS, DECA, and COOLCAT are based on probabilistic models and require a long training time. COBWEB uses the category utility function, and AUTOCLASS uses a Bayesian method to derive the probable class distribution. DECA is a discrete valued clustering algorithm, and COOLCAT is an entropy based algorithm. Distance hierarchy associates each link with a weight and requires domain experts to incorporate knowledge (Cao, Liang, Li, Bai, & Dang, 2011). Rough set theory (RST) by Pawlak (1982) has received a great deal of attention for dealing with categorical data in clustering algorithms because it gives stable results and requires no domain expertise. It uses global data properties to establish similarity between the objects (Bean & Kambhampati, 2008).
The existing clustering algorithms based on RST (Parmar et al., 2007; Mazlack, He, Zhu, & Coppock, 2000; Kumar & Tripathy, 2009; Tripathy & Ghosh, 2011a; Tripathy & Ghosh, 2011b; Herawan, Deris, & Abawajy, 2010; Herawan, Ghazali, Yanto, & Deris, 2010) utilize the correlation of attributes. If the attributes are dependent, then the clustering algorithms based on the correlation of attributes produce correct results. The attributes R, F, M, and P chosen for describing the purchasing characteristics of customers are independent; there is no proportional relationship between them. This observation led to the proposal of MADO (Minimum Average Dissimilarity between Objects), a clustering algorithm based on RST. The MADO algorithm calculates the dissimilarity between objects without considering the dependency between attributes.
The rest of the paper is organized as follows: Section 2 discusses the basic concepts of rough set theory. Section 3 summarizes rough set theory based clustering algorithms. Section 4 explains the MADO clustering algorithm. Section 5 compares the clustering results obtained for real customer data and synthetic data. Finally, Section 6 provides concluding remarks.

BACKGROUND
Rough set theory (RST) by Pawlak classifies imprecise, uncertain, or incomplete information or knowledge expressed by data acquired from experience (Pawlak, 1982). It gained importance in the areas of machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, decision support systems, inductive reasoning, and pattern recognition (Pawlak, 1992). It is suitable for processing qualitative information that is difficult to analyze by standard statistical techniques. It manages vague and uncertain data or problems related to information systems (Shyng, Wang, Tzeng, & Wu, 2007).
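An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U$ and $A$ are non-empty finite sets of objects and attributes, respectively, and $V$ is the set containing the domain of each attribute, where $V_a$ denotes the domain of attribute $a \in A$. The function $f: U \times A \rightarrow V$ is a total function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$ and is called the information function. RST is mainly based on the indiscernibility relation, equivalence class, lower approximation, and upper approximation (Pawlak & Skowron, 2007). For a subset of attributes $B \subseteq A$, the indiscernibility relation $Ind(B)$ groups the objects that have the same value for every attribute in $B$ into equivalence classes $[x_i]_{Ind(B)}$, the elementary sets with respect to $B$. For a set of objects $X \subseteq U$, the lower approximation ($X_{LB}$) is the union of all the elementary sets with respect to $B$ that are contained in $X$. The upper approximation ($X_{UB}$) is the union of all the elementary sets with respect to $B$ that have a non-empty intersection with $X$. Eqs. (1) and (2) calculate the lower and upper approximation, respectively.

$$X_{LB} = \bigcup \{ [x_i]_{Ind(B)} \mid [x_i]_{Ind(B)} \subseteq X \} \tag{1}$$

$$X_{UB} = \bigcup \{ [x_i]_{Ind(B)} \mid [x_i]_{Ind(B)} \cap X \neq \emptyset \} \tag{2}$$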
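The definitions of elementary sets and of the lower and upper approximations can be illustrated with a small sketch. The information system is represented here as a dictionary from object id to attribute values; this representation, the function names, and the toy customer data are assumptions made for the example.

```python
from collections import defaultdict


def equivalence_classes(table, attrs):
    """Partition the object ids by their values on the attributes in attrs."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())


def approximations(table, attrs, target):
    """Return the (lower, upper) approximation of the object set `target`."""
    lower, upper = set(), set()
    for eq in equivalence_classes(table, attrs):
        if eq <= target:   # elementary set fully contained in the target
            lower |= eq
        if eq & target:    # elementary set overlapping the target
            upper |= eq
    return lower, upper


# Hypothetical toy information system with two attributes
table = {
    "c1": {"R": "low", "F": "high"},
    "c2": {"R": "low", "F": "high"},
    "c3": {"R": "high", "F": "low"},
    "c4": {"R": "high", "F": "high"},
}
X = {"c1", "c2", "c3"}
lower, upper = approximations(table, ["R"], X)
boundary = upper - lower
```

With B = {R}, the elementary sets are {c1, c2} and {c3, c4}. Only the first lies inside X, so the lower approximation is {c1, c2}, the upper approximation is the whole universe, and the boundary region {c3, c4} is non-empty, which makes X rough with respect to B.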
The lower approximation consists of all objects that definitely belong to the concept, while the upper approximation contains all objects that possibly belong to the concept. The difference between the upper and the lower approximation constitutes the boundary region of the vague concept. Approximations are the two basic operations in rough set theory; thus RST expresses vagueness not by means of membership but by employing the boundary region of a set. If the boundary region of a set is empty, the set is crisp. Otherwise the set is rough (inexact) (Pawlak & Skowron, 2007). The ratio of the cardinality of the lower approximation to the cardinality of the upper approximation is defined as the accuracy of estimation, a measure of roughness (Pawlak, 1992). Many clustering algorithms based on RST have been developed, and they are overviewed in the next section.

RELATED WORK
Mazlack et al. (2000) proposed two techniques based on rough set theory to select clustering attributes. These are the bi-clustering (BC) and total roughness (TR) techniques. BC is applicable only for bi-valued attributes. TR handles multi-valued attributes and is based on the total average of the mean roughness of an attribute with respect to the set of all attributes in an information system. The attribute with the highest TR is chosen as the clustering attribute. However, for partitioning, the method starts with binary valued attributes and uses the total roughness criterion only for multi-valued attributes. Therefore, partitioning is done on a multi-valued attribute only when all the binary valued attributes have already been partitioned. This reduces the efficiency of the algorithm because the partitioning is done on a binary valued attribute even when the total roughness value for the multi-valued attribute is high. The roughness, mean roughness, and total roughness in TR are calculated using Eqs. (3), (4), and (5), respectively.
In Eqs. (3) through (5), X indicates the set of objects in the data set; X_i indicates the set of objects in the sub-partition of attribute i; n indicates the number of sub-partitions of attribute i; and m indicates the number of attributes. In Eq. (3), X_LB and X_UB are calculated using Eqs. (1) and (2), respectively. In Eq. (4), Rough(i) is the mean roughness of all sub-partitions of attribute i. In order to choose the partitioning attribute i, the Total roughness(i) with respect to all the attributes is calculated using Eq. (5).
Parmar et al. (2007) proposed a new technique called min-min roughness (MMR) for multi-valued attributes. The MMR technique is based on the minimum value of mean roughness and does not require the total roughness calculation used in TR. The roughness and mean roughness in MMR are calculated using Eqs. (6) and (7), respectively.
Given $a_i, a_j \in A$ (the set of attributes) with $a_i \neq a_j$, let $V(a_i)$ be the set of values of attribute $a_i$. Let X be the subset of objects having one specific value $\alpha$ for the attribute $a_i$, that is, $X(a_i = \alpha)$. Then $\underline{X}_{a_j}(a_i = \alpha)$ is the lower approximation of X with respect to $\{a_j\}$, and $\overline{X}_{a_j}(a_i = \alpha)$ is the upper approximation of X with respect to $\{a_j\}$. Thus, $R_{a_j}(X)$ is defined as the roughness of X with respect to $\{a_j\}$, which is given by Eq. (6). Also, the mean roughness on attribute $a_i$ with respect to $\{a_j\}$ is defined in Eq. (7). The maximum value of roughness is one because the number of objects in the lower approximation is less than or equal to the number of objects in the upper approximation. The min-roughness (MR) of an attribute is the minimum of its mean roughness values, each taken with respect to a single other attribute. The min-min-roughness (MMR) is defined as the minimum MR of the n attributes. The MMR value determines the splitting attribute. The node with the maximum number of elements is chosen for further splitting. From the experimental analysis in Parmar et al. (2007), MMR achieves better results when compared to BC, TR, fuzzy set based algorithms, and other dissimilarity approaches.
Kumar and Tripathy (2009) proposed MMeR by making a slight alteration in MMR. In their algorithm, the mean-roughness (MeR) of each attribute is calculated as the mean of the mean roughness. Then the min-mean-roughness (MMeR) is defined as the minimum MeR of the n attributes. The MMeR value determines the splitting attribute. The node that has the maximum average distance between its elements is chosen for further splitting.
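The selection of the splitting attribute in MMR can be sketched as follows. Because Eqs. (6) and (7) are not reproduced in this text, the sketch assumes that roughness is taken as one minus the ratio of the lower to the upper approximation size (so that zero means crisp) and that the mean roughness is the plain average over the values of a_i. The function names and the toy table are hypothetical.

```python
from collections import defaultdict


def partition(table, attr):
    """Map each value of attr to the set of objects having that value."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[row[attr]].add(obj)
    return groups


def roughness(table, a_i, value, a_j):
    """Roughness of X(a_i = value) with respect to {a_j} (assumed form: 1 - |lower| / |upper|)."""
    X = partition(table, a_i)[value]
    lower = upper = 0
    for eq in partition(table, a_j).values():
        if eq <= X:
            lower += len(eq)
        if eq & X:
            upper += len(eq)
    return 1 - lower / upper


def mean_roughness(table, a_i, a_j):
    """Average roughness over all values of a_i, with respect to {a_j}."""
    values = partition(table, a_i)
    return sum(roughness(table, a_i, v, a_j) for v in values) / len(values)


def mmr_splitting_attribute(table, attrs):
    """Return the attribute with the minimum min-roughness, plus all MR values."""
    min_rough = {
        a_i: min(mean_roughness(table, a_i, a_j) for a_j in attrs if a_j != a_i)
        for a_i in attrs
    }
    return min(min_rough, key=min_rough.get), min_rough


# Hypothetical data: A and B always vary together, C is unrelated to both
table = {
    "o1": {"A": "x", "B": "p", "C": "u"},
    "o2": {"A": "x", "B": "p", "C": "v"},
    "o3": {"A": "y", "B": "q", "C": "u"},
    "o4": {"A": "y", "B": "q", "C": "v"},
}
best, min_rough = mmr_splitting_attribute(table, ["A", "B", "C"])
```

In this toy table, A and B determine each other, so their min-roughness is zero, while C cannot be approximated by either of them and receives the maximum value of one. MMeR, SDR, and SSDR would replace the inner `min` with a mean or a standard deviation over the same mean roughness values.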
Tripathy and Ghosh (2011a) proposed an algorithm for clustering categorical data where the standard-deviation-roughness (SDR) of each attribute is calculated as the standard deviation of the mean roughness. The attribute with the minimum SDR is chosen as the splitting attribute.
Tripathy and Ghosh (2011b) also proposed an algorithm called SSDR. Here the standard deviation of SDR (SSDR) is calculated, and the splitting attribute is chosen based on this value. The node chosen for further splitting in the SDR and SSDR algorithms is the same as that of the MMeR algorithm. SSDR achieves better results than SDR for small data sets, while for large data sets it achieves the same results as SDR. The roughness used in MMeR, SDR, and SSDR is calculated using the same formula as in MMR. All of the above algorithms (TR, MMR, MMeR, SDR, and SSDR) (Parmar et al., 2007; Mazlack et al., 2000; Kumar & Tripathy, 2009; Tripathy & Ghosh, 2011a; Tripathy & Ghosh, 2011b) have been tested on data sets available in the UCI machine learning repository. The purity of clusters was used as a measure to test the quality of the clusters. The purity ratios of MMR, MMeR, SDR, and SSDR are in increasing order. All these clustering algorithms are based on the roughness of an attribute with respect to other attributes.
Herawan et al. (2010a) proposed a new technique called maximal dependency attributes (MDA) for selecting clustering attributes. Based on MDA, Herawan et al. (2010b) proposed MADE for categorical data clustering. MDA selects the clustering attribute by determining the dependencies between attributes. The attribute with the maximum degree of dependency is selected as the partitioning attribute.
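The degree of dependency used by MDA can be illustrated with a short sketch. It assumes the usual rough set formulation, in which attribute a_i depends on a_j in degree k, where k is the share of objects that fall in the lower approximation, with respect to {a_j}, of some class of a_i. The function names and the toy table are hypothetical.

```python
from collections import defaultdict


def partition(table, attr):
    """Return the equivalence classes (sets of object ids) induced by attr."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[row[attr]].add(obj)
    return list(groups.values())


def dependency_degree(table, a_i, a_j):
    """Degree k in [0, 1] to which a_i depends on a_j."""
    classes_j = partition(table, a_j)
    positive = 0
    for X in partition(table, a_i):
        # size of the lower approximation of X with respect to {a_j}
        positive += sum(len(eq) for eq in classes_j if eq <= X)
    return positive / len(table)


# Hypothetical data: A is fully determined by B and unrelated to C
table = {
    "o1": {"A": "x", "B": "p", "C": "u"},
    "o2": {"A": "x", "B": "p", "C": "v"},
    "o3": {"A": "y", "B": "q", "C": "u"},
    "o4": {"A": "y", "B": "q", "C": "v"},
}
k_ab = dependency_degree(table, "A", "B")
k_ac = dependency_degree(table, "A", "C")
```

Here A depends fully on B (degree one) and not at all on C (degree zero). When every lower approximation is empty, all degrees are zero and a dependency based criterion has nothing to rank.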

PROPOSED ALGORITHM
The calculation of roughness and the determination of dependencies between attributes are not appropriate for real customer data because of the dynamic behavior of the customers. The MADO clustering algorithm overcomes this disadvantage by utilizing the equivalence class property of rough set theory. The splitting attribute is chosen based on the dissimilarity between the objects in the same equivalence class. Let A indicate the set containing the attributes in the data, and let B be another set containing the objects whose value for a particular attribute is the same. An equivalence class is calculated for each attribute with a specific value. In each calculation, the set B contains the objects of that equivalence class. The dissimilarity between the objects within the set B is calculated using Eq. (8). In Eq. (8), x_i and x_j belong to the same equivalence class; n is the number of attributes in A; and V(x_i, x_j) is the number of attributes having the same value for the objects x_i and x_j.
The average dissimilarity between the objects in an equivalence class or set B is calculated using Eq. (9). In Eq. (9), B indicates an equivalence class; m is the number of objects in set B; and x_i and x_j belong to set B. The equivalence class B of an attribute a_i having value v_j is represented as [a_ij]. The set B that has the minimum average dissimilarity and contains at least two objects determines the splitting attribute a_i on its value v_j for the parent node.
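The two quantities behind Eqs. (8) and (9) can be sketched as follows. Since the equations themselves are not reproduced here, the sketch assumes that the dissimilarity is the share of attributes on which two objects differ, (n - V(x_i, x_j)) / n, and that the average is taken over all unordered pairs of objects in B. Both normalizations are assumptions; the function names and data are hypothetical.

```python
from itertools import combinations


def dissimilarity(x, y, attrs):
    """Share of attributes on which x and y differ (assumed form of Eq. (8))."""
    same = sum(1 for a in attrs if x[a] == y[a])  # V(x_i, x_j)
    return (len(attrs) - same) / len(attrs)


def average_dissimilarity(objects, attrs):
    """Mean pairwise dissimilarity within one equivalence class B (assumed form of Eq. (9))."""
    pairs = list(combinations(objects, 2))
    return sum(dissimilarity(x, y, attrs) for x, y in pairs) / len(pairs)


# Hypothetical equivalence class B for R = "low" (three customers)
attrs = ["R", "F", "M", "P"]
B = [
    {"R": "low", "F": "high", "M": "high", "P": "low"},
    {"R": "low", "F": "high", "M": "low", "P": "low"},
    {"R": "low", "F": "low", "M": "low", "P": "high"},
]
avg = average_dissimilarity(B, attrs)
```

The three pairwise dissimilarities are 0.25, 0.75, and 0.5, so the average for this class is 0.5. A class of identical objects would score zero and would be the preferred candidate for the split, provided it contains at least two objects.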
Initially, the parent node contains all the objects. After partition, the leaf node having more objects is selected as the parent node for further partitioning in subsequent iterations. The algorithm terminates when it reaches a pre-defined number of clusters. The procedure for the MADO clustering algorithm is given in Figure 1.

Procedure (U, k)
Begin
    Set current number of clusters CNC = 1
    Set k as the required number of clusters
    Set ParentNode = U
    Do
        For each a_i from A (i = 1 to n, where n is the number of attributes in A)
            For j = 1 to l, where l is the number of different values in a_i
                In the ParentNode, determine the family of equivalence classes for a_i with value j, which is denoted as set B
                Calculate ...
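The procedure described in this section can be sketched in Python as follows. Several details are assumptions made for this sketch rather than steps confirmed by the text: each split is binary (the chosen equivalence class versus the remaining objects of the parent node), a class that covers the whole parent node is skipped because it would not split it, the largest cluster becomes the next parent node, and the dissimilarity is the share of differing attributes averaged over unordered pairs. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def avg_dissim(rows, attrs):
    """Mean pairwise share of differing attributes (assumed form of Eqs. (8) and (9))."""
    pairs = list(combinations(rows, 2))
    total = sum(sum(x[a] != y[a] for a in attrs) / len(attrs) for x, y in pairs)
    return total / len(pairs)


def best_split(node, table, attrs):
    """Find the equivalence class [a_ij] with the minimum average dissimilarity."""
    best = None
    for a in attrs:
        classes = defaultdict(list)
        for obj in node:
            classes[table[obj][a]].append(obj)
        for value, members in classes.items():
            # need at least two objects, and the class must actually split the node
            if len(members) < 2 or len(members) == len(node):
                continue
            score = avg_dissim([table[o] for o in members], attrs)
            if best is None or score < best[0]:
                best = (score, a, value, members)
    return best


def mado(table, attrs, k):
    """Split the largest cluster repeatedly until k clusters exist (or no split is possible)."""
    clusters = [list(table)]
    while len(clusters) < k:
        parent = max(clusters, key=len)
        split = best_split(parent, table, attrs)
        if split is None:
            break
        _, _, _, members = split
        rest = [o for o in parent if o not in members]
        clusters.remove(parent)
        clusters += [members, rest]
    return clusters


# Hypothetical customers described by two categorical attributes
table = {
    "c1": {"R": "low", "F": "low"},
    "c2": {"R": "low", "F": "low"},
    "c3": {"R": "low", "F": "low"},
    "c4": {"R": "high", "F": "high"},
    "c5": {"R": "high", "F": "low"},
    "c6": {"R": "high", "F": "low"},
}
clusters = mado(table, ["R", "F"], k=2)
```

In this example the class R = low contains three identical customers, so its average dissimilarity is zero and it determines the first split, leaving c4, c5, and c6 in the other cluster.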

EXPERIMENTAL RESULTS
In this section, real data sets of customer transaction details are used for clustering or segmenting the customers. Customer transaction details for a period of six months have been collected from four different enterprises. Data set 1 consists of 47891 records; data set 2 consists of 29790 records; data set 3 consists of 34035 records; and data set 4 consists of 24191 records. For each transaction, the party id, date of purchase, amount of purchase, and date of payment are used to define the R, F, M, and P values. Each distinct party id represents an individual customer. For each distinct party id, R is calculated as the interval between the customer's most recent purchase and the present; F is calculated as the number of transaction records; M is calculated as the total purchase amount; and P is calculated as the average time interval (in days) between the payment date and the purchase date over the customer's transactions in the data set. The data set now has only four attributes, namely R, F, M, and P, for each customer. Data set 1 has 5062 customers; data set 2 has 1420 customers; data set 3 has 3811 customers; and data set 4 has 2675 customers. The values of R, F, M, and P are normalized as given below.
For normalizing R or P:
1) Sort the data set in descending order of R or P.
2) Divide the data set into five equal parts with 20% of the records in each.
3) Assign the categorical values very low, low, middle, high, and very high to the first, second, third, fourth, and fifth parts of the records, respectively.
For normalizing F or M:
1) Sort the data set in ascending order of F or M.
2) Divide the data set into five equal parts with 20% of the records in each.
3) Assign the categorical values very low, low, middle, high, and very high to the first, second, third, fourth, and fifth parts of the records, respectively.
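The normalization steps above can be sketched as a rank based binning into five equal parts. The function name quintile_labels and the handling of ties (equal values keep their sort order and may therefore land in adjacent parts) are assumptions made for this sketch.

```python
LABELS = ["very low", "low", "middle", "high", "very high"]


def quintile_labels(values, descending=False):
    """Assign one of five labels to each value, 20% of the records per label.

    descending=True corresponds to the rule for R and P (largest value gets
    "very low"); descending=False corresponds to the rule for F and M.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = LABELS[rank * 5 // len(values)]
    return labels


# Hypothetical frequency and recency values for ten customers
frequency = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
recency = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
f_labels = quintile_labels(frequency)                # ascending order for F or M
r_labels = quintile_labels(recency, descending=True)  # descending order for R or P
```

The same raw number therefore maps to opposite labels for the two groups of attributes: a customer with the largest F is labeled very high, whereas a customer with the largest R (the least recent purchase) is labeled very low.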
The normalized data set is now used by the MADO clustering algorithm to segment the customers into various groups. The MMR, MMeR, SDR, SSDR, and MADE algorithms are applied to the same four data sets so that the customers are divided into various groups. Cohesion and coupling are the criteria used to measure the internal quality of the clusters (Santos, Heuser, Moreira, & Wives, 2011). Cohesion expresses the average similarity between the elements of a cluster. Coupling expresses the average similarity between all pairs of elements, where one element belongs to cluster C and the other does not. Ideally, cohesion should be high and coupling should be low (Kunz & Black, 1995). The formulas for cohesion and coupling for a cluster C are given by Eqs. (10) and (11), respectively. The total cohesion and total coupling of clusters are given by Eqs. (12) and (13), respectively.
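The cohesion and coupling criteria can be sketched as follows. Since Eqs. (10) through (13) are not reproduced here, the sketch makes three assumptions: the similarity score is the share of attributes with equal values, a cluster with a single element is given a cohesion of zero, and the total values are averages over the k clusters. The function names and the example clusters are hypothetical.

```python
from itertools import combinations


def similarity(x, y):
    """Share of attributes with equal values (assumed similarity score)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def cohesion(cluster):
    """Average similarity over all pairs of elements inside the cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0  # assumption: a singleton cluster contributes no cohesion
    return sum(similarity(x, y) for x, y in pairs) / len(pairs)


def coupling(cluster, others):
    """Average similarity between elements of the cluster and elements outside it."""
    if not cluster or not others:
        return 0.0
    total = sum(similarity(c, q) for c in cluster for q in others)
    return total / (len(cluster) * len(others))


def totals(clusters):
    """Total cohesion and total coupling, averaged over the k clusters (assumed)."""
    k = len(clusters)
    total_cohesion = sum(cohesion(c) for c in clusters) / k
    total_coupling = sum(
        coupling(c, [q for d in clusters if d is not c for q in d]) for c in clusters
    ) / k
    return total_cohesion, total_coupling


# Hypothetical clustering of four customers described by (R, F)
clusters = [
    [("low", "low"), ("low", "low")],
    [("high", "high"), ("high", "low")],
]
total_cohesion, total_coupling = totals(clusters)
```

For this example the first cluster is perfectly cohesive (1.0) and the second scores 0.5, giving a total cohesion of 0.75, while the total coupling is 0.25.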
In Eqs. (10) through (13), sim(c_i, c_j) is the similarity score between elements c_i and c_j belonging to cluster C; sim(c_i, q_j) is the similarity between element c_i from cluster C and element q_j from another cluster; m is the number of elements in C; n is the number of elements outside C; and k is the number of clusters. The results of the MMR, MMeR, SDR, SSDR, MADE, and MADO algorithms are compared by varying the number of clusters to be produced from three to seven. In each case, the total cohesion and the total coupling of the clusters are calculated using Eqs. (12) and (13). The results for the four data sets in all five cases produced by the clustering algorithms are tabulated in Tables 1 through 8. The MADE algorithm produces the value zero for the degree of dependency calculated for each attribute. This is because the lower approximation of each attribute for each value is a null set. Therefore this algorithm could not be applied further to produce a clustering result for the considered data sets. The results of SDR and SSDR are the same because, for large data sets, SSDR achieves the same result as SDR (Tripathy & Ghosh, 2011b).
In the synthetic data sets obtained from the repository, the class label is available in the data set, and so purity is used as a measure to test the quality of the clusters. The purity of a cluster i is defined by Eq. (14). The overall purity is defined by Eq. (15).
In Eqs. (14) and (15), n_i^r indicates the number of objects occurring in cluster i and its corresponding class r; n_r indicates the number of objects in the class r; and n indicates the number of clusters in the data set. A higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 1. In the zoo data set, each object is classified as belonging to one of the 7 classes. The MMR, MMeR, SDR, SSDR, and proposed MADO clustering algorithms are executed to obtain 7 clusters. The purity of each cluster is calculated using Eq. (14), and the overall purity is calculated using Eq. (15). The results obtained from the MMR, SSDR, and MADO algorithms are given in Tables 9, 10, and 11, respectively. From Table 9, it is observed that out of 101 objects, 92 objects are classified correctly. From Table 10, it is observed that out of 101 objects, 79 objects are classified correctly. From Table 11, it is observed that out of 101 objects, 95 objects are classified correctly. Thus the overall purity of the clusters obtained using MMR, SSDR, and the proposed algorithm is 0.91, 0.9079, and 0.9406, respectively. The overall purity of the clusters obtained using MMeR and SDR is 0.902 and 0.9079, respectively (Tripathy & Ghosh, 2011b). Thus, the proposed clustering algorithm produces good clusters for a high dimensional synthetic data set.
In the zoo data set, the MMR algorithm performs better when compared to the MMeR, SDR, and SSDR algorithms.
Next, the MMR and proposed clustering algorithms are used for the soybean data set. In the soybean data set, each object is classified as belonging to one of the 4 diseases or 4 classes. The MMR and proposed clustering algorithms are executed to obtain 4 clusters. The purity of each cluster is calculated using Eq. (14). The results obtained from the MMR and the proposed algorithm are given in Tables 12 and 13, respectively. From Table 12, it is observed that out of 47 objects, 39 objects are classified correctly. From Table 13, it is observed that out of 47 objects, 46 objects are classified correctly. Thus the overall purity of the clusters obtained using the MMR and the proposed algorithm is 0.8298 and 0.9787, respectively. The results obtained for the synthetic data sets show that the proposed algorithm produces improved clustering results in terms of purity when compared to other clustering algorithms. Therefore, the proposed clustering algorithm can cluster data in a better way irrespective of the number of attributes considered.
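The overall purity values quoted above for MMR and the proposed algorithm are consistent with dividing the number of correctly classified objects by the total number of objects. The sketch below computes overall purity in that way, assuming that the class corresponding to a cluster is its majority class; this reading, the function name, and the example labels are assumptions.

```python
from collections import Counter


def overall_purity(clusters):
    """Share of objects that fall in the majority class of their cluster.

    clusters: list of lists, each holding the known class labels of the
    objects assigned to one cluster.
    """
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    total = sum(len(labels) for labels in clusters)
    return correct / total


# Hypothetical clustering of nine labeled objects into three clusters
clusters = [["a", "a", "b"], ["b", "b"], ["c", "c", "c", "a"]]
purity = overall_purity(clusters)  # 7 of 9 objects are in their majority class

# Ratios of correctly classified objects reported in the text
zoo_mado = round(95 / 101, 4)
soybean_mmr = round(39 / 47, 4)
soybean_mado = round(46 / 47, 4)
```

The three ratios evaluate to 0.9406, 0.8298, and 0.9787, matching the overall purity values reported for the proposed algorithm on the zoo data set and for MMR and the proposed algorithm on the soybean data set.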

CONCLUSION
Customer segmentation for CRM is achieved using a data mining clustering technique. This technique is tested on four different real data sets. Data related to customers are categorical in nature, so the usual hierarchical and partitioning algorithms are not applicable. The importance of rough set theory for categorical clustering is discussed, and the algorithms based on this technique are studied. The proposed MADO algorithm considers the dissimilarity between objects in the same equivalence class. The cluster quality for a real data set is measured using cohesion and coupling. The experimental results for the four customer data sets considered show that the MADO algorithm produces clusters with high cohesion and low coupling values in all four cases with respect to the other algorithms. The applicability of the MADO algorithm to high dimensional data sets has been demonstrated by testing it with synthetic data sets. The experimental results show that the MADO algorithm clusters high dimensional data in a better way when compared to the other clustering algorithms. In the future, the behavior or characteristics of customers in each segment can be analyzed so that marketers can employ different strategies for each group. Furthermore, the life time value of a customer can be increased by adopting suitable techniques for each customer segment.
        Min-Avg-Dissim = Min (avgdissim (B)) for each |B| > 1
        Determine splitting attribute a_i on its value v_j (a_ij) corresponding to the Min-Avg-Dissim
        ...
            Size (i) = Count (Set of Elements in Cluster i)
        Next
        Determine Max (Size (i))
        Return (Set of Elements in cluster i) corresponding to Max (Size (i))
End
Figure 1. MADO algorithm

The equivalence class ([x_i]_Ind(B)) is the set of objects having the same values as x_i for the set of attributes in B. This is also known as an elementary set with respect to B.

Table 1. Total

Tables 1 through 8 illustrate that the clusters produced by the MADO algorithm have, on average, high cohesion and low coupling values in all five cases with respect to the other algorithms. Because the performance of a clustering algorithm depends on producing clusters with high cohesion and low coupling values, the proposed algorithm outperforms the other rough set based clustering algorithms for the four real customer data sets considered. The reason is that the proposed algorithm considers the dissimilarity between the objects in the same equivalence class when choosing the splitting attribute instead of finding the correlation between attributes as the other rough set based clustering algorithms do. The proposed clustering algorithm is also applicable to high dimensional data. It has been tested on the soybean and zoo data sets obtained from the UCI Machine Learning Repository. The soybean data set contains 47 objects with 35 categorical attributes. The zoo data set contains 101 objects with 18 categorical attributes. The purity of clusters is used to test the quality of the clusters if the class label is already known. In a real data set, because the class label is not known, cohesion and coupling are used to test the quality of the clusters.

Table 9. MMR output for the zoo data set

Table 10. SSDR output for the zoo data set

Table 12. MMR output for the soybean data set

Table 13. MADO output for the soybean data set