A Triad Percolation Method for Detecting Communities in Social Networks

For the purpose of detecting communities in social networks, a triad percolation method is ­proposed,­which­first­locates­all­close-triads­and­open-triads­from­a­social­network,­then­a­ ­specified­close-triad­or­open-triad­is­selected­as­the­seed­to­expand­by­utilizing­the­triad­ ­percolation­method,­such­that­a­community­is­found­when­this­expanding­process­meet­a­ ­particular­threshold.­This­approach­can­efficiently­detect­communities­not­only­from­a­densely­ social­network,­but­also­from­the­sparsely­one.­Experimental­results­performing­on­real-world­ social­benchmark­networks­and­artificially­simulated­networks­give­a­satisfactory­­correspondence.


Introduction
Many real-world systems in nature and society can be described as complex networks or graphs (Cui, Wang & Li 2014;Mu et al. 2014;Shen et al. 2009;Zhang & Wang 2015), such as the Internet, World Wide Web, social networks, biological networks, in which the nodes denote the entities and the interactions between entities can be described as edges. Community structure or clustering structure is one of the most common features of complex networks, where a group of nodes exhibit dense connections, meanwhile exhibiting comparatively sparse connections with the rest of the network (Ahn et al. 2009;Mu et al. 2014;Shen et al. 2009;Zhang & Wang 2015). Understanding community structure of social networks is critical due to its broad applications such as friend recommendations, user modeling and content personalization. However, the quantitative definition of community has been widely discussed by scholars from different areas. Until now, there is still no well-accepted quantitative definition (Liu et al. 2013). And there is no unique and widely accepted goodness measure of community quality in literature. In most cases, the community structure is often dependent on the specific application scenario. For instance, communities in a social network represent the groups of people with similar interests and talking topics; communities in biological networks stand for the modules of biological tissues with similar functions; communities in a protein network elaborate the a set of proteins with similar interaction function; communities in a web network are considered to be clusters of web documents with related topics (Liu et al. 2013;Zhao et al. 2018). Informally, a good community is a densely-connected group of nodes that is sparsely connected to the rest of the network. Extensive research has been devoted to designing community detection algorithms to uncover communities with the goal to minimize or maximize structural metrics, such as modularity, triangle participation ratio, or conductance of the discovered communities (Wagenseller & Wang 2018).
Over the past decade community detection has been applied to many real-world areas such as biological networks, protein networks, web graphs, VLSI design, social networks, and task scheduling. According to the motivations, application scenarios, the criterion of whether to allow overlapping and technical principles that existing methods adopted, the representative inspirations could be divided into two big classes: overlapping community detection and disjoint community detection (Leskovec, Lang & Mahoney 2010;Xie, Kelley & Szymanski 2013;Xu et al. 2016). Disjoint community detection focuses on the division of the boundaries between communities, and the sole ownership of each node. Overlapping community detection focuses on the distribution of the membership of each node, and the entire whole network structure. Zhang, Z, et al. 2018. A Triad Percolation Method for Detecting Communities in Social Networks. Data Science Journal, 17: 30, pp. 1-12. DOI: https://doi.org/10.5334/dsj-2018 Disjoint community detection was reviewed and categorized into five research lines, researchers and scholars used numerous techniques such as spectral clustering, modularity maximization, random walks and statistical mechanics to discover a community as a set of nodes that has more links between its members than with the remainder of the network (Leskovec, Lang & Mahoney 2010). Overlapping community detection are reviewed and categorized into five categories: Clique Percolation, Line Graph and Link Partitioning, Local Expansion and Optimization, Fuzzy Detection, Agent-Based and Dynamical Algorithms (Xie, Kelley & Szymanski 2013), which are investigated based on the consensus that people in a social network are naturally characterized by multiple community memberships (Xu et al. 2016).
However, the real-world networks, especially online social networks such as Sina, Twitter, and Facebook, are highly sparse, and communities in those networks are often small, comparing dozens or even hundreds of members (Han & Tang 2015;Leskovec et al. 2008). Intuitively, the latent members usually centered on a hub-node, which often are opinion leaders or activities, but the links between members are relatively seldom, and hub-nodes often form three tuples with other members, which all lead to sparse network topological structure and community distribution. Therefore, the traditional clique-based community detection approaches cannot work efficiently.
It is important to note that different community detection algorithms often tend to perform significantly well or poorly on certain kinds of networks. Moreover, one might prefer to choose specific algorithm with respect to the target application or the characteristics of a network that to be analyzed. Thus, it is necessary to analyze the real situations and application scenarios (Zhao et al. 2018). Thereby, without considering the individual nodes and edges, each node inhabits a triad (detailed in section 3.1). Inspired by previous related studies, an efficient and functional algorithm TPM (Triad Percolation Method), a method like CPM (Palla et al. 2005), is proposed. This method is more universal than the approach presented in Ref. (Fagnan, Zaïane & Barbosa 2014), which only considers the close-triad (3-clique).
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces motivations and outlines the algorithm TPM. In section 4 purpose experimental results are reported and the final section offers concluding remarks and sheds light on future research directions.

Related work
In the last decade, community detection can be claimed as a remarkable achievement in theory and in practical, especially in the field of social sciences computing. Many community detection algorithms have been developed. We organize a brief discussion on related works into two parts: the first is devoted to common node-based and link based community detection algorithms, and the second concerns the methods related to the idea of this article.
The node-based and link-based community detection methods are more widely used in detecting communities from networks. The node-based methods (Conrad 2010;Cui, Wang & Li 2014;Fagnan, Zaïane & Barbosa 2014;Lancichinetti, Fortunato & Kertész 2009;Lusseau et al. 2003;McAuley & Leskovec 2012;Mu et al. 2014;Newman 2006;Palla et al. 2005;Shen et al. 2009;Wang 2011;Zhang & Wang 2015;Zachary 1977) directly split the node set of a network into different partitions. Many excellent approaches for that purpose, such as CPM (Palla et al. 2005), GCE (Conrad et al. 2010), SLPA (Xie, Szymanski & Liu 2011), are proposed. The local optimization strategy is adopted by many of those algorithms. Besides, such algorithms like EAGLE (Shen et al. 2009), LFM (Lancichinetti, Fortunato & Kertész 2009), not only can discover overlapping communities, but also can identify the hierarchical organizations from a social network. As for the main idea of link-based techniques is taking advantage of the link information rather than node information to detect communities as following described (Ahn, Bagrow & Lehmann 2010;Shi et al. 2013;Ye et al. 2011). First, the link-based algorithms cluster on links of an original network. Then, link communities are mapped into node communities by gathering nodes that incident to all links within each link community (Ahn, Bagrow & Lehmann 2010;Shi et al. 2013;Ye et al. 2011), which just like the node-based community detection progress mentioned above.
In past years, a lot of approaches, employing different types of heuristics and a wide variety of criteria to optimize, have been proposed. Detailed surveys describing these methods can be found in (Aggarwal & Subbian 2014;Amelio & Pizzuti 2014;Fortunato 2010;Fortunato & Castellano 2012;Peel, Larremore & Clauset 2017;Pizzuti 2018). Reichardt and Bornholdt (2006) developed an approach based on finding the minimum-energy state of an infinite-range Potts spin glass, authors have proved an equivalence between minimizing the Hamiltonian and finding the maximum modularity partition of a network, and next to Reichardt and Bornholdt (2006), Delvenne et al. (2010) introduced a measure called clustering stability which generalizes modularity, normalized cut objective and Fiedler's spectral clustering approach for certain values of input parameters (Delvenne, Yaliraki & Barahona 2010;Veldt, Gleich & Wirth 2018).
Significantly, Palla et al. (2005) proposed a Clique Percolation Method that identifies communities by rolling a k-clique around the network until it is unable to reach any unexplored nodes (Fagnan, Zaïane & Barbosa 2014;Palla et al. 2005). As all we know, the k-clique rarely occurs in online social networks, such as Sina, Facebook and Twitter, hence the CPM does not effectively identify communities from sparsely online social networks. Correspondingly, Fagnan et al. (2014) attempted to remedy the problem that traditional methods cannot be extended to large scale datasets by using triads and the T metric they proposed, and the method they proposed contains two steps: 1) First, initial communities are detected by applying T metric within a local community detection framework; 2) Then, an additional measurement is employed to identify outliers/hubs in the discovered initial communities. However, this triad-based method presented in Ref. (Fagnan, Zaïane & Barbosa 2014) aim at the situation that communities are tightly-knit groups of nodes that interact more frequently within the group than outside of the group, and performs badly in the sparse online social networks. Irrespective of single nodes or single edges of a network, each node must be contained in a triad, close-triad or open-triad. Thereby, we put forward the Triad Percolation Method by combining the above problems for community detection in large scale sparse online social networks, which combines the motivations of CPM (Palla et al. 2005) and the idea presented in Ref. (Fagnan, Zaïane & Barbosa 2014).

Motivations of TPM
As we all know that the real-world social networks have a sparse feature and exponential-law degree distribution. Hence, the k-cliques (Palla et al. 2005) or maximal cliques (Bron & Kerbosch 1973) are not extensively resides in the real situations. Irrespective of the single nodes or single edges of a network, each node must be contained in a triad, close-triad or open-triad. The extensive researches suggest that close-triads are often located in the center of a community, and open-triads are usually situated on the border of that community (Palla et al. 2005).
For the purpose of detecting communities from social networks, first find out all of the triads, then an initial community is formed by expanding a seed close-triad or an open-triad by adopting the strategy that similar to the CPM (Palla et al. 2005). Eventually, we also iteratively extend an initial community to a newer, larger community. A community is detected until there is no any initial community meets the specific requirement detailed in formula (4).
Under the premise that two communities share common nodes and one community contains relatively less nodes, while another community possesses more nodes, thereby these two communities will possess strong coupling strength due to the small community is usually sub-community of the bigger community (Wang 2011;Zhang & Wang 2015). Accordingly, the two communities will merge into a newer, larger community. Formula (1) and (2) therefore are adopted (Wang 2011).
Where N(C i ) and |N(C i )| denote the node set and the amount of nodes for community C i , respectively. The more links that connect all nodes of two communities with bigger community clustering coefficient (Cui, Wang & Li 2014), the higher probability they belong to the same community, thus they will be incorporated into a larger community, which makes the formula (3) Where K ij indicates the total amounts of nodes for community C i and C j , number of links that connect K ij nodes is denoted by E ij , and these K ij nodes will have K ij (K ij -1)/2 links as a complete graph while they have E ij links in fact. During the emerging process, we need to select a pair of similar communities which belonging coefficient (detailed in formula (4)) meets the requirement of greater than a specific threshold α, where the belonging coefficient for evaluating the degree that C j belongs to C i . The TPM is a heuristic algorithm, which lacking of global community partition, so we calculate the Harmonic mean value instead of calculating the arithmetic mean value to measure whether a community C j belongs to another C i .

The algorithm of TPM
Inspired by the empirical analysis and observations mentioned above, the algorithm TPM is summarized as follows Table 1.

TPM time-complexity analysis
The algorithm time-complexity is now analyzed, assuming n as the total number of nodes of a social network G, m and s as the amount of close-triads and the number of open-triads, respectively. In steps 1 and 2, we simply iterating the each node in social network G, the O(n 2 ) operations are needed. As for step 5 to 28, the main manipulations are query and match, so when we select a triad as a seed community to expand, there needs O(m 2 ) query and match operations for a close-triad expanding, and O(s 2 ) corresponding manipulations for an open-triad seed expanding. Noticing the fact that many real-world social networks have the sparse

comm_lst.append(itm_opn)
41. Convert the items in comm_lst into the corresponding node set // initial communities generation end here 42. A pair of initial communities with, and are selected from comm_lst, merged into a newer, larger community. Repeat this process until there is no any pair of communities meet the requirement of. BC (C i , C k ) > α.

Output the result communities in comm_lst
characteristic, hence the variables m, n and s meet the condition of m ≤ s ≤ n that this process requires O(n 2 ) at most. Come to step 31, this conversion obtains O(n) operations. The merge manipulation of step 32 is an iterative process, which demands O(n 2 ) operations at most. In summary, the overall time-complexity of algorithm TPM is O(n 2 ), a time-consuming approach, and its performance will be improved in our future work.

The estimation of belonging coefficient threshold α
In order to making TPM adapt to different social network settings, we also proposed an approach to estimate the belonging coefficient threshold. Suppose LSP = {v 1 , v 2 , … , v k } is the largest shortest path of social network G, we get the average clustering coefficient LACC of nodes within the LSP, i.e.
. And we also assume that ACC illustrates the average clustering coefficient of social network G. Finally, we use the harmonic mean value hmcc of LACC and ACC as the estimation of belonging coefficient threshold α, i.e. hmcc = (2*LACC*ACC)/(LACC + ACC). The reason why we use the strategy mentioned above is motivated by the idea of formula (1) to (4), which are based on local aggregation and expansion. Although this method may not provide a pretty accurate threshold, it can furnish an approximate reference value automatically.
Above all, the calculated reference threshold serves as a benchmark for fine-tuning, we can also run the application of TPM more than once, and we set the tuned threshold that maximizes the following modularity Q as the final threshold α (Newman 2006).

Experiments and analysis
In this section, the performance of TPM is evaluated on four real-world social networks and an artificially simulated network, Zachary Karate Club benchmark network (Zachary 1977), Bottlenose Dolphins network (Lusseau et al. 2003), Facebook ego network (McAuley & Leskovec 2012), the Cora 1 citation network, and the LFR (Lancichinetti, Fortunato & Radicchi 2008) benchmark network, their statistical properties and specific settings are listed as following Table 2. Although the real social networks Karate and Dolphins are very small, they can serve as the benchmark networks due to their community structure was known beforehand, and so as to be able to evaluate the accuracy of the tested algorithm.

Zachary's Karate Club network
The Zachary's Karate Club network (Zachary 1977), the first real-world social network tested was widely used as a benchmark for evaluating the performance of community detection algorithms. This network is an un-weighted network with 34 members of a karate club as nodes and 78 links representing friendships among members of the club. However, due to the conflict between the club administrator and the club main instructor, the members were split into two different groups centering on the administrator and the instructor, respectively. As shown in Figure 2, the community organization is plotted, where node 1 and 33 represent the administrator and instructor, respectively. Here, we set the belonging coefficient threshold as α = 0.35 for TPM. Then, this network was naturally partitioned into 2 communities, which are in accord with the actual situation, as shown in Figure 2(a). Three overlaps, node 3, 9 and 10, were extracted. For node 10, it has only two edges, which connect two different communities, and the probability that they belong to each community are equal, hence node 10 as an overlap is reasonable. As for node 9, it forms three close-triads, which are {9, 3, 1}, {9,33,34} and {9,31,33}. Then,{9,33,34} and {9,31, 33} belong to same community, but {9, 3, 1} belongs to another. However, so strong strength of triangle stability that makes node 9 constructs a stable structure. Therefore, node 9 plays a role of overlap is reasonable, and node 3 obtained the similar situation. The performance is better than CPM (Palla et al. 2005) and GCE (Conrad et al. 2010).
However, the communities discovered by CPM (k = 3), just as shown in Figure 2(b), are not meet the real situation that two communities centered by node 1 and 33 respectively. Besides, the node 10 and 12, which also can be called as 'lost guys', are not contained into the detected communities. Noticing that the number of 'lost guys' will be more as the increasing of clique size k for CPM (Palla et al. 2005), especially for large, sparse social networks. But, the proposed algorithm TMP will be better able to handle those cases.

The Bottlenose Dolphins network
The second real-world benchmark network examined was the Bottlenose Dolphin Network (Lusseau et al. 2003), the 62 nodes are the bottlenose dolphins (genus tursiops) of a bottlenose dolphin community living off Doubtful Sound, a fjord in New Zealand (spelled fiord in New Zealand). An edge indicates a frequent association. The dolphins were observed between 1994 and 2001. This network is an un-weighted and undirected network with 62 nodes and 159 links.
For this case, we assigned the belonging coefficient threshold α = 0.25. As shown in Figure 3(a), the TPM experimental correspondence is almost perfect, the community partitions and overlapping nodes are reasonable and accord with the real situation. Hoverer, the communities detected by CPM (k = 3), as shown in Figure 3(b), is just passable, which not only inconsistent with the actual situation, but also produced many 'lost guys', such as node 4, 11, 12, and 58. And that situation will be more serious as the increasing of clique size k for CPM. But the proposed TPM will works well for such sparse networks.

The Facebook network
This section introduces a dataset of ego-network from Facebook. This dataset adopt from Ref. (McAuley & Leskovec 2012), users in Facebook identifying friends sharing a common attribute. Generally, there are two useful sources of data that help with locating those common attribute groups. We expect that communities are formed by densely-connected sets of users. In other words, the goal is to find nested as well as overlapping communities in this network (McAuley & Leskovec 2012).
As for this Facebook network setting, the belonging coefficient threshold was set to α = 0.16. During the experimental process, we also fine-tuned this parameter by step length of 0.05, and we acquired different community divisions, which demonstrated the clustering situation of network in different granularity. Through the empirical analysis, a relative good community partition for this Facebook network is cleared in Figure 4(a), which brought the network's distinctive community structure to light. Meanwhile, except for the 'lost guys', the CPM (k = 3) also got a good network partition. However, the CPM is hard to determine the clique size, and that also lost many nodes inevitably, especially for the large, sparse real-world social networks.

The Cora citation network
The Cora 1 dataset consists of 2708 scientific publications (nodes) classified into one of seven classes and 5429 links. The Cora dataset consists of Machine Learning papers. These papers are classified into one of the following seven classes: case_based, genetic_algorithms, neural_networks, probabilistic_methods, rein-forcement_learning, rule_learning and theory. The papers were selected in a way such that in the final corpus every paper cites or is cited by at least one other paper.
However, due to the large number of nodes and edges in Cora 1 network and the limited paper space, the specific community divisions are not plotted as shown in Figures 2, 3 and 4. Thus, we only described the statistics of the communities detected by the TPM and CPM, respectively. The CPM as a comparative method, the size k of clique is set to 3, 4 and 5, respectively. (1) When the value of k is 3, the CPM obtained 293 communities with 1469 nodes, and the remained 1239 nodes are not divided into corresponding  communities; (2) when the k is set to 4, the CPM obtained 89 communities with 393 nodes, and the remained 2315 nodes are not included in corresponding communities; (3) when the value k is 5, the CPM only acquired 7 communities with only 36 nodes. But, the TPM also got 300 communities with 2708 nodes, that is, all nodes are classified into corresponding communities. The reason for the above correspondences is that the cliques used in the CPM are closed structures, there are edges between all nodes, and seldom in online social networks, thus only few nodes can be clustered by the CPM. Fortunately, thanks to the TPM utilized the triads, which not only contains the close-triads similar to the 3-clique, but also contains the open-triads, thus combined with the statements in section 3.2, the TPM can classified all nodes into corresponding communities and more suitable for community discovery in sparse networks.

LFR benchmark network
This section compares the performance of our algorithm with that of the algorithm CPM (Palla et al. 2005), LFM (Lancichinetti, Fortunato & Kertész 2009) and CNM (Clauset, Newman & Moore 2004). The comparative experiments were run on the benchmark networks generated by the LFR software (Lancichinetti, Fortunato & Radicchi 2008). The LFR software, released by Lancichinetti and his colleagues (2008), can generate artificial simulation networks with benchmark community and corresponding statistical information, which has been widely applied in the field of community detection algorithm evaluation. However, the parameters of LFR employed in this article were set as follows: number of nodes N = 500,000, average degree k = 500, maximum degree maxk = 1000, minimum community size minc = 1000, maximum community size maxc = 1000, and the mixing parameter mu from 0.1 to 0.9 was adjusted, where the mixing parameter is not the average ratio of external degree/total degree (as it used to be) but the maximum (or the minimum) of that distribution. Considerably, the greater the value of mu, the harder it is to detect corresponding communities. Thus, nine types of benchmark networks were obtained, each type generating ten networks by using the same parameters and averaging the test results. Meanwhile, we take advantage of NMI (Normalized Mutual Information) as a criterion to measure the performance of these four algorithms (Lancichinetti, Fortunato & Kertész 2009), and the NMI is widely used to evaluate the difference between the results of community detection and the benchmark community division in the field of network analysis. The experimental consequences, as shown in Figure 5, indicate that TPM performance is just slightly lower than the other three algorithms when the mu is less-than-or-equal to 0.5. However, when mu is greater Figure 5: Comparative experiments on benchmark networks generated by LFR software. than 0.5, the algorithm TPM performs better than the remains, even though the performance decreases with the mixing parameter mu increasing, which is due to the fact that TPM can cluster many nodes with the degree of 1, and the remaining algorithms do not have the above advantages.

Conclusions and future directions
In this paper, a triad-based community detection algorithm TPM was proposed. First, all triads which include close-triads and open-triads were found. A seed close-triad or open-triad was selected to expand to a basic community. Finally, the two communities with belonging coefficient greater than α were merged into a newer community, until there is no any pair of communities with higher belonging coefficient than α. The experimental results examined on different real-world networks and artificially simulated benchmark network indicate that our method is useful and effective.
The biggest advantage of the algorithm TPM is that it not only discovers communities from networks with typical community structures, but also detects communities from sparse networks with many open triads. We can obtain different granularity community structure by adjusting the belonging coefficient α, and so as to achieve the purpose of analyzing the network topology from diverse aspects. Meanwhile, due to the TPM is based on the idea of triads (triples), it can be easily implemented on a distributed graph processing platform for large scale network, such as GraphX on the Spark platform.
However, the drawback of the algorithm TPM is its large time-complexity O(n 2 ), hence there are at least two directions in which could be extended for this work. First, due to the TPM is so extremely time-intensive that additional improvements are needed in future work, one of the approaches we adopted is to implement the TPM by using the GraphX package provided by Spark. Besides, another drawback of TPM is the poor efficiency detection of overlapping communities in large-scale LFR benchmark networks, especially in term of recall, precision and F-value of detected overlaps, however the detailed underlying reasons have been analyzed and corresponding improvements were made in overlapping community detection in our next paper. Finally, an application of the TPM with weighted and directed networks is also needed for future research issues.