Analyzing social networks has received a great deal of attention in the recent literature. Many papers have proposed new techniques for mining social networks to help study this huge amount of data. However, to the best of our knowledge, none of them considered the interests of the users when detecting communities.

Social network sites, such as Facebook, have played a great role in our daily lives due to their ability to link users regardless of their geographical location.

A social network can be represented by a graph, in which users are the nodes and the relations (or friendships) between users are the edges.

A crucial aspect of such networks is the community detection problem.

One way to measure the relatedness of two nodes is to calculate the geodesic distance between them, that is, the number of hops on the shortest path between them. So, if two nodes are directly connected (i.e., they are friends), the geodesic distance between them is one.
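Since the geodesic distance is just a hop count, it can be computed with a plain breadth-first search. The following sketch illustrates the idea; the adjacency-list layout and node names are illustrative, not the paper's data format:

```python
from collections import deque

def geodesic_distance(graph, source, target):
    """Number of hops on the shortest path between two nodes (BFS)."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return float("inf")  # no path between the two nodes

# Two directly connected nodes (friends) are one hop apart.
friends = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
print(geodesic_distance(friends, "a", "b"))  # 1
print(geodesic_distance(friends, "b", "d"))  # 3
```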

While the geodesic distance provides a relatively good measure of relatedness, it is not adequate to fully capture how well one node knows another. To further enhance this measure, the interests of each node should also be taken into consideration to give a more accurate estimate of the relatedness between the nodes. For instance, two directly connected nodes that share many interests are more strongly related than two directly connected nodes with no interests in common.

The contribution of this paper is a new algorithm that detects communities in a given social network based on both the geodesic distance between nodes and the similarity of their interests.

The rest of the paper is organized as follows. Section II reviews related work on community detection. A brief description of the MapReduce model is given in Section III. In Section IV, the GeoSim algorithm is introduced. Section V presents the parallel version of the algorithm, called GeoSimMR. The experimental results are discussed in Section VI. Finally, we conclude the paper in Section VII.

Many approaches have been proposed to divide a social network into communities. Rosvall and Bergstrom proposed the Infomap method, which finds the partition of a network that most compresses the description of a random walk on it.

Blondel et al. proposed the Louvain method, a greedy agglomerative algorithm that detects communities by iteratively optimizing modularity.

Newman and Girvan proposed a divisive algorithm that repeatedly removes the edges with the highest betweenness, and introduced the modularity measure to assess the quality of the resulting partitions.

Tantipathananandh et al. addressed community identification in dynamic social networks, where the community structure changes over time.

Ghosh and Lerman (

Zhang and Yu (

Altunbey and Alatas (

Qi et al. (

MapReduce is a programming model for processing large data sets in parallel on a cluster of commodity machines. A MapReduce job is expressed as two functions: a map function that transforms input records into intermediate key-value pairs, and a reduce function that aggregates all intermediate values sharing the same key. The framework takes care of splitting the input, scheduling the tasks, shuffling the intermediate data, and recovering from failures.
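As a concrete illustration of the model, the canonical word-count job can be simulated in memory. The sketch below is framework-free Python that mimics the map, shuffle, and reduce steps; it is not real Hadoop code:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)

# In-memory simulation of a word-count job; a real cluster distributes
# the shuffle/sort between the two phases across many machines.
lines = ["social networks", "social graphs", "networks of users"]
shuffled = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        shuffled[key].append(value)
result = dict(reduce_phase(w, c) for w, c in shuffled.items())
print(result["social"], result["networks"])  # 2 2
```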

In this paper, we propose a new algorithm, namely GeoSim, for clustering a social network into communities. First, the core algorithm is discussed; then we present our implementation of the algorithm on the MapReduce framework (GeoSimMR).

Social networks can be represented as a graph G = (V, E), where the set of nodes V represents the users and the set of edges E represents the relationships between them.

The issue here is how to define a metric that measures the relatedness between two nodes in order to cluster the network. Two users that are connected to each other should not necessarily be in the same community. For instance, assume two users are friends yet share no interests at all; grouping them together solely because they are connected would be misleading.

Based on the above, the distance between the nodes and the similarity of their interests both have a great effect on how to cluster a social network. The key idea behind the GeoSim algorithm is that it incorporates both the nodes' interests and their distances from each other. In other words, two nodes that are close to each other in terms of distance but have very different interests have a very low probability of being in the same community. On the other hand, two nodes that have very similar interests but are far away from each other would still have a very low chance of being grouped in the same community. As a result, two nodes will be in the same community if and only if they are close to each other and have similar interests. The GeoSim algorithm achieves this by using a weighted function that gives a score of how close two nodes are to each other.

As mentioned earlier, GeoSim uses both distance and interests to determine the clusters in a social network graph. To measure the closeness of any two nodes, which we call the GeoSim score, equation (1) is used.

Here, GS(n₁, n₂) is the GeoSim score between the two nodes n₁ and n₂, where n₁, n₂ ∈ V; d(n₁, n₂) is the geodesic distance between the two nodes; and Sim(n₁, n₂) is the similarity between the two nodes in terms of their interests. The GeoSim score is a number bounded between zero and one; the lower the score, the closer the two nodes are to each other.

The geodesic distance can be calculated using any shortest path algorithm; in this work, Dijkstra's algorithm is used. As shown in equation (1), the distance between two nodes is divided by the longest shortest path in the graph. This division is a normalization factor that ensures the distance remains between zero and one.
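Equation (1) itself is not reproduced in this excerpt. A plausible weighted form consistent with the description (normalized geodesic distance, interest similarity, a weight α, a score in [0, 1] where lower means closer) might look like the sketch below; the exact combination used by GeoSim may differ, and the α = 0.7 default merely mirrors the value used later in the experiments:

```python
def geosim_score(dist, longest_shortest_path, interest_sim, alpha=0.7):
    """Hypothetical weighted GeoSim score in [0, 1]; lower means closer.

    dist: geodesic distance between the two nodes
    longest_shortest_path: normalization factor (longest shortest path)
    interest_sim: interest similarity in [0, 1] (equation (3) in the paper)
    alpha: assumed weight balancing distance against interest similarity
    """
    norm_dist = dist / longest_shortest_path  # normalized into [0, 1]
    return alpha * norm_dist + (1 - alpha) * (1 - interest_sim)

# Close in distance and interests -> low score; far and dissimilar -> high.
print(geosim_score(1, 10, 0.9))  # ≈ 0.1
print(geosim_score(9, 10, 0.1))  # ≈ 0.9
```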

As for the similarity between interests, the WordNet ontology is used. WordNet is a lexical database that groups English words into sets of synonyms and records the semantic relations between them; these relations allow the similarity between two interests to be computed using equation (2).

Here, Sim(i₁, i₂) is the similarity between interest i₁ and interest i₂, bounded between zero and one; Γ(i₁, i₂) is the set of all common ancestors of the two interests; and −log[p_γ] is the information content of the concept γ. Equation (2) gives a score that indicates whether the two interests are close or not: when the score is one, the two interests are very close to each other.

In a social network, a single node can have a set of interests, {i₁, i₂, …, iₙ}. Equation (3) extends the pairwise similarity of equation (2) to two such sets.

Here, Sim(I₁, I₂) is the similarity between the two sets of interests I₁ and I₂.
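The exact form of equation (3) is not shown in this excerpt. One common way to lift a pairwise similarity to sets, sketched below under that assumption, is to match each interest with its most similar counterpart in the other set and average the best-match scores:

```python
def set_similarity(interests1, interests2, pair_sim):
    """Hypothetical set-level similarity in [0, 1] (a sketch; equation (3)'s
    exact form may differ). Each interest is matched with its most similar
    counterpart in the other set and the best-match scores are averaged.
    pair_sim(a, b) must return a pairwise similarity in [0, 1]."""
    if not interests1 or not interests2:
        return 0.0
    best1 = [max(pair_sim(i, j) for j in interests2) for i in interests1]
    best2 = [max(pair_sim(i, j) for i in interests1) for j in interests2]
    return (sum(best1) + sum(best2)) / (len(best1) + len(best2))

# Toy pairwise similarity: 1.0 for identical interests, 0.0 otherwise.
exact = lambda a, b: 1.0 if a == b else 0.0
print(set_similarity({"tennis", "chess"}, {"tennis", "piano"}, exact))  # 0.5
```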

The GeoSim algorithm is presented next.

The GeoSim Algorithm.


The algorithm can be divided into three stages: Centroids Selection, Distances and Similarities Calculation, and Clustering. GeoSim is an iterative algorithm: the three stages are executed repeatedly until the membership of the communities becomes stable, that is, until the nodes stop changing communities between consecutive iterations.

In the Centroids Selection stage, a mean node is selected to represent each community. In the first iteration of the algorithm, since no communities have been detected yet, an initial set of k mean nodes must be chosen; in every subsequent iteration, the node with the highest number of friends within each community becomes its mean node.

The second stage, namely the Distances and Similarities Calculation stage, is the heart of the algorithm. In this stage, all the required distances and similarities are calculated between each node and all of the mean nodes. The distance is calculated using Dijkstra's shortest path algorithm. Although our implementation uses Dijkstra's algorithm for its simplicity, other algorithms such as A* and Floyd-Warshall could be employed instead. The similarities between interests are measured using equation (3).

The last stage, the Clustering stage, uses the results of the previous stage to perform the actual clustering. The GeoSim scores between every node and all communities (mean nodes) are calculated using equation (1). Each node thus has one score per community and is assigned to the community whose mean node yields the lowest score.
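Taken together, the three stages give GeoSim the shape of a k-medoids-style loop. The sketch below is a high-level rendering, not the paper's listing: it assumes the mean nodes are (re)selected by highest degree, treats equation (1) as a black-box `geosim` function, and declares convergence when no node changes community.

```python
def geosim_cluster(nodes, k, geosim, degree, max_iter=100):
    """High-level sketch of the GeoSim loop (illustrative assumptions).

    nodes:  list of node ids
    k:      number of communities
    geosim: geosim(node, mean_node) -> score in [0, 1], lower = closer
    degree: degree(node) -> number of friends of the node
    """
    # Initial mean nodes: assumed here to be the k highest-degree nodes.
    means = sorted(nodes, key=degree, reverse=True)[:k]
    membership = {}
    for _ in range(max_iter):
        # Stages 2-3: score every node against every mean node and
        # assign it to the community with the lowest GeoSim score.
        new_membership = {
            n: min(range(k), key=lambda c: geosim(n, means[c]))
            for n in nodes
        }
        if new_membership == membership:  # community membership is stable
            break
        membership = new_membership
        # Stage 1 for the next iteration: the mean node of each community
        # is its member with the highest number of friends.
        for c in range(k):
            members = [n for n in nodes if membership[n] == c]
            if members:
                means[c] = max(members, key=degree)
    return membership
```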

The algorithm consists of a main loop that iterates until a threshold is met or a maximum number of iterations is reached. Within each iteration, the dominant cost is the shortest-path computation, which is O(n³); the total time complexity of the GeoSim algorithm is therefore O(n³).

The GeoSimMR algorithm is a parallel version of the original GeoSim that is modified to run on the MapReduce framework. Recall that GeoSim algorithm consists of three stages as shown in Figure

The three stages of the proposed GeoSim algorithm.

In GeoSimMR, each stage is implemented separately, since each stage depends on the result of the previous one and the stages cannot run simultaneously. In other words, each stage has its own implementation of the map and reduce methods.

Since each mapper in the framework processes its input file line by line, a common format must be adopted across all input files. In our implementation, each line in the input files has the following format:


In this stage, a mean node from each community is selected. In the map phase, each mapper receives a line that is formatted as discussed earlier. Based on the list of friends from the input line, the mapper counts the number of friends. The mapper then emits the community id of the node as the key and the node itself as the value.

The combiner receives the results from the mapper stage and combines the values of each pair of the same key into a list. The partitioner takes these lists and sends them to the reducers.

In the reduce stage, each reducer receives a community along with a list of nodes in that community. Using the number of friends calculated in the map phase, the reducer selects the mean node for that community. Recall that the mean node for a community is the node with the highest number of friends. Once the reducer finds the mean node, it emits the community id as the key and the node id with its interests as a value.
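The map and reduce roles of this stage can be sketched as follows; the record fields and the in-memory driver that stands in for Hadoop's shuffle are illustrative assumptions, not the paper's actual input format:

```python
from collections import defaultdict

def mean_node_map(node_id, community_id, interests, friends):
    """Map: emit the community id as key and the node, with its friend
    count, as value."""
    return community_id, {"id": node_id, "interests": interests,
                          "num_friends": len(friends)}

def mean_node_reduce(community_id, node_records):
    """Reduce: the mean node of a community is the member with the most
    friends; emit its id and interests."""
    mean = max(node_records, key=lambda rec: rec["num_friends"])
    return community_id, (mean["id"], mean["interests"])

def run_mean_node_stage(records):
    """Driver simulating the shuffle between the map and reduce phases."""
    grouped = defaultdict(list)
    for node_id, community_id, interests, friends in records:
        key, value = mean_node_map(node_id, community_id, interests, friends)
        grouped[key].append(value)
    return dict(mean_node_reduce(c, recs) for c, recs in grouped.items())

# Hypothetical input records: (node id, community id, interests, friends).
nodes = [(1, "A", ["tennis"], [2, 3]),
         (2, "A", ["chess"], [1]),
         (3, "B", ["piano"], [1, 2, 4]),
         (4, "B", ["violin"], [3])]
print(run_mean_node_stage(nodes))  # {'A': (1, ['tennis']), 'B': (3, ['piano'])}
```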

Figure

An example of the map and reduce phases of the Mean Nodes Selection stage of the GeoSimMR algorithm.

Mean Nodes Selection Stage.


This stage is the core of the GeoSimMR since all the necessary calculations are done here. The input to this stage is the same as the previous one in addition to the mean nodes obtained in the Mean Node Selection stage. The input file is split into lines and fed to the mappers. In the map phase, each mapper receives a group of lines representing the nodes as well as all the mean nodes and their interests. For each selected node, the mapper performs two operations. The first operation is the calculation of the similarities, using equation (3), between the current node and all the mean nodes. Since the output of the previous stage includes the interests of each mean node, finding the similarities here is not an issue.

The second operation for the mapper involves the distance calculation. Recall that Dijkstra’s algorithm, for a given source node, finds the shortest paths from that source node to all nodes in the graph. The source nodes in the GeoSimMR are the mean nodes. The mapper knows whether an input node is a source node or not by checking its ID against the IDs in the mean nodes list. If a certain node is found to be a mean node, the mapper starts the shortest path calculation between this node and all nodes in the graph.

Note that this process happens for each mean node in the graph. So, for a certain node, one distance is eventually produced per mean node, that is, k distances in total for k communities.

It is worth mentioning that when a node is emitted in the map phase, the key value has two parts. The first part of the key is the community id and the second part is the node id. The value of the pair is the node itself.

In the reduce phase, the reducer receives a list of distances for each node. These distances represent the cost of the path so far from this node to the mean node. The role of the reducer is to select the smallest distance and assign it to the node. Then, the reducer emits the node with its information along with the updated distance.

As stated before, this stage is an iterative process. This is because all nodes must be visited in a sequential manner in order to find the shortest paths from all mean nodes to all nodes in the graph. So, the stage keeps traversing and updating the node distances until all of them have been visited.
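One round of this iterative traversal can be sketched as the classic MapReduce shortest-path pattern: each reached node offers `distance + 1` to its friends, and the reducer keeps the minimum offer per node. The in-memory driver below stands in for Hadoop's shuffle, and unit edge weights are assumed:

```python
from collections import defaultdict

def sp_map(node_id, dist, friends):
    """Map: a node re-emits its own distance and offers dist + 1 to each
    friend (unit edge weights assumed)."""
    yield node_id, dist
    if dist != float("inf"):          # only reached nodes relax their edges
        for friend in friends:
            yield friend, dist + 1

def sp_reduce(node_id, distances):
    """Reduce: keep the smallest distance offered to the node."""
    return node_id, min(distances)

def sp_iteration(dist, adjacency):
    """One map/reduce round of the iterative shortest-path computation."""
    grouped = defaultdict(list)
    for node_id, d in dist.items():
        for key, value in sp_map(node_id, d, adjacency[node_id]):
            grouped[key].append(value)
    return dict(sp_reduce(n, ds) for n, ds in grouped.items())

# Distances from mean node 0 in the path graph 0-1-2-3 settle round by round.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = {0: 0, 1: float("inf"), 2: float("inf"), 3: float("inf")}
for _ in range(3):                    # iterate until no distance changes
    dist = sp_iteration(dist, adj)
print(dist)  # {0: 0, 1: 1, 2: 2, 3: 3}
```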

One thing to note here is that the first iteration differs from the rest. In the first iteration, both the similarity and the distance are calculated. Since mean node selection happens once per main iteration, the mean nodes, and therefore their interests, remain the same; consequently, there is no need to recalculate the similarities in the remaining iterations.

Figure

An example of the map and the reduce phases of the Distances and Similarities Calculation stage of the GeoSimMR algorithm.

Once looping through the friends is done, the same node, node 1, is processed again against communities 2 and 3, as shown in the figure. Node 1 is not the mean node for community 2; in this case, only the similarity between node 1 and the mean node of that community is calculated, and then the node is emitted.

When node 1 is emitted for community 1, the key consists of the community id and the node id. Similarly, when node 1 is emitted for community 2 the key is {2, 1}.

At the reduce stage, reducer 1 receives several lists, one of them is a list of node 4 for community 1. In Figure

Distances and Similarities Calculation Stage.


As discussed earlier, in this stage the calculated distances and similarities are used to perform the actual clustering using equation (1). The input to this stage is the output of the Distances and Similarities Calculation stage. In the map stage, each mapper receives a group of nodes whose distances and similarities have already been calculated. The mapper uses equation (1) to calculate the GeoSim score between each node and every mean node.

In the reduce phase, each reducer receives a node together with a list of GeoSim scores, one for each community. The reducer selects the lowest score and assigns the node to the corresponding community.
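A minimal sketch of this stage's map and reduce pair follows; the weighted form of equation (1) and the α = 0.7 value are assumptions carried over from the earlier discussion, not the paper's exact formula:

```python
def clustering_map(node_id, per_community):
    """Map: turn each (distance, similarity) pair into a GeoSim score.

    per_community maps a community id to the node's already-computed
    normalized distance and interest similarity with respect to that
    community's mean node. The weighted score form (alpha = 0.7) is an
    assumption for illustration.
    """
    alpha = 0.7
    for community_id, (norm_dist, sim) in per_community.items():
        score = alpha * norm_dist + (1 - alpha) * (1 - sim)
        yield node_id, community_id, score

def clustering_reduce(node_id, scored):
    """Reduce: assign the node to the community with the lowest score."""
    best_community, _ = min(scored, key=lambda cs: cs[1])
    return node_id, best_community

# Node 7 is both nearer and more similar to community "B"'s mean node.
emitted = clustering_map(7, {"A": (0.8, 0.2), "B": (0.1, 0.9)})
scores = [(community, score) for _, community, score in emitted]
print(clustering_reduce(7, scores))  # (7, 'B')
```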

Figure

An example of the map and reduce phases of the Clustering stage of the GeoSimMR algorithm.

Clustering Stage.


In this section, we show the results obtained using the GeoSim and GeoSimMR algorithms. GeoSim is evaluated to see how well it detects the communities in a given social network. GeoSimMR is tested to assess its scalability when executed on big networks on top of the MapReduce framework. The experiments are conducted using a four-node cluster. Each workstation has 22 GB of system memory. Three of these machines are powered by an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40 GHz, whereas the last workstation is powered by an Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60 GHz. In total, we have 42 mappers and 21 reducers. Apache Hadoop version 2.7 is used for testing the MapReduce version of the GeoSim algorithm.

GeoSim is tested to show the accuracy of the algorithm in detecting communities. For this test, we used a relatively small network to facilitate the analysis. The network consists of 50 nodes with 74 different interests distributed across all nodes, with four to five interests per node. The 74 interests can be grouped into five main categories: Languages, Medicine, Computer, Professions, and Sports. The distribution of interests is not random: each ten-node group is assigned interests under the same main category. For instance, nodes 0 to 9 have interests from the Languages group, nodes 10 to 19 from Medicine, and so on. We set up the input network this way to get an indication of how accurate the output of the algorithm is. The results of running GeoSim over the network are shown in Figure

The detected communities in the network by the GeoSim algorithm (α = 0.7).

Figure

Another interesting case is with node number 42 in the cyan community. This node is connected to two mean nodes: node 49 (for the cyan community) and node 12 (for the red community). So even though node 42 is directly connected to two mean nodes, it was grouped with 49 rather than 12. The reason behind this is that the mean node 49 has more interests in common with node 42 than the mean node 12.

Another experiment was carried out on a real graph from the DBLP database. A total of 30 nodes were selected: nodes 0 to 9 published papers related to databases, nodes 10 to 19 published papers related to data mining, and nodes 20 to 29 published papers related to artificial intelligence. Table

DBLP authors used as a ground truth in the community detection experiment.

Database | Data Mining | Artificial Intelligence |
---|---|---|
V. S. Subrahmanian | Alex Beutel | Donald Perlis |
David J. DeWitt | Rakesh Agrawal | Mary Anne Williams |
Michael Stonebraker | Ramakrishnan Srikant | Wei Liu |
Hans-Peter Kriegel | Christos Faloutsos | Keith Johnson |
Timos K. Sellis | Flavio Figueiredo | Sanjay Chawla |
Laura M. Haas | Bruno Ribeiro | James Bailey |
H. V. Jagadish | Jussara M. Almeida | Christopher Leckie |
Vassilis J. Tsotras | Yasuko Matsubara | Kotagiri Ramamohanarao |
Raymond T. Ng | Lei Li | Daren Ler |
Christian S. Jensen | Evangelos E. Papalexakis | Irena Koprinska |

Detected communities in a DBLP graph.

We have conducted several experiments to test the scalability of the GeoSimMR algorithm. In these experiments, the input data consists of 100,000 nodes, where each node has four to five interests.

The first experiment examines the different execution times of the GeoSimMR when varying the number of mappers and reducers. Figure

Execution time for the GeoSimMR using different number of mappers and reducers.

Figure

As shown in Figure

Recall from a previous section, the GeoSimMR algorithm consists of three stages: Mean Nodes Selection, Distance and Similarity Calculation and Clustering stages. To further analyze the GeoSimMR algorithm, the execution time of each stage is presented.

Figure

The execution time of the Mean Nodes Selection stage using different number of mappers and reducers.

Figure

The execution time of the Distances and Similarities Calculation stage using different number of mappers and reducers.

In contrast to the execution time of the first stage, which showed minor variation when the number of mappers changed, the second stage clearly benefited from the parallelism. In fact, the execution time of the second stage is very close to the execution time of all stages combined: 96% of the whole time was spent executing the second stage. This is due to the fact that all the necessary calculations (distances and similarities) are performed in this stage.

Note that the distance and similarity calculation stage is an iterative phase. To analyze this stage even further, the execution time for each iteration is examined. Figure

The execution time for every iteration of the second stage of the GeoSimMR.

Figure

The execution time for the Clustering stage using different number of mappers and reducers.

It is clear that the clustering stage does not take nearly as much time as the second stage, because the third stage does not involve any complex calculation. Recall from the previous section that the third stage uses the distances and similarities obtained from the second stage to group the nodes into communities.

Two more experiments were conducted to analyze the effect of changing the number of the mappers and reducers. Figure

The overall execution time for different mappers while fixing the number of reducers.

It is clear that when the number of mappers is small, adding another mapper drastically improves the execution time. For instance, with one mapper, the algorithm takes about 11,000 seconds to finish; adding a second mapper reduces the execution time to 5,600 seconds (about a 50% improvement). As the number of mappers increases, the effect of each added mapper starts to fade. The reason behind this behavior is that adding new mappers while the number of reducers is fixed causes more results to come out of the map stage. This leads to a point where the current number of reducers cannot handle the amount of data from the map stage; in other words, the reducers become the bottleneck of the whole job.

Figure

The overall execution time for different reducers while fixing the number of mappers to 42.

Another experiment was conducted to show the gain obtained from using MapReduce. For the sake of comparison, we created a centralized version in which all stages are re-coded and optimized to run on a single machine without MapReduce. We ran the MapReduce version (with 42 mappers and 21 reducers) and the centralized version to measure the time difference between the two. Five datasets are used in this experiment, with 100K, 250K, 1M, 5M and 10M nodes, each clustered into 100 communities. The results of this test are shown in Figure

The overall execution time for the GeoSim and GeoSimMR using different number of nodes (Time axis is semi-log).

In Figure

For a smaller number of nodes, for example, 100,000 nodes, the MapReduce version of the algorithm takes a longer time (around 73% more time) than its centralized version. This is because the overhead of the communication between the mappers and reducers along with the cost of read/write operations are much higher than the computations for this number of nodes. When the number of nodes increases to 1,000,000 nodes, the overhead time becomes very small and negligible compared to the time required for the computation.

In this paper, we presented the GeoSim algorithm, which clusters a given social network into communities. Unlike previous community detection techniques, which take only the network structure into consideration, GeoSim incorporates the interests of each node into the clustering process. This paper also introduced a parallel version of GeoSim, namely GeoSimMR, that runs on top of the MapReduce framework. Having a parallel community detection algorithm is crucial since social networks are typically huge and a single machine is inadequate to process them. Both GeoSim and GeoSimMR have been evaluated and the results were presented in the paper. GeoSim achieved high accuracy in detecting communities by grouping close nodes that have similar interests into a single community. In addition, GeoSimMR takes advantage of the available mappers and reducers to reduce the execution time drastically.

In the future, we plan to implement our algorithms using other Big Data platforms such as Spark and conduct an analysis and comparison with the MapReduce platform. Additionally, it would be interesting to use these Big Data implementations to find the influential nodes in the detected communities.

The authors have no competing interests to declare.