A New Algorithm for Detecting Local Community Based on Random Walk

This paper presents a new algorithm for local community discovery. It employs a new vertex selection strategy that considers not only the boundary structure of the candidate local community but also the probability that the investigated vertex will return to the candidate local community. A local random walk, which does not require global information, is adopted to compute this return probability. We compare our algorithm with four algorithms that are the best known so far. For a better evaluation, the datasets include not only the computer-generated graphs of the standard benchmark but also real-world networks that are classical ones in global community discovery. The experimental results show that our algorithm outperforms the others on the computer-generated graphs. On real-world networks, the performance of our algorithm is approximately the same as that of the algorithm proposed by Luo, Wang and Promislow.


Introduction
Extracting community structures in complex networks has gained much attention recently. Generally, a network can be modelled as a graph $G = (V, E)$, where $V$ is a set of vertices representing individuals and $E$ is a set of edges representing the interactions between the vertices.
There is no universally accepted definition of community. Conventionally, a community of a network is a group of vertices that are densely connected amongst themselves while being sparsely connected to the vertices outside the group. Usually, the vertices of one community exhibit certain common characteristics. Many algorithms have been proposed to discover community structure in real-world networks. However, these algorithms are designed to find the entire community structure of the graph. This requirement prevents them from handling dynamic networks and large-scale networks. Unfortunately, real-world networks are usually larger than the scale that even the fastest algorithms can handle [1].
Recent works focus on finding local community structure, that is, detecting the community of a given start vertex. This task arises in many everyday applications. For example, the police might like to quantify the local communities of a suspect given his social network. Several methods have been proposed to extract local communities. However, they suffer in one or more ways. For instance, the algorithms proposed in [2], [3] are designed to process graphs that have a minimal connected initial topology. The methods in [4]-[6] require some degree of global information, such as certain partitions of the graph. The algorithm in [7] is designed for dynamic networks, and its time complexity is too high. Due to the above limitations, these methods are not widely used. Currently, the popular methods for finding local communities are [8]-[10], which are discussed further in Section 2.
A local community can be formally defined as follows. Given an undirected graph $G = (V, E)$ and a start vertex $s \in V$, and in the absence of global knowledge, a subgraph $G_s = (V_s, E_s)$ containing the start vertex $s$ is extracted from $G$, where $s$ is more densely connected with the vertices of $V_s$ than with the vertices of $V \setminus V_s$.
The algorithms of [8]-[10] have the following problems. Clauset's algorithm [8] gives a hierarchy of communities and does not output a definite local community. The algorithms proposed in [9] and [10] perform well on graphs that contain significant local communities but perform poorly on graphs without strong community structure. In addition, the accuracy of these algorithms drops dramatically when they process vertices that lie on the boundary of a local community. The reason is that they are mainly designed as greedy optimizations of a local community measure: the merging and removal of vertices in the temporary local community only investigate the boundary. Therefore, these methods cannot explain why the output local community is relevant to the start vertex. This limitation affects further applications of the found local community. In the following sections, we present a measure of the relevance between the local community and the start vertex.
This paper proposes a new algorithm for extracting local communities. The vertex selection of our algorithm considers not only the boundary structure affected by the insertion of a vertex but also the probability that the vertex will return to the candidate local community. This return probability is computed by a local random walk, which does not require the global structure of the graph. We compare our algorithm with four algorithms that are the best known so far. The datasets include not only the computer-generated graphs of the standard benchmark but also graphs modelled on real-world networks that are classical ones in global community discovery. The latter represent different types of community structure, which helps to evaluate the algorithms. The experimental results show that our algorithm outperforms the others on the graphs of the standard benchmark and performs almost the same as the algorithm proposed by Luo, Wang and Promislow [9] on real-world datasets.
The rest of this paper is organized as follows. Section 2 presents the related works. The proposed algorithm is given in Section 3. The evaluation of the algorithm on artificial and real-world datasets is presented in Section 4. Section 5 discusses some further improvements. Finally, Section 6 concludes the paper.

Related Works
This section describes three state-of-the-art algorithms for detecting local communities. In addition, we provide formal definitions of the related measures.
Firstly, Clauset gave the definition of local modularity [8] as the portion of the connecting edges in the boundary edges, where a connecting edge denotes an edge that connects a vertex in the local community to a vertex outside the community.
Let $C$ denote the local community and $B$ the set of vertices that comprise its boundary, in which each vertex has at least one neighbor not in $C$. Clauset [8] defines a boundary-adjacency matrix over these vertices and, based on it, proposes the measure "local modularity".
Luo, Wang and Promislow [9] proposed several algorithms based on a framework that maintains two vertex queues: an adding queue and a deleting queue. Merging a vertex from the adding queue and removing a vertex from the deleting queue both increase the local community measure. The algorithm repeatedly computes these two queues and performs the addition and removal operations until both queues are empty; that is, neither adding any vertex to $C$ nor removing any vertex from $C$ improves the measure. At that time, the community $C$ is output. The employed measure $M$, referred to as Formula 3 in the rest of this paper, is defined in terms of the edges inside $C$ and the edges leaving $C$. Luo, Wang and Promislow provided three versions of the proposed algorithm [11]: greedy addition, add-all addition and K-like move. As stated in [11], the algorithm using add-all addition performs best among these three versions. We implement the add-all addition and greedy addition versions, denoted by $LWP_A$ and $LWP_n$.
Recently, Bagrow [10] employed the greedy measure "outwardness" to select the vertices to merge into the local community. The outwardness of vertex $v$ with respect to community $C$ is defined as:

$$\Omega_v(C) = \frac{1}{k_v} \sum_{i \in n(v)} \left( [i \notin C] - [i \in C] \right) = \frac{k_v^{out} - k_v^{in}}{k_v},$$

where $n(v)$ are the neighbors of $v$, $k_v$ is the degree of $v$, and $k_v^{in}$ and $k_v^{out}$ are the numbers of neighbors of $v$ inside and outside $C$. Bagrow [10] defined a $p$-strong community as the stopping criterion. A community $C$ is called $p$-strong if a fraction $p$ of the vertices in $C$ have more neighbors inside $C$ than outside. Bagrow [10] stated that multiple values of $p$ can be used simultaneously.
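The outwardness measure and the $p$-strong criterion described above can be illustrated with a minimal Python sketch. The adjacency-list representation, the function names and the toy graph are assumptions made for illustration; they are not taken from Bagrow's implementation.

```python
from typing import Dict, Hashable, Set

# Assumed representation: adjacency list mapping each vertex to its neighbor set.
Graph = Dict[Hashable, Set[Hashable]]


def outwardness(graph: Graph, community: Set[Hashable], v: Hashable) -> float:
    """(neighbors of v outside C - neighbors of v inside C) / degree of v."""
    neighbors = graph[v]
    inside = len(neighbors & community)
    outside = len(neighbors) - inside
    return (outside - inside) / len(neighbors)


def is_p_strong(graph: Graph, community: Set[Hashable], p: float) -> bool:
    """True if at least a fraction p of C has more neighbors inside C than outside."""
    strong = sum(
        1 for v in community
        if len(graph[v] & community) > len(graph[v] - community)
    )
    return strong >= p * len(community)


# Toy graph (hypothetical): a triangle {a, b, c} with a pendant vertex d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
triangle = {"a", "b", "c"}
```

A greedy procedure in this style would repeatedly merge the outside neighbor with the lowest outwardness and stop once the community is $p$-strong for the chosen $p$.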
From the above introduction, we conclude that these three algorithms adopt greedy strategies on their respective measures. Unfortunately, these measures are defined mainly on the boundary edges and do not consider the start vertex directly. For example, LWP's algorithm has to check whether the start vertex lies in the resulting local community, since it contains a removal operation. In addition, this shortcoming becomes more prominent when the start vertex lies on the boundary of a local community, which is discussed further in the section presenting the experimental results.

Proposed method: Local Community Discovery Algorithm using Local Random Walk
Different from current methods, our algorithm is dedicated to evaluating the relation between the local community and the start vertex instead of investigating only the boundary of the local community.
To achieve this, we employ a local random walk strategy to compute the visiting probability of the vertices, which is used for the vertex selection. Next, we introduce the random walk strategy and our local random walk method.
There are various random walk strategies on graphs, such as Markov chains [12], quantum random walks [13] and random walks based on the vertex degree distribution [7]. This paper develops a new local random walk based on Markov chains. The necessary definitions are given first. Let $G = (V, E)$ be a connected graph with $n$ vertices and $m$ edges. Consider a random walk on $G$ that starts at vertex $v_0$; at the $t$-th step, the walk moves to each neighbor of $v_t$ with probability $1/d(v_t)$, where $d(v_t)$ is the degree of $v_t$ (equivalently, the number of neighbors of $v_t$). Then the sequence of random nodes $(v_t : t = 0, 1, \dots)$ is a Markov chain. We denote the matrix of transition probabilities of this Markov chain by $M = (p_{ij})_{i,j \in V}$. We have

$$p_{ij} = \begin{cases} 1/d(i), & \text{if } ij \in E, \\ 0, & \text{otherwise.} \end{cases}$$

We denote the probability that vertex $u$ is visited at the $t$-th step by $\pi_t^{v_0}(u)$; the vector $\pi_t^{v_0}$ stores the probabilities of all vertices in graph $G$. Then,

$$\pi_{t+1}^{v_0} = M^{T} \pi_t^{v_0}. \qquad (6)$$

This paper adopts the random walk with restart, because we wish to reveal the relation between the start vertex and the other vertices. The formula is

$$\pi_{t+1}^{v_0} = (1 - c)\, M^{T} \pi_t^{v_0} + c\, e_0, \qquad (7)$$

where $e_0$ is the vector whose entry at the position of $v_0$ is 1 and whose other entries are 0. The term $c\, e_0$ in Formula (7) indicates that the random walk restarts at $v_0$ with probability $c$ at each step. Note that the computation of Formulas (6) and (7) requires global information about the graph, so we cannot employ them directly. We propose an approximate computation of the random walk with restart that is carried out while searching the candidate subgraph. In addition, we use an adjacency list to store the transition probabilities $p_{ij}$ instead of the matrix $M$.
Then, the probability $\pi_{t+1}^{v_0}(u)$ can be computed by the following formula:

$$\pi_{t+1}^{v_0}(u) = (1 - c) \sum_{w \in \Gamma(u)} p_{wu}\, \pi_t^{v_0}(w) + c\, e_0(u), \qquad (8)$$

where $\Gamma(u)$ denotes the neighbor set of vertex $u$. We can search the adjacency list of vertex $u$ to obtain $\Gamma(u)$. This computation therefore does not demand global adjacency information; that is, it can be calculated locally.
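The local computation of the random walk with restart can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes an adjacency-list graph, uniform transition probabilities $1/d(w)$, and a hypothetical restart probability `c = 0.2`, a value the text does not specify. Probability is pushed from each visited vertex to its neighbors, which yields the same result as summing over the neighbors of each target vertex.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # assumed adjacency-list representation


def rwr_step(graph: Graph, prob: Dict[Hashable, float],
             start: Hashable, c: float) -> Dict[Hashable, float]:
    """One restart-walk step touching only vertices with probability and their neighbors."""
    nxt: Dict[Hashable, float] = {}
    for w, mass in prob.items():
        share = (1.0 - c) * mass / len(graph[w])  # transition probability 1 / d(w)
        for u in graph[w]:
            nxt[u] = nxt.get(u, 0.0) + share
    nxt[start] = nxt.get(start, 0.0) + c          # restart term for the start vertex
    return nxt


def local_rwr(graph: Graph, start: Hashable,
              c: float = 0.2, steps: int = 3) -> Dict[Hashable, float]:
    """Visiting probabilities after `steps` steps of a walk started at `start`."""
    prob = {start: 1.0}
    for _ in range(steps):
        prob = rwr_step(graph, prob, start, c)
    return prob
```

After `steps` iterations only vertices within distance `steps` of the start vertex carry probability, so no global adjacency information is needed.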
Furthermore, we restrict the probability computation to the searching subgraph and its outside boundary. Formally, denote the current subgraph by $G_{sub}$, its vertex set by $V_{sub}$ and the outside boundary by $OB$. Then $OB$ can be formulated as

$$OB = \{ v \in V \setminus V_{sub} : \exists\, u \in V_{sub},\ (u, v) \in E \}.$$

Therefore, Formula (8) is changed into:

$$\pi_{t+1}^{v_0}(u) = (1 - c) \sum_{w \in \Gamma(u) \cap (V_{sub} \cup OB)} p_{wu}\, \pi_t^{v_0}(w) + c\, e_0(u). \qquad (9)$$

We next introduce the vertex selection strategy of our algorithm. We select one vertex to add to the current community at each step. Our goal is to select the vertices that lie in the same local community as the start vertex. Therefore, the visiting probability of a walk that starts from the start vertex is a suitable quantity to consider when choosing vertices. Suppose $u$ is the vertex being evaluated for selection. Besides the visiting probability of $u$ at step $t$, we also measure the fraction of probability with which $u$ will go back into $G_{sub}$. Let $\pi_t^{v_0}(u)$ be the visiting probability of vertex $u$ when the start vertex is $v_0$. Our measure for choosing vertex $u$ combines this visiting probability with the "return" fraction, which enters with an exponent of 2. This item forces vertices that have a high "return" probability to obtain high priority for selection.
The exponent is set to 2 for the following two reasons. (i) If the exponent is set to 1, the measure also works, but not as well as with 2. When the exponent is 1, the measure equals the sum of the visiting probabilities of the neighbors of $u$ in $G_{sub}$. It does not indicate the fraction of probability with which the vertex will return to the candidate local community. The higher the "return" fraction of a vertex, the closer its connection with the start vertex. Therefore, we do not set the exponent to 1. (ii) We have tested exponents larger than 2, and the selection results are almost the same.
In brief, we choose the vertex that has the maximum value of the selection measure to be merged into the candidate community $G_{sub}$. Then we recompute the visiting probabilities and the values of the measure until the stopping criteria are satisfied.
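The select-and-merge step can be sketched as follows. The exact selection measure is not reproduced here, so the score below is a stand-in built on explicit assumptions: the "return" fraction of a candidate $u$ is taken to be the share of its neighbors that lie in the current subgraph, and the score is the visiting probability of $u$ multiplied by that fraction squared. The restart probability `c = 0.2`, the step count and all function names are hypothetical.

```python
from typing import Dict, Hashable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # assumed adjacency-list representation


def outside_boundary(graph: Graph, sub: Set[Hashable]) -> Set[Hashable]:
    """Vertices outside `sub` that are adjacent to at least one vertex of `sub`."""
    return {v for u in sub for v in graph[u] if v not in sub}


def restricted_rwr(graph: Graph, sub: Set[Hashable], start: Hashable,
                   c: float = 0.2, steps: int = 3) -> Dict[Hashable, float]:
    """Restart walk whose probability is kept only on `sub` and its outside boundary."""
    region = sub | outside_boundary(graph, sub)
    prob = {start: 1.0}
    for _ in range(steps):
        nxt: Dict[Hashable, float] = {}
        for w, mass in prob.items():
            share = (1.0 - c) * mass / len(graph[w])
            for u in graph[w]:
                if u in region:  # probability leaving the region is dropped
                    nxt[u] = nxt.get(u, 0.0) + share
        nxt[start] = nxt.get(start, 0.0) + c
        prob = nxt
    return prob


def select_vertex(graph: Graph, sub: Set[Hashable], start: Hashable,
                  c: float = 0.2, steps: int = 3) -> Optional[Hashable]:
    """Pick the boundary vertex with the highest assumed score, or None if none exists."""
    prob = restricted_rwr(graph, sub, start, c, steps)

    def score(u: Hashable) -> float:
        back = len(graph[u] & sub) / len(graph[u])  # assumed "return" fraction
        return prob.get(u, 0.0) * back ** 2         # exponent 2, as in the text

    candidates = outside_boundary(graph, sub)
    return max(candidates, key=score) if candidates else None
```

With this stand-in score, a candidate whose neighbors mostly lie inside the current subgraph is preferred over a hub that mainly connects to unexplored vertices, which is the behavior the text attributes to the "return" item.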
The stopping criteria of existing algorithms are too simple. For example, most algorithms, such as the greedy and add-all LWP algorithms [11], halt when no insertion or deletion of a vertex can improve the measure. On the other hand, several algorithms, such as Bagrow's algorithm [10], stop when their measures reach the desired thresholds. Furthermore, some algorithms do not stop until the whole graph has been searched; one instance is the algorithm proposed by Clauset [8]. We next present the stopping criteria for our select-and-merge procedure.
Our algorithm employs the measure $M$ defined by Formula 3 to mark "potential" local communities. Since our algorithm merges the vertices one by one, the measure will increase or decrease dramatically when a potential local community is found and this local community is significant. We call these points of sharp increase or decrease "jumping" points. We record the potential local communities indicated by these jumping points. Finally, we determine which local community to output as the result.
We now discuss the jumping points case by case. Let $m_t$ be the value of measure $M$ at step $t$.
Case 1: Increasingly jumping point. We call $m_t$ an increasingly jumping point if it satisfies the condition of Formula 11, where $\alpha_1$ and $n_1$ are parameters such that $0 < \alpha_1 < 1$ and $n_1$ is a positive integer. One example is illustrated on the left of Figure 1. This type of jumping point shows that a vertex which is not in the same local community as the start vertex has been merged into the current community. Therefore, we record the community at step $t - 1$ as one potential community in this case.
Case 2: Decreasingly jumping point. We call $m_t$ a decreasingly jumping point if it satisfies the condition of Formula 12, where $\alpha_2$, $\beta_2$ and $n_2$ are parameters and $n_2$ is a positive integer. By Formula 12, $t$ must be the smallest step whose $M$ value begins the flat or increasing segment. This constraint guarantees that the correct flat or increasing segment is obtained after a dramatically decreasing step.
An example of a decreasingly jumping point followed by a flat segment is displayed in the middle of Figure 1. The case followed by an increasing segment is shown on the right of Figure 1. Since the flat or increasing segment begins at step $t$, we record the community at step $t$ as one potential community in this case.
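The two kinds of jumping points can be illustrated with a short Python sketch. The precise conditions of Formulas 11 and 12 are not reproduced here, so the tests below are simplified assumptions: an increasing jump at step $t$ means the next `n1` values all exceed $m_{t-1}$ by at least `alpha1`, and a decreasing jump at step $t$ means a drop of at least `alpha2` followed by `n2` values that stay within `beta2` below $m_t$ or rise above it. Apart from `alpha1 = 0.05`, the default parameter values are hypothetical.

```python
from typing import List, Optional, Tuple


def first_jump(m: List[float], alpha1: float = 0.05, n1: int = 2,
               alpha2: float = 0.05, beta2: float = 0.01,
               n2: int = 2) -> Optional[Tuple[str, int]]:
    """Return (kind, step of the community to record) for the first jump, else None."""
    for t in range(1, len(m)):
        # Assumed increasing test: n1 consecutive values at least alpha1 above m[t-1].
        window = m[t:t + n1]
        if len(window) == n1 and all(x - m[t - 1] >= alpha1 for x in window):
            return ("increase", t - 1)  # record the community before the jump
        # Assumed decreasing test: a sharp drop, then a flat or rising segment.
        tail = m[t + 1:t + 1 + n2]
        if (m[t - 1] - m[t] >= alpha2 and len(tail) == n2
                and all(x >= m[t] - beta2 for x in tail)):
            return ("decrease", t)      # record the community where the segment begins
    return None
```

Because the loop returns at the first match, the sketch mirrors the choice of reporting only the local community indicated by the first jumping point.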
We next present the procedure of our algorithm as follows:

Local Community Discovery Using Local Random Walk
The reason that we use Formula 8 to compute $\pi_t^{v_0}$ when $1 \le t \le 3$ is that we wish to know the visiting probability of the vertices that are very close to the start vertex $v_0$ (that is, the vertices whose distance from $v_0$ is no larger than 3). Though this computation is not as accurate as the classical random walk, it is sufficient for the vertex selection. Moreover, our algorithm outputs only the local community found by the first jumping point. It does not output a hierarchy of communities, in accordance with our problem statement.
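We next present the time complexity of our algorithm. The cost of the first three steps depends on the degree of the vertices in graph $G$. For the remaining steps $t > 3$, the running time is bounded by the size of the region explored around $v_0$, which is much smaller than the whole graph $G$. Thus, the average running time of our algorithm is governed by this explored region.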

Experiments for Evaluation
In this section, we employ the benchmark method proposed by Bagrow [10] to give an objective comparison with the existing algorithms for finding local community structures. Bagrow's method [10] uses computer-generated networks. Besides these artificial networks, we also evaluate the algorithms on well-known real-world datasets that show different structures of local community.
Computer generated networks: Bagrow's benchmark creates a classical graph with 128 vertices, which is randomly partitioned into four reference communities containing an equal number of vertices (32 vertices each). Each vertex has an average degree of 16, where the out-degree $z_{out}$ equals the number of edges that connect the vertex to vertices outside its community. Clearly, a small $z_{out}$ indicates a strong community structure. To reduce the effect of randomness on the evaluation, we generate 100 graphs for each $z_{out}$ from 1 to 7.
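A benchmark graph of this kind can be generated with the following Python sketch. It assumes the usual planted-partition construction, in which every pair of vertices is linked independently with a probability chosen so that the expected degree is 16 and the expected out-degree is $z_{out}$; the function name, the seed handling and the adjacency-list output are illustrative choices rather than part of Bagrow's benchmark code.

```python
import random
from typing import Dict, List, Set, Tuple


def make_benchmark(z_out: float, seed: int = 0, n: int = 128, groups: int = 4,
                   z: float = 16.0) -> Tuple[Dict[int, Set[int]], List[int]]:
    """Return (adjacency list, community label per vertex) for one random graph."""
    rng = random.Random(seed)
    size = n // groups
    label = [v // size for v in range(n)]
    rng.shuffle(label)                    # random partition into equal-sized groups
    p_in = (z - z_out) / (size - 1)       # expected z - z_out neighbors inside
    p_out = z_out / (n - size)            # expected z_out neighbors outside
    graph: Dict[int, Set[int]] = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if label[u] == label[v] else p_out
            if rng.random() < p:
                graph[u].add(v)
                graph[v].add(u)
    return graph, label
```

Calling `make_benchmark(z_out, seed=i)` for `i` in `range(100)` and each `z_out` from 1 to 7 would reproduce the size of the experiment described above, with the seed values being an arbitrary choice.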

Real-world datasets:
We choose several well-known datasets modelled on real-world graphs, namely the karate club network [1], the US college football league [14] and the dolphin social network [15]. These graphs have different structures of local community.
The karate club network contains 34 members as vertices and 78 edges representing friendship between members. Due to a disagreement between the club's administrator and the club's instructor, the club split into two smaller communities. These two communities are prominent. The US college football network is constructed from the game schedule of the 2000 season. The nodes in the network represent the 115 teams, while the edges represent the 613 games played in that year. The teams are divided into 11 conferences of 8-12 teams each, and games are generally more frequent between teams of the same conference than between teams of different conferences. Therefore, the community structure is strong in each conference.
The dolphin social network studied by Lusseau [15] was constructed from observations of a community of 62 bottlenose dolphins. This network is divided into two groups according to their age, but the community structure is not as significant as that of the karate club network. We next provide the evaluation measures.
Normalized mutual information (NMI) is an important evaluation criterion in local community discovery [16], [10]. The correct partition is denoted by $\{C_R, V \setminus C_R\}$, where $C_R$ is the reference local community. Similarly, the found partition is denoted by $\{C_F, V \setminus C_F\}$, where $C_F$ is the resulting local community. A confusion matrix $N$ is employed in the NMI measure, where the rows correspond to the real communities and the columns correspond to the found communities. The element $N_{ij}$ of $N$ is the number of nodes in the real community $i$ that appear in the found community $j$. Then the NMI measure of similarity between the partitions, based on information theory, is defined below:

$$I = \frac{-2 \sum_{i} \sum_{j} N_{ij} \log\left( \frac{N_{ij} N}{N_{i.} N_{.j}} \right)}{\sum_{i} N_{i.} \log\left( \frac{N_{i.}}{N} \right) + \sum_{j} N_{.j} \log\left( \frac{N_{.j}}{N} \right)},$$

where the sum over row $i$ of matrix $N_{ij}$ is denoted $N_{i.}$, the sum over column $j$ is denoted $N_{.j}$, and $N$ inside the logarithms is the total number of nodes. Thus, an NMI score of 1 shows that both communities are identical and a score of 0 shows that the two communities are totally independent.
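The NMI of a reference community and a found community can be computed with the Python sketch below. It assumes that each community is compared together with its complement, so the confusion matrix is 2 by 2; the function name and the set-based interface are illustrative.

```python
import math
from typing import Hashable, Set


def nmi(reference: Set[Hashable], found: Set[Hashable],
        vertices: Set[Hashable]) -> float:
    """NMI between the partitions {C_R, V \\ C_R} and {C_F, V \\ C_F}."""
    real = [reference, vertices - reference]
    detected = [found, vertices - found]
    n = len(vertices)
    # Confusion matrix: rows are real communities, columns are found communities.
    conf = [[len(r & f) for f in detected] for r in real]
    rows = [sum(row) for row in conf]
    cols = [sum(col) for col in zip(*conf)]
    num = -2.0 * sum(
        conf[i][j] * math.log(conf[i][j] * n / (rows[i] * cols[j]))
        for i in range(2) for j in range(2) if conf[i][j] > 0
    )
    den = sum(x * math.log(x / n) for x in rows + cols if x > 0)
    # den is zero only when both partitions are trivial, hence identical.
    return num / den if den != 0 else 1.0
```

Empty cells of the confusion matrix are skipped, following the usual convention that $0 \log 0 = 0$.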
On the other hand, we employ the $F$-measure to reveal the effectiveness of local community discovery algorithms. Precision is the fraction of the retrieved $C_F$ that lies in $C_R$.


We adopt the weighted harmonic mean of precision and recall. That is, the traditional $F$-measure is

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

The parameter setting is given as follows. Formula 11 shows the conditions under which a step is considered a jumping point: $\alpha_1$ represents the increasing gap and $n_1$ specifies how many consecutive points should reach the gap. Similarly, $\alpha_2$ and $n_2$ represent the decreasing gap and the number of consecutive points reaching the gap, respectively. In addition, $\beta_2$ stands for the threshold of a flat segment and $m_2$ gives the number of consecutive points whose values should not go beyond the threshold.
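The precision, recall and $F$-measure used in the evaluation can be computed directly from the reference community and the found community. The sketch below is a minimal illustration; the function name and the convention of returning 0 for empty inputs are assumptions.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f(reference: Set[Hashable],
                       found: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and their harmonic mean for a found community."""
    hit = len(reference & found)                      # correctly retrieved vertices
    precision = hit / len(found) if found else 0.0
    recall = hit / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if hit else 0.0
    return precision, recall, f
```

For example, a found community of three vertices, two of which lie in a reference community of four vertices, has precision 2/3, recall 1/2 and an $F$-measure of 4/7.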
In our algorithm, we set $\alpha_1 = 0.05$. Since the measure $M$ diminishes to 0 when the subgraph $G_{sub}$ covers the whole graph $G$, the search should halt when $G_{sub}$ becomes too large. In this case, there is probably no significant local community containing the start vertex $v_0$. We use a threshold parameter to stop the algorithm when the measure $M$ is too small. This threshold is set to 0.15 in our algorithm.
We compare our algorithm, denoted by LRW, with the state-of-the-art algorithms so far, which are Bagrow's algorithm [10] and the $LWP_n$ and $LWP_A$ algorithms proposed by Luo, Wang and Promislow [11].
Note that, for Bagrow's algorithm, we choose the best $p$ in the range $\{0.75, 0.76, \dots, 1\}$ for all the datasets. We found that the overall performance is best when $p = 0.9$, which agrees with the result in [10]. The results of LRW rank second. It appears that the overall results of the $LWP_n$ algorithm and our algorithm are better than those of the other algorithms; the overall performance is shown in Figure 3. On the other hand, our algorithm outperforms the other algorithms on the datasets of the standard benchmark of local community discovery. It improves the F-measure and the normalized mutual information by about 0.05 on average. Moreover, the results show that our algorithm is more stable than the others when given different start vertices.
Furthermore, our local community evaluation measure can adopt different metrics. Metrics for evaluating (local) communities have been intensively studied, for example in [17]. Once the value of a metric is normalized into [0,1], our stopping criteria are applicable. Therefore, our method can use such a metric for local community discovery.
Moreover, our method can be integrated with semantic networks for detecting topical communities [18]. The approach is to replace the topology-based probability of moving from a vertex to its neighbor with a measure of topical similarity. That is, a neighbor that is more similar has a higher visiting probability.
than many global community discovery algorithms.

Figure 1. Different cases of jumping points


The recall is the fraction of $C_R$ that is successfully detected by $C_F$.

TELKOMNIKA ISSN: 1693-6930. A New Algorithm for Detecting Local Community Based on Random Walk (Yueping Li)

Figure 3. Overall performance of the results on real-world datasets

Figure 4. $F_{avg}$ and $I_{avg}$ results on 128-node networks

Figure 5. Overall performance of the results on 128-node networks

Table 1. Results on real-world datasets