UniWalk: Unidirectional Random Walk Based  Scalable SimRank Computation  over Large Graph

 

Abstract

 

SimRank is an effective structural similarity measurement  between two vertices in a graph, which can be used  in many applications like recommender systems. Although progresses  have been achieved, existing methods still face challenges  to handle large graphs. Besides huge index construction and  maintenance cost, the existing methods require considerable  search space and time overheads in the online SimRank query.  In this paper, we design a Monte Carlo based method, Uni-  Walk, to enable the fast top-k SimRank computation over large  undirected graphs without indexing. UniWalk directly locates the  top-k similar vertices for any single source vertex u via O(R)  sampling paths originating from u only, which avoids the selection  of candidate vertex set C and the following O(|C|R) bidirectional  sampling paths starting from u and each candidate respectively  in existing methods. We also design a space-efficient method  to reduce intermediate results, and a path-sharing strategy to  optimize path sampling for multiple source vertices. Furthermore,  we extend UniWalk to existing distributed graph processing  frameworks to improve its scalability. We conduct extensive  experiments to illustrate that UniWalk has high scalability, and  outperforms the state-of-the-art methods by orders of magnitude,  and such an improvement is achieved without any indexing  overheads.

 

Existing System

 

SimRank can be used in finding  similar users for a given user u, and then the items purchased  by these similar users are combined as the recommended  candidates for u.  Intuitively, SimRank follows the idea that two objects are  similar if they are related to similar objects. Although SimRank is well-studied, SimRank is still prohibitively  expensive and difficult due to its recursively dependency  computation. Roughly speaking, existing approaches to SimRank can  be categorized into straightforward and Monte Carlo methods  approaches. Originally, SimRank is defined recursively, which  can be computed iteratively until a fixed point is reached. That  is, to compute all-pair SimRank scores, the straightforward  method. Monte Carlo methods for SimRank, which yield results  with lower overheads. SimRank score between the vertex and v is also interpreted as the expected meeting distance  between u and v [2], which can be simulated by sufficient  bidirectional random walks (BiWalk for short) starting from  u and v respectively. The time cost of such a method is  independent of the graph size (i.e., the number of vertices). In  addition, the error bound of results computed by Monte Carlo  methods exponentially declines with respect to the number of  sampling paths. These two factors make Monte Carlo methods  have potential to compute SimRank scores in a large graph.

 

Proposed System  

 

In this paper, we overcome the limitations of existing  methods by introducing a new unidirectional random walk  strategy named UniWalk for SimRank. Compared with the  existing BiWalk, UniWalk has the following advantages: It  does not require indexing or candidate selection phase; It  needs much fewer sampling paths in the online query phase  to compute the top-k single-source SimRank scores; It enables  high parallelism and easy computation sharing, and supports  the single-source, multiple-sources, and all-pair SimRank computation  adaptively. Currently, UniWalk works on undirected  graphs, and we leave directed graphs as our future work.  Specifically, our main contributions are as below:  1) We propose a novel unidirectional random walk,  UniWalk, to compute the top-k SimRank for given  source vertices S . With a rectified factor, UniWalk  can simulate the original BiWalk in computing Sim-  Rank scores. The time cost of UniWalk (excluding  sorting cost for the final top-k results) is O(|S |RL2),  where R is the number of 2L-length sampling paths.  Usually, R and L do not vary much. Thus, the cost  of UniWalk is approximately linear to the number of  source vertices.  2) We further introduce two optimization strategies,  including an M-times candidates strategy to relax  the space requirement, and a path-sharing strategy  for multiple sources to lower path sampling cost.  We also devise a distributed UniWalk method, and  illustrate that UniWalk can easily leverage existing  distributed graph processing frameworks to improve  the scalability.   3) We conduct extensive experiments on real-life  datasets. Experiments show that UniWalk can achieve  high precision and scalability, and outperform the  existing methods by orders of magnitude.

 

 

 

 

Conclusion  

 

In this paper, we devise a new unidirectional random walk  method named UniWalk to directly locate vertices with the topSimRank scores for given sources on a large graph. UniWalk  does not require indexing or candidate selection phases, needs  fewer sampling paths while achieving the same expected values  as the existing BiWalk method, and enables higher parallelism  and scalability. In addition, optimization strategies are devised  to further reduce the space requirement and path sampling cost  for multiple sources, and the distributed UniWalk is developed  to improve its scalability. The experimental results on large  graphs illustrate that UniWalk outperforms the state-of-the-art  methods by orders of magnitude.

 

References 

 

[1] P. Nguyen, P. Tomeo, T. Noia, and E. Sciascio, “An evaluation of  simrank and personalized pagerank to build a recommender system for  the web of data,” in WWW, 2015, pp. 1477–1482.

[2] G. Jeh and J. Widom, “Simrank: a measure of structural-context  similarity,” in KDD, 2002, pp. 538–543.

[3] W. Yu, X. Lin, and W. Zhang, “Towards efficient simrank computation  on large networks,” in ICDE, 2013, pp. 601–612.

[4] G. He, H. Feng, C. Li, and H. Chen, “Parallel simrank computation on  large graphs with iterative aggregation,” in WWW, 2010, pp. 543–552.

[5] Z. Li, Y. Fang, Q. Liu, J. Cheng, R. Cheng, and J. Lui, “Walking in  the cloud: Parallel simrank at scale,” PVLDB, vol. 9, no. 1, pp. 24–35,  2015.

[6] Y. Shao, B. Cui, L. Chen, M. Liu, and X. Xie, “An efficient similarity  search framework for simrank over large dynamic graphs,” PVLDB,  vol. 8, no. 8, pp. 838–849, 2015.

[7] M. Kusumoto, T. Maehara, and K. Kawarabayashi, “Scalable similarity  search for simrank,” in SIGMOD, 2014, pp. 325–336.

[8] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy  estimate and optimization techniques for simrank computation,” VLDB  J., vol. 19, no. 1, pp. 45–66, 2010.

[9] D. Fogaras and B. R´acz, “Scaling link-based similarity search,” in  WWW, 2005, pp. 641–650.

[10] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn,  N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph  processing.” in SIGMOD, 2010, pp. 135–146.