UniWalk: Unidirectional Random Walk Based Scalable SimRank Computation over Large Graph
Abstract
SimRank is an effective structural similarity measurement between two vertices in a graph, which can be used in many applications like recommender systems. Although progresses have been achieved, existing methods still face challenges to handle large graphs. Besides huge index construction and maintenance cost, the existing methods require considerable search space and time overheads in the online SimRank query. In this paper, we design a Monte Carlo based method, Uni- Walk, to enable the fast top-k SimRank computation over large undirected graphs without indexing. UniWalk directly locates the top-k similar vertices for any single source vertex u via O(R) sampling paths originating from u only, which avoids the selection of candidate vertex set C and the following O(|C|R) bidirectional sampling paths starting from u and each candidate respectively in existing methods. We also design a space-efficient method to reduce intermediate results, and a path-sharing strategy to optimize path sampling for multiple source vertices. Furthermore, we extend UniWalk to existing distributed graph processing frameworks to improve its scalability. We conduct extensive experiments to illustrate that UniWalk has high scalability, and outperforms the state-of-the-art methods by orders of magnitude, and such an improvement is achieved without any indexing overheads.
Existing System
SimRank can be used in finding similar users for a given user u, and then the items purchased by these similar users are combined as the recommended candidates for u. Intuitively, SimRank follows the idea that two objects are similar if they are related to similar objects. Although SimRank is well-studied, SimRank is still prohibitively expensive and difficult due to its recursively dependency computation. Roughly speaking, existing approaches to SimRank can be categorized into straightforward and Monte Carlo methods approaches. Originally, SimRank is defined recursively, which can be computed iteratively until a fixed point is reached. That is, to compute all-pair SimRank scores, the straightforward method. Monte Carlo methods for SimRank, which yield results with lower overheads. SimRank score between the vertex u and v is also interpreted as the expected meeting distance between u and v [2], which can be simulated by sufficient bidirectional random walks (BiWalk for short) starting from u and v respectively. The time cost of such a method is independent of the graph size (i.e., the number of vertices). In addition, the error bound of results computed by Monte Carlo methods exponentially declines with respect to the number of sampling paths. These two factors make Monte Carlo methods have potential to compute SimRank scores in a large graph.
Proposed System
In this paper, we overcome the limitations of existing methods by introducing a new unidirectional random walk strategy named UniWalk for SimRank. Compared with the existing BiWalk, UniWalk has the following advantages: It does not require indexing or candidate selection phase; It needs much fewer sampling paths in the online query phase to compute the top-k single-source SimRank scores; It enables high parallelism and easy computation sharing, and supports the single-source, multiple-sources, and all-pair SimRank computation adaptively. Currently, UniWalk works on undirected graphs, and we leave directed graphs as our future work. Specifically, our main contributions are as below: 1) We propose a novel unidirectional random walk, UniWalk, to compute the top-k SimRank for given source vertices S . With a rectified factor, UniWalk can simulate the original BiWalk in computing Sim- Rank scores. The time cost of UniWalk (excluding sorting cost for the final top-k results) is O(|S |RL2), where R is the number of 2L-length sampling paths. Usually, R and L do not vary much. Thus, the cost of UniWalk is approximately linear to the number of source vertices. 2) We further introduce two optimization strategies, including an M-times candidates strategy to relax the space requirement, and a path-sharing strategy for multiple sources to lower path sampling cost. We also devise a distributed UniWalk method, and illustrate that UniWalk can easily leverage existing distributed graph processing frameworks to improve the scalability. 3) We conduct extensive experiments on real-life datasets. Experiments show that UniWalk can achieve high precision and scalability, and outperform the existing methods by orders of magnitude.
Conclusion
In this paper, we devise a new unidirectional random walk method named UniWalk to directly locate vertices with the topk SimRank scores for given sources on a large graph. UniWalk does not require indexing or candidate selection phases, needs fewer sampling paths while achieving the same expected values as the existing BiWalk method, and enables higher parallelism and scalability. In addition, optimization strategies are devised to further reduce the space requirement and path sampling cost for multiple sources, and the distributed UniWalk is developed to improve its scalability. The experimental results on large graphs illustrate that UniWalk outperforms the state-of-the-art methods by orders of magnitude.
References
[1] P. Nguyen, P. Tomeo, T. Noia, and E. Sciascio, “An evaluation of simrank and personalized pagerank to build a recommender system for the web of data,” in WWW, 2015, pp. 1477–1482.
[2] G. Jeh and J. Widom, “Simrank: a measure of structural-context similarity,” in KDD, 2002, pp. 538–543.
[3] W. Yu, X. Lin, and W. Zhang, “Towards efficient simrank computation on large networks,” in ICDE, 2013, pp. 601–612.
[4] G. He, H. Feng, C. Li, and H. Chen, “Parallel simrank computation on large graphs with iterative aggregation,” in WWW, 2010, pp. 543–552.
[5] Z. Li, Y. Fang, Q. Liu, J. Cheng, R. Cheng, and J. Lui, “Walking in the cloud: Parallel simrank at scale,” PVLDB, vol. 9, no. 1, pp. 24–35, 2015.
[6] Y. Shao, B. Cui, L. Chen, M. Liu, and X. Xie, “An efficient similarity search framework for simrank over large dynamic graphs,” PVLDB, vol. 8, no. 8, pp. 838–849, 2015.
[7] M. Kusumoto, T. Maehara, and K. Kawarabayashi, “Scalable similarity search for simrank,” in SIGMOD, 2014, pp. 325–336.
[8] D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov, “Accuracy estimate and optimization techniques for simrank computation,” VLDB J., vol. 19, no. 1, pp. 45–66, 2010.
[9] D. Fogaras and B. R´acz, “Scaling link-based similarity search,” in WWW, 2005, pp. 641–650.
[10] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing.” in SIGMOD, 2010, pp. 135–146.