CONTINUOUS TOP-K MONITORING ONDOCUMENT STREAMS

CONTINUOUS TOP-K MONITORING ONDOCUMENT STREAMS

ABSTRACT

The efﬁcient processing of document streams plays an important role in many information ﬁltering systems. Emergingapplications, such as news update ﬁltering and social network notiﬁcations, demand presenting end-users with the most relevantcontent to their preferences. In this work, user preferences are indicated by a set of keywords. A central server monitors the documentstream and continuously reports to each user the top-k documents that are most relevant to her keywords. Our objective is to supportlarge numbers of users and high stream rates, while refreshing the top-k results almost instantaneously. Our solution abandons thetraditional frequency-ordered indexing approach. Instead, it follows an identiﬁer-ordering paradigm that suits better the nature of theproblem. When complemented with a novel, locally adaptive technique, our method offers (i) proven optimality w.r.t. the number ofconsidered queries per stream event, and (ii) an order of magnitude shorter response time (i.e., time to refresh the query results) thanthe current state-of-the-art.

EXISTING SYSTEM:

in information ﬁltering the objective is to remove froman information stream those items that are of no interestto the end users. Information ﬁltering approacheshave been studied for text streams, however, theirfocus is to determine an appropriate relevance threshold,based on the user’s proﬁle and the stream’s characteristics. The actual ﬁltering involves ﬁxed thresholds(and therefore binary relevance assessments per streamitem), rather than relative similarity and ranking.Publishsubscribe is a messaging pattern where thepublishers of messages categorize their messages intoclasses, and the subscribers receive only those messagesthat fall in their classes of interest. UnlikeCTQD, there is typically a set of predeﬁned classes(instead of terms) and there is no notion of relative ranking.does consider relative similarity, however, itsgoal is to identify the k most relevant queries for everynewly published message. proposes a probabilisticalgorithm that keeps a select subset of the messages in asliding window to support approximate top-k processing.Still in the publishsubscribe setting, considersthe social annotation of news articles. Speciﬁcally, givena set of news stories (documents), it maintains for eachof them the k most rel

PROPOSED SYSTEM:

The contributions we makein this paper are summarised as follows:_ Our advanced approach (MRIO) outperforms thecurrent state-of-the-art by one order of magnitude._ MRIO employs novel bounds that offer proven optimalityw.r.t. the number of considered queries perstream event._ MRIO is more than two times faster than RIO,demonstrating that a skillful adaptation of IDorderingto CTQDs alone (as in RIO) is not enoughto derive the improvements achieved in this work._ We further improve the performance of MRIOby restructuring its query index (i.e., rearrangingthe queries inside) to better exploit locality andstrengthen the pruning effectiveness of its bounds_ Our evaluation has a broader experimental valuetoo, because it involves (besides the state-of-theartfor CTQDs) methods for different formulations,which perform competitively, and were never putin the same testbed before.

CONCLUSION

In information ﬁltering the objective is to remove froman information stream those items that are of no interestto the end users. Information ﬁltering approacheshave been studied for text streams, however, theirfocus is to determine an appropriate relevance threshold,based on the user’s proﬁle and the stream’s characteristics. The actual ﬁltering involves ﬁxed thresholds(and therefore binary relevance assessments per streamitem), rather than relative similarity and ranking.Publishsubscribe is a messaging pattern where thepublishers of messages categorize their messages intoclasses, and the subscribers receive only those messagesthat fall in their classes of interest. UnlikeCTQD, there is typically a set of predeﬁned classes(instead of terms) and there is no notion of relative ranking. does consider relative similarity, however, itsgoal is to identify the k most relevant queries for everynewly published message.proposes a probabilisticalgorithm that keeps a select subset of the messages in asliding window to support approximate top-k processing.Still in the publishsubscribe setting, considersthe social annotation of news articles. Speciﬁcally, givena set of news stories (documents), it maintains for eachof them the k most related tweets posted

REFERENCES

[1] P. Haghani, S. Michel, and K. Aberer, “The gist of everything new:personalized top-k processing over web 2.0 streams.” in CIKM,2010, pp. 489–498.

[2] K. Mouratidis and H. Pang, “Efﬁcient evaluation of continuoustext search queries,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 10,pp. 1469–1482, 2011.

[3] N. Vouzoukidou, B. Amann, and V. Christophides, “Processingcontinuous text queries featuring non-homogeneous scoring functions.”in CIKM, 2012, pp. 1065–1074.

[4] A. Hoppe, “Automatic ontology-based user proﬁle learning fromheterogeneous web resources in a big data context.” PVLDB, pp.1428–1433, 2013.

[5] A. Lacerda and N. Ziviani, “Building user proﬁles to improveuser experience in recommender systems,” in WSDM, 2013, pp.759–764.

[6] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. J. Lin,“Earlybird: Real-time search at twitter,” in ICDE, 2012, pp. 1360–1369.

[7] L. Wu, W. Lin, X. Xiao, and Y. Xu, “LSII: an indexing structure forexact real-time search on microblogs,” in ICDE, 2013, pp. 482–493.

[8] J. Zobel and A. Moffat, “Inverted ﬁles for text search engines,”ACM Comput. Surv., vol. 38, no. 2, 2006.

[9] R. Fagin, A. Lotem, and M. Naor, “Optimal aggregation algorithmsfor middleware,” J. Comput. Syst. Sci., vol. 66, no. 4, pp.614–656, 2003.

[10] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien,“Efﬁcient query evaluation using a two-level retrieval process.”in CIKM, 2003, pp. 426–434.

CONTINUOUS TOP-K MONITORING ONDOCUMENT STREAMS

Recent Post

Project Categories