CONTINUOUS TOP-K MONITORING ONDOCUMENT STREAMS

 

ABSTRACT

The efficient processing of document streams plays an important role in many information filtering systems. Emergingapplications, such as news update filtering and social network notifications, demand presenting end-users with the most relevantcontent to their preferences. In this work, user preferences are indicated by a set of keywords. A central server monitors the documentstream and continuously reports to each user the top-k documents that are most relevant to her keywords. Our objective is to supportlarge numbers of users and high stream rates, while refreshing the top-k results almost instantaneously. Our solution abandons thetraditional frequency-ordered indexing approach. Instead, it follows an identifier-ordering paradigm that suits better the nature of theproblem. When complemented with a novel, locally adaptive technique, our method offers (i) proven optimality w.r.t. the number ofconsidered queries per stream event, and (ii) an order of magnitude shorter response time (i.e., time to refresh the query results) thanthe current state-of-the-art.

EXISTING SYSTEM:

in information filtering the objective is to remove froman information stream those items that are of no interestto the end users. Information filtering approacheshave been studied for text streams, however, theirfocus is to determine an appropriate relevance threshold,based on the user’s profile and the stream’s characteristics. The actual filtering involves fixed thresholds(and therefore binary relevance assessments per streamitem), rather than relative similarity and ranking.Publishsubscribe is a messaging pattern where thepublishers of messages categorize their messages intoclasses, and the subscribers receive only those messagesthat fall in their classes of interest. UnlikeCTQD, there is typically a set of predefined classes(instead of terms) and there is no notion of relative ranking.does consider relative similarity, however, itsgoal is to identify the k most relevant queries for everynewly published message. proposes a probabilisticalgorithm that keeps a select subset of the messages in asliding window to support approximate top-k processing.Still in the publishsubscribe setting,  considersthe social annotation of news articles. Specifically, givena set of news stories (documents), it maintains for eachof them the k most rel

PROPOSED SYSTEM:

The contributions we makein this paper are summarised as follows:_ Our advanced approach (MRIO) outperforms thecurrent state-of-the-art by one order of magnitude._ MRIO employs novel bounds that offer proven optimalityw.r.t. the number of considered queries perstream event._ MRIO is more than two times faster than RIO,demonstrating that a skillful adaptation of IDorderingto CTQDs alone (as in RIO) is not enoughto derive the improvements achieved in this work._ We further improve the performance of MRIOby restructuring its query index (i.e., rearrangingthe queries inside) to better exploit locality andstrengthen the pruning effectiveness of its bounds_ Our evaluation has a broader experimental valuetoo, because it involves (besides the state-of-theartfor CTQDs) methods for different formulations,which perform competitively, and were never putin the same testbed before.

 

 

CONCLUSION

In information filtering the objective is to remove froman information stream those items that are of no interestto the end users. Information filtering approacheshave been studied for text streams, however, theirfocus is to determine an appropriate relevance threshold,based on the user’s profile and the stream’s characteristics. The actual filtering involves fixed thresholds(and therefore binary relevance assessments per streamitem), rather than relative similarity and ranking.Publishsubscribe is a messaging pattern where thepublishers of messages categorize their messages intoclasses, and the subscribers receive only those messagesthat fall in their classes of interest. UnlikeCTQD, there is typically a set of predefined classes(instead of terms) and there is no notion of relative ranking. does consider relative similarity, however, itsgoal is to identify the k most relevant queries for everynewly published message.proposes a probabilisticalgorithm that keeps a select subset of the messages in asliding window to support approximate top-k processing.Still in the publishsubscribe setting, considersthe social annotation of news articles. Specifically, givena set of news stories (documents), it maintains for eachof them the k most related tweets posted

 

REFERENCES

[1] P. Haghani, S. Michel, and K. Aberer, “The gist of everything new:personalized top-k processing over web 2.0 streams.” in CIKM,2010, pp. 489–498.

[2] K. Mouratidis and H. Pang, “Efficient evaluation of continuoustext search queries,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 10,pp. 1469–1482, 2011.

[3] N. Vouzoukidou, B. Amann, and V. Christophides, “Processingcontinuous text queries featuring non-homogeneous scoring functions.”in CIKM, 2012, pp. 1065–1074.

[4] A. Hoppe, “Automatic ontology-based user profile learning fromheterogeneous web resources in a big data context.” PVLDB, pp.1428–1433, 2013.

[5] A. Lacerda and N. Ziviani, “Building user profiles to improveuser experience in recommender systems,” in WSDM, 2013, pp.759–764.

[6] M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. J. Lin,“Earlybird: Real-time search at twitter,” in ICDE, 2012, pp. 1360–1369.

[7] L. Wu, W. Lin, X. Xiao, and Y. Xu, “LSII: an indexing structure forexact real-time search on microblogs,” in ICDE, 2013, pp. 482–493.

[8] J. Zobel and A. Moffat, “Inverted files for text search engines,”ACM Comput. Surv., vol. 38, no. 2, 2006.

[9] R. Fagin, A. Lotem, and M. Naor, “Optimal aggregation algorithmsfor middleware,” J. Comput. Syst. Sci., vol. 66, no. 4, pp.614–656, 2003.

[10] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Y. Zien,“Efficient query evaluation using a two-level retrieval process.”in CIKM, 2003, pp. 426–434.