FINDING RELATED FORUM POSTS THROUGH CONTENT SIMILARITY OVER INTENTION-BASED SEGMENTATION

ABSTRACT

We study the problem of finding forum posts related to a post at hand. In contrast to traditional approaches for finding related documents, which compare the content of the posts as a whole, we consider each post as a set of segments, each written with a different goal in mind. We advocate that the relatedness between two posts should be based on the similarity of their respective segments that are intended for the same goal, i.e., that convey the same intention. This means that the same terms may weigh differently in the relatedness score depending on the intention of the segment in which they are found. We have developed a segmentation method that monitors a number of text features and identifies the parts of a post where significant jumps occur, each indicating a point where a segment border should be placed. The generated segments of all the posts are clustered to form intention clusters, and similarities across posts are then calculated through similarities across segments with the same intention. We experimentally illustrate the effectiveness and efficiency of our segmentation method and of our overall approach to finding related forum posts.

EXISTING SYSTEM:

Identifying the segments in a forum post is a challenging task. Forum posts are typically one or two paragraphs long, with complete sentences. They do not follow the abbreviated style used in microblogs, but at the same time, since they are intended for interactive discussions, they are not verbose and they lack the structural constructs (e.g., sections) typically used in full-text documents to identify thematic units. Furthermore, since they are driven by the common needs of forum participants, they draw their content heavily from a common vocabulary (which depends on the nature/topic of the forum), which means that topic variation, i.e., variation in the vocabulary used, is not a very distinctive factor for the identification of the segments. To deal with this limitation we resort to text features (characteristics) whose variation can identify a passage from one segment to another. We made this choice after realizing that the style, tone, brevity, verb tense and other grammatical characteristics may serve as indicators of a change in the message that the author is trying to communicate. We refer to these characteristics as features and use the term communication means (CM for short) to refer to groups of such features. The idea of using communication means for capturing the intention of a segment (or intended message) is analogous to the idea of using keywords to represent a topic. Similar to the way that a variation in a weighted vector of words signals a change in the topic, a variation in a vector of text features signals a change in the intended message.
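The idea that a jump in a vector of text features signals a segment border can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the three toy features (question form, first-person ratio, a crude "-ed" past-tense ratio), the L1 distance between consecutive sentences, and the threshold of 0.5.

```python
import re

# Hypothetical word list for a first-person feature (illustrative only).
FIRST_PERSON = {"i", "my", "me", "we", "our"}

def features(sentence):
    """Toy feature vector: [is_question, first-person ratio, '-ed' ratio]."""
    words = re.findall(r"[a-z']+", sentence.lower())
    n = max(len(words), 1)
    return [
        1.0 if sentence.strip().endswith("?") else 0.0,
        sum(w in FIRST_PERSON for w in words) / n,
        sum(w.endswith("ed") for w in words) / n,  # crude past-tense proxy
    ]

def borders(sentences, threshold=0.5):
    """Return each index i where a border falls between sentence i-1 and i."""
    vecs = [features(s) for s in sentences]
    result = []
    for i in range(1, len(vecs)):
        # L1 distance between consecutive feature vectors (assumed measure).
        jump = sum(abs(a - b) for a, b in zip(vecs[i - 1], vecs[i]))
        if jump >= threshold:
            result.append(i)
    return result

post = [
    "I installed the new driver yesterday.",
    "I rebooted and the printer stopped working.",
    "Does anyone know how to roll back the driver?",
]
print(borders(post))
```

In this example the first two sentences share a narrative, past-tense style, while the third switches to a question, so the only border is placed before the third sentence even though all three sentences draw on the same vocabulary.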

PROPOSED SYSTEM:

We formally introduce a novel method for finding related forum posts that treats each post as a set of segments and computes content similarity only across segments of the same intention. We provide a complete methodology for segment identification and for grouping the derived segments into intention clusters that exploit the variation of the text features. We present extensive experiments with real users that confirm the existence of such segments in forum posts of different domains and verify the effectiveness of the individual steps and decisions of our methodology, including the border selection mechanisms, the selection of features, and the functions and weights for capturing text feature variation. We describe a fully unsupervised multi-segment ranking technique that provides the top-k forum posts related to a reference post by considering segments with similar intentions and using content similarities within each cluster to derive an overall score between each forum post and the reference post. We evaluate the effectiveness of the overall approach on the recommendation of related forum posts using ratings and feedback by users in three different domains.
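The multi-segment ranking idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's scoring function: segments are assumed to be already assigned to intention clusters (the labels "problem" and "request" are hypothetical), within-cluster similarity is plain cosine over raw term counts, and the overall score is an unweighted mean over the reference post's clusters.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using raw term counts."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta)
    na = math.sqrt(sum(v * v for v in ta.values()))
    nb = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (na * nb) if na and nb else 0.0

def score(reference, candidate):
    """Posts are dicts mapping an intention label to that segment's text.
    Only segments with the same intention are compared (assumed aggregation:
    mean over the reference post's intentions)."""
    if not reference:
        return 0.0
    shared = reference.keys() & candidate.keys()
    return sum(cosine(reference[c], candidate[c]) for c in shared) / len(reference)

def top_k(reference, posts, k):
    """Return the ids of the k posts with the highest score."""
    ranked = sorted(posts, key=lambda pid: score(reference, posts[pid]), reverse=True)
    return ranked[:k]

reference = {
    "problem": "printer stopped working after driver update",
    "request": "how to roll back driver",
}
posts = {
    "a": {"problem": "printer stopped working after update",
          "request": "how to roll back driver"},
    "b": {"problem": "how to roll back driver",
          "request": "printer stopped working after update"},
    "c": {"problem": "laptop battery drains fast"},
}
print(top_k(reference, posts, 2))
```

Posts "a" and "b" contain exactly the same words, but in "b" they appear in segments with the opposite intentions, so "b" scores far lower than "a". A whole-post comparison would not distinguish the two.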

CONCLUSION:

We proposed a novel approach for matching a reference post to the k most related posts in a collection. Our method identifies and exploits post segments that convey similar author intentions. We presented several experiments regarding the right segmentation criteria, the effectiveness of the segmentation algorithms and the formation of intention clusters that prove that a rather intuitive concept, that of the author's intention to communicate a certain message, can be effectively captured by an automated process. Moreover, due to the nature of the posts, measuring the relatedness score after having distinguished the different segments/messages that the authors intend to communicate proved more effective than the direct comparison of the whole posts. Specifically, according to an evaluation by real users and in comparison with direct full-text comparison, our approach increased mean precision by 10%, 12% and 10.1% for posts in a product support, a travel, and a programming forum, respectively.
