UNDERSTAND SHORT TEXTS BY HARVESTING AND ANALYZING SEMANTIC KNOWLEDGE

 

ABSTRACT

Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing tools, ranging from part-of-speech tagging to dependency parsing, cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text mining, such as topic modeling. Third, short texts are more ambiguous and noisy, and are generated in enormous volume, which further increases the difficulty of handling them. We argue that semantic knowledge is required to better understand short texts. In this work, we build a prototype system for short text understanding which exploits semantic knowledge provided by a well-known knowledgebase and automatically harvested from a web corpus. Our knowledge-intensive approaches disrupt traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that semantic knowledge is indispensable for short text understanding, and that our knowledge-intensive approaches are both effective and efficient in discovering the semantics of short texts.

EXISTING SYSTEM:

We discuss related work in three aspects: text segmentation, POS tagging, and semantic labeling.

Text Segmentation. We consider text segmentation as dividing a text into a sequence of terms. Existing approaches can be classified into two categories: statistical approaches and vocabulary-based approaches. Statistical approaches, such as the N-gram model, calculate the frequencies of words co-occurring as neighbors in a training corpus. When the frequency exceeds a predefined threshold, the corresponding neighboring words can be treated as a term. Vocabulary-based approaches extract terms in a streaming manner by checking for the existence or frequency of a term in a predefined vocabulary. In particular, the Longest Cover method, which is widely adopted for text segmentation due to its simplicity and real-time nature, searches for the longest terms contained in a vocabulary while scanning the text.
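The Longest Cover method described above can be sketched as a greedy left-to-right scan. The following is a minimal illustration, not the system's actual implementation; the function name, the `max_len` cap, and the toy vocabulary are all assumptions made for the example.

```python
def longest_cover_segment(text, vocab, max_len=5):
    """Greedy Longest Cover sketch: at each position, take the longest
    run of words found in the vocabulary; otherwise emit a single word."""
    words = text.split()
    segments = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest candidate term first, shrinking the window.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in vocab:
                match = candidate
                i += n
                break
        if match is None:
            # No vocabulary term starts here; fall back to a single word.
            match = words[i]
            i += 1
        segments.append(match)
    return segments

vocab = {"new york", "new york times"}
print(longest_cover_segment("new york times subscription", vocab))
# -> ['new york times', 'subscription']
```

Note the greedy choice: because "new york times" is checked before "new york", the longer term wins, which is exactly the behavior the Longest Cover heuristic relies on.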

PROPOSED SYSTEM

Overall, our contributions in this work are threefold:

- We observe the prevalence of ambiguity in short texts and the limitations of traditional approaches in handling it;
- We achieve better accuracy in short text understanding by harvesting semantic knowledge from a web corpus and existing knowledgebases, and by introducing knowledge-intensive approaches based on lexical-semantic analysis;
- We improve the efficiency of our approaches to facilitate online, instant short text understanding.

CONCLUSION

In this work, we propose a generalized framework to understand short texts effectively and efficiently. More specifically, we divide the task of short text understanding into three subtasks: text segmentation, type detection, and concept labeling. We formulate text segmentation as a weighted Maximal Clique problem, and propose a randomized approximation algorithm to maintain accuracy and improve efficiency at the same time. We introduce a Chain Model and a Pairwise Model which combine lexical and semantic features to conduct type detection. They achieve better accuracy than traditional POS taggers on the labeled benchmark. We employ a Weighted Vote algorithm to determine the most appropriate semantics for an instance when ambiguity is detected. The experimental results demonstrate that our proposed framework outperforms existing state-of-the-art approaches in the field of short text understanding. As future work, we plan to analyze and incorporate the impact of spatio-temporal features into our framework for short text understanding.
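The Weighted Vote step for concept labeling can be illustrated with a small sketch. This is an assumed scoring scheme for exposition only, not the authors' exact algorithm: each disambiguated context term votes for the candidate concepts of an ambiguous instance, weighted by a concept-to-context relatedness score, and the highest-scoring concept wins. The function name, the relatedness table, and the apple example are all hypothetical.

```python
from collections import defaultdict

def weighted_vote(candidates, neighbors, relatedness):
    """Pick the candidate concept with the largest total weighted vote.

    candidates  -- candidate concepts for the ambiguous instance
    neighbors   -- already-disambiguated context terms
    relatedness -- dict mapping (concept, neighbor) -> similarity weight
    """
    scores = defaultdict(float)
    for concept in candidates:
        for neighbor in neighbors:
            # Each context term contributes a vote proportional to its
            # semantic relatedness to the candidate concept.
            scores[concept] += relatedness.get((concept, neighbor), 0.0)
    return max(candidates, key=lambda c: scores[c])

# Toy relatedness weights: "apple" near "ipad" suggests the company sense,
# while "apple" near "pie" suggests the fruit sense.
rel = {("fruit", "pie"): 0.9, ("company", "pie"): 0.1,
       ("company", "ipad"): 0.95, ("fruit", "ipad"): 0.05}
print(weighted_vote(["fruit", "company"], ["ipad"], rel))  # -> company
```

The design point is that disambiguation is collective: a single strongly related context term can outweigh several weakly related ones, which is why the votes are weighted rather than counted.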
