Topic Models for Unsupervised Cluster Matching


Abstract

We propose topic models for unsupervised cluster matching, the task of finding matchings between clusters in different domains without correspondence information. For example, the proposed model finds correspondences between document clusters in English and German without alignment information such as dictionaries or parallel sentences/documents. The proposed model assumes that documents in all languages share a common latent topic structure, and that a potentially infinite number of topic proportion vectors lie in a latent topic space shared by all languages. Each document is generated using one of the topic proportion vectors together with language-specific word distributions. By inferring the topic proportion vector used for each document, we can allocate documents in different languages to common clusters, where each cluster is associated with a topic proportion vector. Documents assigned to the same cluster are considered matched. We develop an efficient inference procedure for the proposed model based on collapsed Gibbs sampling. The effectiveness of the proposed model is demonstrated on real data sets, including multilingual corpora of Wikipedia articles and product reviews.
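The generative story in the abstract can be sketched as follows. This is a hypothetical minimal sketch, not the paper's exact specification: the potentially infinite set of topic proportion vectors is truncated to a finite number of clusters, and all variable names, sizes, and the uniform cluster prior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 5        # shared latent topics
n_clusters = 10     # finite truncation of the set of topic proportion vectors
vocab_sizes = {"en": 50, "de": 60}   # language-specific vocabularies

# Topic proportion vectors (one per cluster), shared by all languages
theta = rng.dirichlet(np.ones(n_topics), size=n_clusters)

# Language-specific word distributions, one per topic
phi = {lang: rng.dirichlet(np.ones(v), size=n_topics)
       for lang, v in vocab_sizes.items()}

def generate_document(lang, n_words=20):
    """Pick a shared cluster, then draw each word by sampling a topic from
    that cluster's proportions and a word from the language-specific
    topic-word distribution."""
    cluster = rng.integers(n_clusters)
    topics = rng.choice(n_topics, size=n_words, p=theta[cluster])
    words = [rng.choice(vocab_sizes[lang], p=phi[lang][z]) for z in topics]
    return cluster, words

# Documents in different languages drawn from the same cluster are "matched"
c_en, doc_en = generate_document("en")
c_de, doc_de = generate_document("de")
```

Because the proportion vectors are shared while the word distributions are language-specific, documents in different languages can occupy the same cluster even though their vocabularies never overlap.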

Existing System 

A number of methods have been proposed for unsupervised object matching, also called cross-domain object matching, such as kernelized sorting and its convex extension, least squares object matching, matching canonical correlation analysis, and Bayesian object matching. These methods find a matching by sorting objects so as to maximize dependence between domains. For example, kernelized sorting uses the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence and sorts objects by maximizing HSIC between objects in two domains using the Hungarian algorithm. These methods have three limitations. First, they find only one-to-one matchings. Second, the number of domains must be two. Third, each domain must contain the same number of objects. The proposed model has none of these limitations: it finds cluster matchings from data with more than two domains, and each domain can contain a different number of objects.
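A rough sketch of this dependence-maximization idea is below. It is an illustrative simplification, not the published kernelized sorting algorithm: it uses a linear kernel, identity initialization, and a plain alternating scheme (linearize the HSIC objective, solve a linear assignment, repeat), so it can stop at a local optimum. Note the one-to-one, equal-size restriction discussed above is visible in the code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def centered_gram(X):
    """Centered linear-kernel Gram matrix H K H."""
    n = X.shape[0]
    K = X @ X.T
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernelized_sorting_sketch(X, Y, n_iter=20):
    """Find a one-to-one matching X[i] <-> Y[perm[i]] by iteratively
    maximizing a linearization of the HSIC objective trace(Kc P Lc P^T)
    with the Hungarian algorithm. Both domains must have the same size."""
    Kc, Lc = centered_gram(X), centered_gram(Y)
    n = Kc.shape[0]
    perm = np.arange(n)
    for _ in range(n_iter):
        P = np.eye(n)[perm]           # permutation matrix, P[i, perm[i]] = 1
        profit = Kc @ P @ Lc          # gradient of trace(Kc P Lc P^T) / 2
        _, perm = linear_sum_assignment(-profit)  # maximize total profit
    return perm

# Example: Y is a noisy permuted copy of X
rng = np.random.default_rng(1)
X = rng.normal(size=(15, 3))
Y = X[rng.permutation(15)] + 0.01 * rng.normal(size=(15, 3))
perm = kernelized_sorting_sketch(X, Y)
```

Since both centered Gram matrices are positive semidefinite, each assignment step cannot decrease the HSIC objective, so the scheme monotonically improves the matching score.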

Proposed System 

The proposed model is an unsupervised method for cluster matching, the task of finding matchings between clusters in different domains when neither correspondence nor cluster information is available. For example, the proposed model finds correspondences between document clusters in English and German without alignment information such as dictionaries or parallel sentences/documents. Here, parallel sentences/documents means that each English sentence/document has its German translation attached. Although in this paper we assume that the given data are text documents in multiple languages, where each language corresponds to a domain, the proposed model is applicable to a wide range of discrete data, such as image data, where each image is represented by visual words, and purchase log data, where each user is represented by the set of items the user purchased.
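To illustrate how cluster assignments induce matchings across domains of different sizes, here is a hypothetical sketch. The paper infers assignments by collapsed Gibbs sampling; as a stand-in, this sketch simply assigns each document to the nearest shared topic proportion vector by Euclidean distance, and all names and numbers are invented for illustration.

```python
import numpy as np

def match_by_cluster(docs_theta, cluster_theta):
    """docs_theta: dict mapping domain name -> (n_docs, n_topics) array of
    inferred per-document topic proportions; domains may contain different
    numbers of documents. Returns a dict mapping domain -> cluster id per
    document; documents from different domains with the same cluster id
    are considered matched."""
    assignments = {}
    for domain, theta in docs_theta.items():
        # squared Euclidean distance to every shared proportion vector
        d = ((theta[:, None, :] - cluster_theta[None, :, :]) ** 2).sum(-1)
        assignments[domain] = d.argmin(axis=1)
    return assignments

# Two shared topic proportion vectors over three topics
cluster_theta = np.array([[0.9, 0.05, 0.05],
                          [0.1, 0.8, 0.1]])
docs_theta = {
    "en": np.array([[0.85, 0.1, 0.05], [0.2, 0.7, 0.1], [0.8, 0.1, 0.1]]),
    "de": np.array([[0.15, 0.75, 0.1]]),   # different number of documents
}
assign = match_by_cluster(docs_theta, cluster_theta)
# English documents 0 and 2 share cluster 0; English document 1 and the
# German document share cluster 1, so they are matched.
```

Because matching is mediated by shared clusters rather than direct object pairs, many-to-many matchings and unequal domain sizes are handled naturally.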

CONCLUSION 

We proposed a topic model that finds cluster matchings without alignment information for discrete data with multiple domains. The proposed model has a set of topic proportion vectors shared among different languages. By assigning a topic proportion vector to each document, documents in all languages are clustered in a common space. Documents assigned to the same cluster are considered matched. In the experiments, we confirmed that the proposed model performs better than a combination of clustering and unsupervised object matching. We also showed that the proposed model can extract shared topics from real multilingual text data sets without dictionaries or parallel documents. For future work, we will extend the proposed model to a semi-supervised setting, where a small amount of correspondence information is available. In the proposed model, the number of topics and the concentration parameter are hyperparameters to be set by users.

REFERENCES 

[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,”  Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[2] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings  of the National Academy of Sciences, vol. 101 Suppl 1, pp.  5228–5235, 2004.

[3] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings  of Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289–  296.

[4] ——, “Collaborative filtering via Gaussian probabilistic latent  semantic analysis,” in Proceedings of the Annual International ACM  SIGIR Conference on Research and Development in Information Retrieval,  2003, pp. 259–266.

[5] D. M. Blei and M. I. Jordan, “Modeling annotated data,” in  Proceedings of the Annual International ACM SIGIR Conference on  Research and Development in Information Retrieval, 2003, pp. 127–134.

[6] L. Cao and L. Fei-Fei, “Spatially coherent latent topic model for concurrent object segmentation and classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.

[7] A. Tripathi, A. Klami, and S. Virpioja, “Bilingual sentence matching  using kernel CCA,” in MLSP ’10: Proceedings of the 2010 IEEE  International Workshop on Machine Learning for Signal Processing,  2010, pp. 130–135.

[8] R. Socher and L. Fei-Fei, “Connecting modalities: Semi-supervised  segmentation and annotation of images using unaligned text  corpora,” in Proceedings of the IEEE Conference on Computer Vision  and Pattern Recognition, ser. CVPR, 2010, pp. 966–973.

[9] B. Li, Q. Yang, and X. Xue, “Transfer learning for collaborative  filtering via a rating-matrix generative model,” in Proceedings of  the 26th Annual International Conference on Machine Learning, ser.  ICML ’09, 2009, pp. 617–624.

[10] I. P. Fellegi and A. B. Sunter, “A theory for record linkage,” Journal  of the American Statistical Association, vol. 64, no. 328, pp. 1183–  1210, 1969.

[11] N. Quadrianto, A. J. Smola, L. Song, and T. Tuytelaars, “Kernelized  sorting,” IEEE Trans. on Pattern Analysis and Machine Intelligence,  vol. 32, no. 10, pp. 1809–1821, 2010.

[12] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein, “Learning  bilingual lexicons from monolingual corpora,” in Proceedings of  ACL-08: HLT, 2008, pp. 771–779.

[13] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and  A. McCallum, “Polylingual topic models,” in Proceedings of the  2009 Conference on Empirical Methods in Natural Language Processing,  2009, pp. 880–889.

[14] D. Zhang, Q. Mei, and C. Zhai, “Cross-lingual latent topic extraction,”  in Proceedings of the 48th Annual Meeting of the Association for  Computational Linguistics, 2010, pp. 1128–1137.

[15] J. Jagarlamudi and H. Daumé III, “Extracting multilingual topics from unaligned comparable corpora,” in Advances in Information Retrieval. Springer, 2010, pp. 444–456.