Topic Models for Unsupervised Cluster Matching
Abstract
We propose topic models for unsupervised cluster matching, the task of finding a matching between clusters in different domains without correspondence information. For example, the proposed model finds correspondences between document clusters in English and German without alignment information such as dictionaries or parallel sentences/documents. The proposed model assumes that documents in all languages share a common latent topic structure, and that there is a potentially infinite number of topic proportion vectors in a latent topic space shared by all languages. Each document is generated using one of the topic proportion vectors together with language-specific word distributions. By inferring the topic proportion vector used for each document, we can allocate documents in different languages to common clusters, where each cluster is associated with a topic proportion vector. Documents assigned to the same cluster are considered matched. We develop an efficient inference procedure for the proposed model based on collapsed Gibbs sampling. The effectiveness of the proposed model is demonstrated on real data sets, including multilingual corpora of Wikipedia articles and product reviews.
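To make this generative story concrete, the following Python sketch simulates it under illustrative assumptions: the number of topics, the cluster truncation, the vocabulary sizes, the symmetric Dirichlet parameters alpha and beta, and the uniform cluster prior are all placeholders, not the paper's exact specification (in particular, the model places a prior over a potentially infinite set of clusters, which the sketch truncates to a fixed number).

# Minimal sketch of the generative process described above.
# Hyperparameter names and values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_topics = 5                             # K: number of latent topics
n_clusters = 8                           # truncation of the infinite clusters
vocab_sizes = {"en": 1000, "de": 1200}   # language-specific vocabularies

# Topic proportion vectors shared by ALL languages (one per cluster).
alpha = 0.1
theta = rng.dirichlet(alpha * np.ones(n_topics), size=n_clusters)

# Language-specific word distributions, one per (language, topic) pair.
beta = 0.01
phi = {lang: rng.dirichlet(beta * np.ones(v), size=n_topics)
       for lang, v in vocab_sizes.items()}

def generate_document(lang, n_words=50):
    """Draw a cluster, then words via shared topics + language-specific phi."""
    j = rng.integers(n_clusters)   # cluster assignment (simplified: uniform
                                   # here instead of the model's actual prior)
    topics = rng.choice(n_topics, size=n_words, p=theta[j])
    words = [rng.choice(vocab_sizes[lang], p=phi[lang][k]) for k in topics]
    return j, words

cluster_en, doc_en = generate_document("en")
cluster_de, doc_de = generate_document("de")
# Documents drawn with the same cluster index j share the topic proportion
# vector theta[j]; recovering that shared index is what "matching" means here.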
Existing System
A number of methods have been proposed for unsupervised object matching, also called cross-domain object matching, such as kernelized sorting and its convex extension, least squares object matching, matching canonical correlation analysis, and Bayesian object matching. These methods find a matching by sorting objects so as to maximize the dependence between domains. For example, kernelized sorting uses the Hilbert-Schmidt Independence Criterion (HSIC) as a dependence measure, and sorts objects by maximizing HSIC between objects in two domains using the Hungarian algorithm. These methods have three limitations. First, they find only one-to-one matchings. Second, the number of domains needs to be two. Third, each domain must contain the same number of objects. The proposed model, in contrast, does not have these limitations: it finds cluster matchings in data with more than two domains, and each domain can contain a different number of objects.
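For intuition, the sketch below shows one common way such a kernelized-sorting scheme can be implemented: alternately linearize the HSIC objective around the current permutation and solve the resulting assignment problem with the Hungarian algorithm (scipy's linear_sum_assignment). The kernel choice, initialization, and stopping rule are illustrative assumptions; the published method differs in details such as annealing and initialization.

# Rough sketch of kernelized sorting: iteratively maximize
# HSIC ~ tr(K P L P^T) over permutation matrices P.
import numpy as np
from scipy.optimize import linear_sum_assignment

def rbf_kernel(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernelized_sorting(X, Y, n_iter=50, seed=0):
    """Match rows of X to rows of Y (same length!) by maximizing HSIC."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K = H @ rbf_kernel(X) @ H             # centered kernel, domain 1
    L = H @ rbf_kernel(Y) @ H             # centered kernel, domain 2
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)             # random initial matching
    for _ in range(n_iter):
        P = np.eye(n)[perm]               # current permutation matrix
        # Linearize tr(K P L P^T) around the current P; the update is a
        # linear assignment problem solved by the Hungarian algorithm.
        profit = K @ P @ L
        _, new_perm = linear_sum_assignment(-profit)
        if np.array_equal(new_perm, perm):
            break
        perm = new_perm
    return perm   # perm[i] = index in Y matched to X[i]

Note that the sketch inherits exactly the limitations listed above: it produces a one-to-one matching, handles only two domains, and requires both domains to contain the same number of objects.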
Proposed System
The proposed model is an unsupervised method for cluster matching, the task of finding a matching between clusters in different domains when neither correspondence nor cluster information is available. For example, the proposed model finds correspondences between document clusters in English and German without alignment information such as dictionaries or parallel sentences/documents. Here, a parallel sentence/document means an English sentence/document with its German translation attached. Although in this paper we assume that the given data are text documents in multiple languages, where each language corresponds to a domain, the proposed model is applicable to a wide range of discrete data, such as image data, where each image is represented by visual words, and purchase log data, where each user is represented by the set of items the user purchased.
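To illustrate this generality, the hypothetical toy data below show the only input format the model needs: each domain supplies objects as bags of token IDs over its own vocabulary, and domains may differ in vocabulary size and in number of objects.

# Hypothetical multi-domain discrete data. Token IDs index each
# domain's own vocabulary; nothing is aligned across domains.
corpus = {
    # domain "en": documents as word-ID lists over the English vocabulary
    "en": [[3, 17, 17, 42], [8, 3, 99]],
    # domain "de": word IDs over the (separate) German vocabulary
    "de": [[5, 5, 61], [12, 7, 7, 80]],
    # domain "purchases": users as lists of purchased item IDs
    "purchases": [[2, 2, 14], [0, 31, 31, 9]],
}
# The domains are tied together only through the shared set of
# topic proportion vectors, never through their vocabularies.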
CONCLUSION
We proposed a topic model that finds a cluster matching without alignment information for discrete data with multiple domains. The proposed model has a set of topic proportion vectors shared among different languages. By assigning a topic proportion vector to each document, documents in all languages are clustered in a common space. Documents assigned to the same cluster are considered matched. In the experiments, we confirmed that the proposed model can perform better than a combination of clustering and unsupervised object matching. We also showed that the proposed model can extract shared topics from real multilingual text data sets without dictionaries or parallel documents. For future work, we will extend the proposed model to a semi-supervised setting, where a small amount of correspondence information is available. In the proposed model, the number of topics and the concentration parameter are hyperparameters to be set by the user.
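As a simplified illustration of the matching step (not the paper's collapsed Gibbs sampler, which integrates the parameters out), suppose we had point estimates of the shared topic proportions theta and the language-specific word distributions phi, as in the sketch after the abstract; each document could then be assigned to its most probable cluster.

# Simplified hard-assignment illustration of cluster matching, assuming
# point estimates theta (J x K) and phi[lang] (K x V_lang) are available.
import numpy as np

def assign_cluster(doc, lang, theta, phi):
    """Return argmax_j sum_w log( sum_k theta[j,k] * phi[lang][k,w] )."""
    word_probs = theta @ phi[lang]        # (J, V_lang): p(word | cluster)
    log_lik = np.log(word_probs[:, doc]).sum(axis=1)   # one score per cluster
    return int(np.argmax(log_lik))

# Documents from different languages that receive the same cluster index,
# e.g. assign_cluster(doc_en, "en", ...) == assign_cluster(doc_de, "de", ...),
# are the ones treated as matched.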
REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[2] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National Academy of Sciences, vol. 101, suppl. 1, pp. 5228–5235, 2004.
[3] T. Hofmann, “Probabilistic latent semantic analysis,” in Proceedings of Conference on Uncertainty in Artificial Intelligence, 1999, pp. 289–296.
[4] ——, “Collaborative filtering via Gaussian probabilistic latent semantic analysis,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 259–266.
[5] D. M. Blei and M. I. Jordan, “Modeling annotated data,” in Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 127–134.
[6] L. Cao and L. Fei-Fei, “Spatially coherent latent topic model for concurrent object segmentation and classification,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
[7] A. Tripathi, A. Klami, and S. Virpioja, “Bilingual sentence matching using kernel CCA,” in MLSP ’10: Proceedings of the 2010 IEEE International Workshop on Machine Learning for Signal Processing, 2010, pp. 130–135.
[8] R. Socher and L. Fei-Fei, “Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR, 2010, pp. 966–973.
[9] B. Li, Q. Yang, and X. Xue, “Transfer learning for collaborative filtering via a rating-matrix generative model,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09, 2009, pp. 617–624.
[10] I. P. Fellegi and A. B. Sunter, “A theory for record linkage,” Journal of the American Statistical Association, vol. 64, no. 328, pp. 1183–1210, 1969.
[11] N. Quadrianto, A. J. Smola, L. Song, and T. Tuytelaars, “Kernelized sorting,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 32, no. 10, pp. 1809–1821, 2010.
[12] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein, “Learning bilingual lexicons from monolingual corpora,” in Proceedings of ACL-08: HLT, 2008, pp. 771–779.
[13] D. Mimno, H. M. Wallach, J. Naradowsky, D. A. Smith, and A. McCallum, “Polylingual topic models,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 880–889.
[14] D. Zhang, Q. Mei, and C. Zhai, “Cross-lingual latent topic extraction,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 1128–1137.
[15] J. Jagarlamudi and H. Daumé III, “Extracting multilingual topics from unaligned comparable corpora,” in Advances in Information Retrieval. Springer, 2010, pp. 444–456.