Automatic Clustering of e-Commerce Product Description
|Published in:||Issue 2, (Vol. 6) / 2012|
|Author(s):||SALEEM Al-SARRAYRIH Haytham, KNIPPING Lars, PETCU Carmen|
|Abstract.||Storing large amounts of digital information poses a real challenge: searching for a particular object within a tremendous amount of data is like looking for a needle in a haystack, and the growing size and diversity of stored data makes retrieving the needed information increasingly difficult. This research describes the use of clustering techniques and mathematical models for information retrieval over text documents. The study compares traditional clustering with clustering extended by Latent Semantic Analysis (LSA), applying both to a preprocessed text corpus using a weighted centroid clustering algorithm with cosine similarity as the measure of document correlation. LSA is assumed to improve clustering by bringing related words closer together in a conceptual space. The results show that clustering quality depends on the document representation and the similarity measure used. For short documents, LSA does not yield improved results compared to the traditional clustering techniques: the recall value is higher, because more related documents are returned, but the results are less accurate than with the traditional techniques.|
|Keywords:||Information Retrieval, Classification, Hierarchical Clustering, Latent Semantic Analysis.|
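The traditional (non-LSA) pipeline the abstract describes can be illustrated with a minimal sketch: term-frequency document vectors, cosine similarity as the correlation measure, and a cluster centroid computed as an average of member vectors. The tokenization, weighting scheme, and sample product descriptions below are illustrative assumptions, not the paper's actual preprocessing or corpus.

```python
import math
from collections import Counter

def tf_vector(text):
    # Illustrative preprocessing: lowercase and split on whitespace,
    # yielding a sparse term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Cluster centroid: mean term frequency over the cluster's documents.
    total = Counter()
    for v in vectors:
        total.update(v)
    n = len(vectors)
    return Counter({t: f / n for t, f in total.items()})

# Hypothetical short product descriptions (the paper's domain).
docs = ["red cotton shirt", "blue cotton shirt", "stainless steel kettle"]
vecs = [tf_vector(d) for d in docs]

print(round(cosine(vecs[0], vecs[1]), 3))  # the two shirts overlap in terms
print(round(cosine(vecs[0], vecs[2]), 3))  # shirt vs. kettle share no terms
```

In the LSA variant, the same cosine measure would instead be applied to document vectors projected into a lower-dimensional conceptual space obtained via singular value decomposition of the term-document matrix.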
This article is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License.