In this paper, we propose and experimentally evaluate a probabilistic approach to document classification. We consider the problem of automatically assigning a new article to a Usenet newsgroup. To model a newsgroup, we build a probabilistic language model that is assumed to generate the articles of this newsgroup. When a new article is presented, we use a maximum a posteriori (MAP) rule to decide whether the message was generated by this newsgroup or not. We evaluate this approach and compare it to a classification based on keywords. In these experiments, the probabilistic approach gives better recall and precision. The paper is structured as follows: we first present the problem of document classification in general terms. We then describe our application to newsgroup classification and present the data that we are using. We present first results for a classification based on keyword selection. Finally, we describe the probabilistic formulation of the problem, test this approach on the same data, and compare the results.
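The MAP decision described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the language model, so everything here is an assumption made for illustration: a unigram model with add-one smoothing, a second "background" model standing in for articles from outside the newsgroup, an equal prior, and toy data. The function names (`train_unigram`, `log_likelihood`, `map_decide`) are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def train_unigram(docs, vocab):
    """Estimate add-one-smoothed unigram log-probabilities from tokenized docs.

    Assumption: a unigram model; the paper's actual language model may differ.
    """
    counts = Counter(tok for doc in docs for tok in doc if tok in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def log_likelihood(doc, model):
    """Sum the log-probabilities of the tokens in doc under the model."""
    # Tokens outside the shared vocabulary are ignored (an assumption).
    return sum(model[tok] for tok in doc if tok in model)


def map_decide(doc, group_model, background_model, prior_group=0.5):
    """Return True if the MAP hypothesis is 'generated by the newsgroup'."""
    score_group = math.log(prior_group) + log_likelihood(doc, group_model)
    score_other = math.log(1 - prior_group) + log_likelihood(doc, background_model)
    return score_group > score_other


# Toy training data (invented for illustration, not from the paper).
group_docs = [["linux", "kernel", "patch"], ["kernel", "driver", "bug"]]
other_docs = [["recipe", "pasta", "sauce"], ["garden", "tomato", "sauce"]]
vocab = {tok for doc in group_docs + other_docs for tok in doc}

group_model = train_unigram(group_docs, vocab)
background_model = train_unigram(other_docs, vocab)

in_group = map_decide(["kernel", "bug", "patch"], group_model, background_model)
out_group = map_decide(["pasta", "sauce"], group_model, background_model)
```

In this sketch the decision compares the posterior score of the newsgroup model with that of the background model in log space, which avoids numerical underflow on long articles. The prior `prior_group` could instead be estimated from the fraction of training articles that belong to the newsgroup.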
A probabilistic approach to document classification
SIGIR 1995, 18th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, July 9-13, 1995, Seattle, Washington, USA
Poster / Demo
© ACM, 1995. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in SIGIR 1995, 18th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, July 9-13, 1995, Seattle, Washington, USA
PERMALINK : https://www.eurecom.fr/publication/169