Tagging English text with a probabilistic model

Mérialdo, Bernard
Computational Linguistics, volume 20, issue 2

In this paper we present some experiments on the use of a probabilistic model to tag English text, i.e. to assign to each word the correct tag (part of speech) in the context of the sentence. The main novelty of these experiments is the use of untagged text in the training of the model. We have used a simple triclass Markov model and are looking for the best way to estimate the parameters of this model, depending on the kind and amount of training data provided. Two approaches in particular are compared and combined: using text that has been tagged by hand to compute relative frequency counts, and using text without tags to train the model as a hidden Markov process, according to a Maximum Likelihood principle. Experiments show that the best training is obtained by using as much tagged text as possible. They also show that Maximum Likelihood training, the procedure routinely used to estimate Hidden Markov Model parameters from training data, will not necessarily improve the tagging accuracy. In fact, it will generally degrade this accuracy, except when only a limited amount of hand-tagged text is available.
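
As an illustration of the first approach named in the abstract (relative frequency counts from hand-tagged text), here is a minimal sketch in Python. It is not the paper's implementation: the function name `train_relative_frequency`, the `<s>` padding tag, the toy corpus, and the absence of any smoothing are all assumptions made for the example.

```python
from collections import Counter

def train_relative_frequency(tagged_sentences):
    """Estimate triclass (tag trigram) and lexical probabilities by
    relative frequency from hand-tagged sentences.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns two dicts:
      trans[(t1, t2, t3)] ~ P(t3 | t1, t2)
      lex[(word, tag)]    ~ P(word | tag)
    Hypothetical helper; no smoothing is applied.
    """
    tri_counts = Counter()   # counts of tag trigrams
    ctx_counts = Counter()   # counts of the two-tag contexts
    pair_counts = Counter()  # counts of (word, tag) pairs
    tag_counts = Counter()   # counts of individual tags

    for sentence in tagged_sentences:
        # Pad with two start symbols so the first word has a full context.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            context = (tags[i - 2], tags[i - 1])
            tri_counts[context + (tags[i],)] += 1
            ctx_counts[context] += 1
        for word, tag in sentence:
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1

    trans = {tri: n / ctx_counts[tri[:2]] for tri, n in tri_counts.items()}
    lex = {(w, t): n / tag_counts[t] for (w, t), n in pair_counts.items()}
    return trans, lex

# Toy hand-tagged corpus (invented for this example).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
trans, lex = train_relative_frequency(corpus)
print(trans[("DET", "NOUN", "VERB")])  # estimate of P(VERB | DET, NOUN)
print(lex[("dog", "NOUN")])            # estimate of P(dog | NOUN)
```

The second approach in the abstract, Maximum Likelihood training on untagged text, would re-estimate these same two tables iteratively rather than from direct counts. That step is not sketched here.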

Data Science
Copyright ACL. Personal use of this material is permitted. The definitive version of this paper was published in Computational Linguistics, volume 20, issue 2.

PERMALINK : https://www.eurecom.fr/publication/136