Levée d'ambiguïtés par grammaires locales (Disambiguation by Local Grammars)

Eric G. C. Laporte

TL;DR

This work addresses part-of-speech (POS) disambiguation with a modular framework of local grammars implemented as finite-state transducers. Ambiguous tag sequences are represented as acyclic automata, and hand-crafted local grammars are applied after an initial tagging pass, with the goal of never discarding a correct tag (zero silence rate). The paper formalizes how to verify a grammar, showing that the interactions between transducer paths must be considered and that combining grammars can both remove and introduce errors. It argues that grammars must be tested thoroughly, because unforeseen constructions and ambiguities can invalidate grammatical intuitions, and it presents the approach as implemented in the INTEX system, which relies on large morphological dictionaries and linguistic knowledge for French POS tagging. The method offers a testable framework for combining morphology, syntax, and lexical ambiguity so that disambiguation remains reliable in practical NLP pipelines.

Abstract

Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS arises frequently in natural language processing, for example in spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, and text corpus analysis. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or when the uniquely correct solution cannot be found. These systems aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that assign a single tag to each word. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein's INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: their interactions must also be verified. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often because of an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This calls for a detailed specification of what a grammar will do once applied to texts.
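To make the setting concrete, here is a minimal Python sketch of the idea described in the abstract: the ambiguous tagging of a sentence is stored as an acyclic automaton, a constraint removes some of its paths, and a test checks that the correct analysis survives (zero silence). This is not the INTEX formalism or the paper's transducer construction. The tag set, the candidate tags for each word, the `FORBIDDEN` tag pair, and all function names are illustrative assumptions.

```python
# Toy acyclic automaton for "Il traverse le chemin de fer".
# Each edge is (source_state, target_state, form, tag).
# The candidate tags are hand-picked for illustration, not dictionary output.
EDGES = [
    (0, 1, "Il", "PRO"),
    (1, 2, "traverse", "V"),
    (1, 2, "traverse", "N"),
    (2, 3, "le", "DET"),
    (2, 3, "le", "PRO"),
    (3, 4, "chemin", "N"),
    (4, 5, "de", "PREP"),
    (5, 6, "fer", "N"),
    (3, 6, "chemin de fer", "N"),  # compound noun spanning three words
]
FINAL = 6


def paths(edges, state=0, final=FINAL):
    """Enumerate every path from `state` to `final` as a tuple of (form, tag)."""
    if state == final:
        yield ()
        return
    for src, dst, form, tag in edges:
        if src == state:
            for rest in paths(edges, dst, final):
                yield ((form, tag),) + rest


# Hypothetical constraint: a pronoun is never immediately followed by a noun.
# It is deliberately naive; a rule like this would need testing on real text.
FORBIDDEN = {("PRO", "N")}


def violates(path, forbidden=FORBIDDEN):
    """Return True if two adjacent tags on the path form a forbidden pair."""
    tags = [tag for _, tag in path]
    return any(pair in forbidden for pair in zip(tags, tags[1:]))


def disambiguate(edges):
    """Keep only the paths that satisfy the constraint."""
    return [p for p in paths(edges) if not violates(p)]


# The analysis assumed to be correct for this sentence.
GOLD = (("Il", "PRO"), ("traverse", "V"), ("le", "DET"), ("chemin de fer", "N"))

kept = disambiguate(EDGES)
# Zero-silence check: the correct analysis must not have been discarded.
assert GOLD in kept
```

With these assumptions the automaton has eight paths and the constraint keeps two of them: the compound-noun analysis and the word-by-word analysis of "chemin de fer". The final assertion is the kind of regression test the abstract argues for: a rule is acceptable only if it never removes the correct path.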

Paper Structure

This paper contains 15 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: an acyclic automaton for "Il traverse le chemin de fer".
  • Figure 2: the transducer $T_1$.
  • Figure 3: the transducer $T_2$.
  • Figure 4: the transducer $T_3$.
  • Figure 5: the transducer $T_2 | T_3$.
  • ...and 4 more figures