
Constrained Decoding for Cross-lingual Label Projection

Duong Minh Le, Yang Chen, Alan Ritter, Wei Xu

TL;DR

Codec addresses the deterioration of translation quality in label projection for cross-lingual span labeling by separating translation from marker insertion and applying constrained decoding. It formalizes label projection as a constrained generation problem and introduces a practical approximation with pruning, top-$k$ hypothesis search, and re-ranking to project span-level annotations from high-resource languages to low-resource ones. Across NER and EAE tasks in 20 languages, Codec consistently outperforms marker-based and alignment-based baselines, with translate-test often yielding larger gains. The approach enables strong translate-train and translate-test cross-lingual transfer, preserving translation quality while ensuring correct marker placement and span mappings, thus improving fine-grained cross-lingual labeling in low-resource settings.
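
The constraint behind "constrained generation" can be illustrated with a minimal sketch: a hypothesis is acceptable only if removing its markers reproduces the fixed translation exactly and its markers form the expected number of well-formed pairs. The function names, the toy sentence, and the restriction to non-nested, non-empty spans are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple

OPEN, CLOSE = "[", "]"  # marker symbols as shown in the paper's Figure 2


def strip_markers(tokens: List[str]) -> List[str]:
    """Drop all marker tokens, leaving only the translated words."""
    return [t for t in tokens if t not in (OPEN, CLOSE)]


def is_valid_projection(hyp: List[str], template: List[str], n_spans: int) -> bool:
    """Check the (assumed) constraints on a marked hypothesis.

    1. Without markers, the hypothesis equals the translation template.
    2. Markers form exactly n_spans pairs.
    3. Pairs are non-nested and non-empty (a simplifying assumption).
    """
    if strip_markers(hyp) != template:
        return False
    inside, pairs, span_len = False, 0, 0
    for tok in hyp:
        if tok == OPEN:
            if inside:            # nested opening marker
                return False
            inside, span_len = True, 0
        elif tok == CLOSE:
            if not inside or span_len == 0:  # unmatched or empty span
                return False
            inside = False
            pairs += 1
        elif inside:
            span_len += 1
    return not inside and pairs == n_spans


def extract_spans(hyp: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets, end exclusive, into the unmarked text."""
    spans, start, idx = [], 0, 0
    for tok in hyp:
        if tok == OPEN:
            start = idx
        elif tok == CLOSE:
            spans.append((start, idx))
        else:
            idx += 1
    return spans


# Toy example; the sentence is illustrative only.
template = ["Faransi", "ni", "Mali"]
marked = ["[", "Faransi", "]", "ni", "[", "Mali", "]"]
valid = is_valid_projection(marked, template, n_spans=2)
spans = extract_spans(marked)
```

Because the translation is fixed before any marker is placed, a check of this kind can only accept or reject marker positions; it cannot alter the translated words, which is how separating the two steps protects translation quality.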

Abstract

Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve the performance by (1) translating training data that is available in a high-resource language (e.g., English) together with the gold labels into low-resource languages, and/or (2) translating test data in low-resource languages to a high-resource language to run inference on, then projecting the predicted span-level labels back onto the original test data. However, state-of-the-art marker-based label projection methods suffer from translation quality degradation due to the extra label markers injected in the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome the aforementioned issues. Our new method can not only preserve the quality of translated texts but is also versatile enough to apply to both the translate-train and translate-test strategies. This versatility is crucial, as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment.

Paper Structure

This paper contains 39 sections, 7 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: The goal of label projection is to automatically construct annotated data in a low-resource language (e.g., Bambara) by translating annotated data from a high-resource language (e.g., English) while preserving the span-level labels. EasyProject (Left): Marker pairs are first inserted around entities in the source sentence, which is then translated to the target language; a span-matching step maps each span to the corresponding label (i.e., entity type). This method suffers from translation quality issues (e.g., the word "France" is not properly translated) because markers are present in the input to the translation model. Codec (Right): The source sentence, with no markers injected, is first translated to the target language; Codec then performs constrained decoding to insert markers into the translated sentence.
  • Figure 2: The first two steps of Codec. Step 1 (Left): Codec prunes search branches based on unlikely opening-marker positions in the target language by comparing probabilities conditioned on the source sentence with and without span markers inserted. Step 2 (Right): A branch-and-bound search algorithm is used to find the $k$ hypotheses with the highest probabilities ($k=3$). From each node of the search tree (e.g., "Faransi"), Codec expands to either the next token from the translation template $y^{tmpl}$ (e.g., "ni") or a marker (i.e., "[" or "]"). A search branch is pruned if its score falls below a heuristic lower bound. Two branches of different lengths may have different lower bounds (e.g., the lower bounds for the top and bottom branches are -3.1 and -2.8, respectively).
  • Figure 3: Ablation study of different search settings on the MasakhaNER2.0 dev set for five languages: Bambara (bam), Fon (fon), Mossi (mos), Yoruba (yor), and isiZulu (zul). exact (+re-rank): exact search with re-ranking; exact: exact search that returns the top-1 hypothesis. '$\delta$' is the hyperparameter of the heuristic lower bound. '+ [' indicates pruning unlikely opening-marker positions beforehand. Compared to constrained decoding with exact search, Codec with $\delta=3$ significantly reduces the decoding time while retaining the performance measured by F1 scores.
  • Figure 4: Examples of using different approaches to project label spans from English to low-resource languages (i.e., chiShona (middle) and isiZulu (bottom)) in the translate-train setting for cross-lingual NER. In each example, label spans in the English data and their corresponding projections in the target language have the same color, and projection errors are underlined. In the two examples: (1) EasyProject incorrectly splits some words and marks only part of them as an entity (e.g., "Pakistan" instead of "nePakistan"); (2) Awes-align cannot project all label spans and incorrectly maps "China" to "Imithombo" in the second example; (3) Codec produces the correct projections in both examples.
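
The search described in the Figure 2 caption can be sketched in a few lines. What follows is a simplified illustration under stated assumptions, not the authors' implementation: it inserts a single marker pair, uses a hypothetical `toy_logprob` scorer in place of a translation model, and replaces the paper's heuristic lower bound (with its $\delta$ hyperparameter) by a plain exact bound, namely the score of the current $k$-th best complete hypothesis. That bound is safe because log-probabilities are never positive, so a partial score can only decrease as tokens are added. The Step 1 pruning of opening-marker positions and the re-ranking step are omitted.

```python
import heapq
import math
from typing import Callable, List, Tuple

OPEN, CLOSE = "[", "]"
ScoreFn = Callable[[List[str], str], float]  # log p(token | prefix), always <= 0


def topk_marker_search(
    template: List[str], score_fn: ScoreFn, k: int = 3
) -> List[Tuple[float, List[str]]]:
    """Find the k best ways to insert one marker pair into a fixed template."""
    best: List[Tuple[float, List[str]]] = []  # min-heap of complete hypotheses

    def lower_bound() -> float:
        # Stand-in bound: score of the k-th best complete hypothesis so far.
        return best[0][0] if len(best) == k else -math.inf

    def expand(i: int, state: int, span_len: int,
               prefix: List[str], score: float) -> None:
        # state: 0 = no marker yet, 1 = span open, 2 = span closed
        if score < lower_bound():
            return  # prune: this branch cannot reach the top k
        if i == len(template) and state == 2:
            if len(best) < k:
                heapq.heappush(best, (score, prefix))
            else:
                heapq.heappushpop(best, (score, prefix))
            return
        moves = []
        if state == 0 and i < len(template):
            moves.append((OPEN, i, 1, 0))
        if state == 1 and span_len > 0:
            moves.append((CLOSE, i, 2, 0))
        if i < len(template):  # next token is forced by the template
            moves.append((template[i], i + 1, state,
                          span_len + (1 if state == 1 else 0)))
        for tok, ni, ns, nl in moves:
            expand(ni, ns, nl, prefix + [tok], score + score_fn(prefix, tok))

    expand(0, 0, 0, [], 0.0)
    return sorted(best, reverse=True)


def toy_logprob(prefix: List[str], token: str) -> float:
    """Hypothetical scorer with made-up probabilities, for illustration only."""
    if token == OPEN:
        return math.log(0.6 if not prefix else 0.1)
    if token == CLOSE:
        return math.log({"Faransi": 0.7, "ni": 0.2}.get(prefix[-1], 0.1))
    return math.log(0.9)


# Toy template; the sentence is illustrative only.
results = topk_marker_search(["Faransi", "ni", "Mali"], toy_logprob, k=3)
hypotheses = [hyp for _, hyp in results]
```

With this toy scorer, the highest-scoring hypothesis places the pair around "Faransi". A real system would obtain the scores from a translation model conditioned on the marked source sentence.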