Anti-LM Decoding for Zero-shot In-context Machine Translation

Suzanna Sia; Alexandra DeLucia; Kevin Duh

TL;DR

This work tackles zero-shot in-context translation with large language models by addressing the strong prior bias toward source-language continuations. It introduces Anti-LM decoding with an exponential decay, formalized as $\mathrm{ALM}(x) = \log p(y_t|y_{<t}, x, u) - \gamma^{t} \log p(y_1|x)$, which penalizes continuing the input sentence and thereby reduces non-translated outputs. Across three model families (XGLM, Bloom, Llama 2), three language directions, and both greedy and beam-search decoding, Anti-LM outperforms competitive contrastive objectives (e.g., PMI-based methods), achieving up to $\approx 20$ BLEU point gains in some settings and notably addressing the “failure to translate” cases. The method requires only a single contrastive-logit computation per source sentence, offering low latency while improving translation faithfulness, especially when prompts are not optimally crafted. Overall, the study demonstrates that calibrated decoding objectives can substantially enhance zero-shot translation performance with minimal computational overhead and without additional training or ensembling.
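
The Anti-LM objective above can be illustrated with a small NumPy sketch on toy logits. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 4-token vocabulary, the logit values, the choice $\gamma = 0.3$, and the 1-indexed timestep $t$ are all hypothetical. In practice the two logit vectors would come from a language model, with $\log p(y_1|x)$ computed once per source sentence and reused at every step.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def anti_lm_scores(cond_logits, src_logits, t, gamma=0.3):
    """ALM score per vocabulary item at decoding step t (assumed 1-indexed).

    cond_logits: logits for p(y_t | y_<t, x, u), i.e. with instructions u
    src_logits:  logits for p(y_1 | x), i.e. continuing the source x alone
    """
    cond_logp = log_softmax(cond_logits)
    src_logp = log_softmax(src_logits)  # computed once per source sentence
    return cond_logp - (gamma ** t) * src_logp

# Toy example: token 0 stands in for "keep continuing the source sentence".
cond_logits = np.array([2.0, 1.8, 0.1, -1.0])
src_logits = np.array([3.0, 0.0, 0.0, 0.0])

greedy_default = int(np.argmax(log_softmax(cond_logits)))
greedy_alm_first = int(np.argmax(anti_lm_scores(cond_logits, src_logits, t=1)))
greedy_alm_late = int(np.argmax(anti_lm_scores(cond_logits, src_logits, t=20)))
```

With these toy numbers, the default objective picks token 0, the Anti-LM score at the first step picks token 1, and by step 20 the penalty has decayed to nearly zero, so the choice matches the default again. This mirrors the idea that the penalty matters most at the start of generation.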

Abstract

Zero-shot in-context learning is the phenomenon where models can perform a task given only the instructions. However, pre-trained large language models are known to be poorly calibrated for this task. One of the most effective approaches to handling this bias is to adopt a contrastive decoding objective, which accounts for the prior probability of generating the next token by conditioning on some context. This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of in-context machine translation. We conduct our experiments across 3 model types and sizes, 3 language directions, and for both greedy decoding and beam search ($B=5$). The proposed method outperforms other state-of-the-art decoding objectives, with up to $20$ BLEU point improvement over the default objective observed in some settings.

Paper Structure (43 sections, 4 equations, 4 figures, 15 tables)

Introduction
Related Work and Background
Zero-shot MT.
Contrastive Decoding
Method
Problem Formulation
PMI Decoding (Previous Work)
Conditional $\mathrm{PMI}$ Decoding
Anti-LM Contrastive Decoding
Latency.
Experiments
Decoding Objectives
Models.
Data and Evaluation.
Generation Settings.
...and 28 more sections

Figures (4)

Figure 1: Number of generated sentences that are not in the target language (French) given the task "Translate English to French", which indicates a failure to translate.
Figure 2: Proportion of failure to translate vs successful cases averaged across all models. For successful cases, we compute whether the Anti-LM objective improves, degrades, or does not affect the performance.
Figure 3: $\gamma$ sweep for $\texttt{en} \rightarrow \texttt{pt}$.
Figure 4: Weighting by different options for the decay function in the Anti-LM objective, starting at 0.3 and ending at 0. Aside from exponential, all decay functions have an additional parameter specifying when the function reaches 0; we set this to 10 timesteps.
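
The decay weighting described for Figure 4 can be sketched as follows. The captions do not give the functional forms of the non-exponential decays, so the linear schedule here is purely a hypothetical stand-in, as are the function names and the 1-indexed timestep convention. Only the starting weight (0.3) and the zero point (10 timesteps) are taken from the caption.

```python
def exponential_weight(t, gamma=0.3):
    """Exponential decay gamma**t; equals 0.3 at t=1 (t assumed 1-indexed)."""
    return gamma ** t

def linear_weight(t, start=0.3, zero_at=10):
    """Hypothetical linear decay: `start` at t=1, exactly 0 from t=zero_at on."""
    frac = (t - 1) / (zero_at - 1)
    return max(0.0, start * (1.0 - frac))

# Weights for the first 12 decoding steps under each schedule.
steps = list(range(1, 13))
exp_weights = [exponential_weight(t) for t in steps]
lin_weights = [linear_weight(t) for t in steps]
```

The exponential schedule needs no extra parameter because it approaches 0 on its own, whereas a schedule like the linear one must be told when to reach 0.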
