Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping

Wenhao Zhu, Sizhe Liu, Shujian Huang, Shuaijie She, Chris Wendler, Jiajun Chen

TL;DR

This paper addresses multilingual text generation quality by diagnosing why the contrastive decoding method DoLa fails on non-English tasks: the early-exit logits and the final logits are in mismatched languages. It introduces language-agnostic layer skipping, with two strategies (SL-H and SL-D), which preserves the upper transformer blocks and yields more helpful amateur logits. The method is formalized as skipping a span of layers $[m, n)$ and computing $p_a = f_{out}(h_N)$. Experiments on the multilingual mGSM benchmark across 11 languages show that the approach outperforms DoLa and reduces the need for a separate amateur model, improving chain-of-thought reasoning across diverse LLMs. By connecting results to the three-phase language-transition pattern in LLMs, the work provides practical guidance for robust multilingual decoding and highlights potential trade-offs in inference cost.

Abstract

Decoding by contrasting layers (DoLa) is designed to improve the generation quality of large language models (LLMs) by contrasting the prediction probabilities between an early-exit output (amateur logits) and the final output (expert logits). However, we find that this approach does not work well on non-English tasks. Inspired by previous interpretability work on language transition during the model's forward pass, we discover that this issue arises from a language mismatch between the early-exit output and the final output. In this work, we propose an improved contrastive decoding algorithm that is effective for diverse languages beyond English. To obtain more helpful amateur logits, we devise two strategies to skip a set of bottom, language-agnostic layers based on our preliminary analysis. Experimental results on multilingual reasoning benchmarks demonstrate that our proposed method outperforms previous contrastive decoding baselines and substantially improves LLMs' chain-of-thought reasoning accuracy across 11 languages. The project will be available at: https://github.com/NJUNLP/SkipLayerCD.
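
The idea of obtaining amateur logits by skipping a span of layers and then contrasting them with the expert logits can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: the "layers" are random residual maps rather than transformer blocks, the span `(1, 3)` is fixed by hand rather than chosen by SL-H or SL-D, and the scoring rule (log-probability difference restricted to tokens the expert finds plausible, with a hypothetical threshold `alpha`) follows the common contrastive decoding formulation rather than anything confirmed by this summary. All function names are illustrative.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over a 1-D logit vector.
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def forward(h, layers, f_out, skip=None):
    """Run the layer stack; layers with index in [m, n) are skipped."""
    m, n = skip if skip is not None else (0, 0)
    for i, layer in enumerate(layers):
        if m <= i < n:
            continue  # skipped layer: the hidden state passes through unchanged
        h = layer(h)
    return f_out(h)  # logits computed from the final hidden state h_N

def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Assumed scoring rule: log p_expert - log p_amateur on plausible tokens."""
    log_pe = log_softmax(expert_logits)
    log_pa = log_softmax(amateur_logits)
    # Keep only tokens whose expert probability is at least alpha * max prob.
    plausible = log_pe >= np.log(alpha) + np.max(log_pe)
    return np.where(plausible, log_pe - log_pa, -np.inf)

# Toy stand-in for a model: 6 residual layers and a linear output head.
rng = np.random.default_rng(0)
d, vocab, num_layers = 8, 5, 6
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(num_layers)]
layers = [lambda h, W=W: h + np.tanh(h @ W) for W in weights]
W_out = rng.normal(size=(d, vocab))
f_out = lambda h: h @ W_out

h0 = rng.normal(size=d)
expert = forward(h0, layers, f_out)                  # full forward pass
amateur = forward(h0, layers, f_out, skip=(1, 3))    # skip layers 1 and 2
scores = contrastive_scores(expert, amateur)
next_token = int(np.argmax(scores))
```

Because the amateur pass still runs the upper layers and the same output head, its logits live in the same output space as the expert's, which is the property the abstract contrasts with early exit.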

Paper Structure (25 sections, 3 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Illustration of the superiority of our proposed layer skipping contrastive decoding algorithm over direct inference and DoLa.
  • Figure 2: The proportion of Chinese tokens generated at each layer of Mistral-7B when solving the Chinese part of the mGSM task with chain-of-thought.
  • Figure 3: Illustration of our devised contrastive decoding approach. The idea of the line chart and the three-phase division are borrowed from the work of Wendler et al. (2024). In the line chart, "probability" denotes the token generation probability and "entropy" denotes the entropy of the prediction distribution.