What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani

TL;DR

This work probes the faithfulness and interpretability of multilingual chain-of-thought reasoning by applying step-wise ContextCite and token-level Inseq attribution to a 1.5B-parameter multilingual LLM (Qwen2.5 Instruct) on MGSM. By enforcing structured CoT generation, the study compares NoCoT and CoT conditions across English, French, German, Bengali, and Chinese, revealing that structured CoT substantially boosts accuracy for high-resource languages but struggles for Bengali due to tokenization and data sparsity. ContextCite analyses show attribution concentrates on the final reasoning step, particularly in incorrect responses, while token-level attribution indicates a progressive increase in step importance toward the end of the chain; perturbations like negation and distractors degrade both accuracy and attribution coherence. The findings highlight multilingual robustness and interpretability challenges in CoT prompting and suggest directions for more reliable cross-lingual reasoning and attribution practices, including larger models and cross-lingual prompting strategies.

Abstract

This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior work demonstrates the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods (ContextCite for step-level attribution and Inseq for token-level attribution) to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results yield three key findings: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy, primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
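
As an illustration of finding (1), the following minimal Python sketch shows one way to tally which reasoning step receives the highest step-level attribution score. It is not the authors' code: the function names `top_step_category` and `category_distribution`, the three category labels, and the input format (one list of per-step scores per generated chain) are assumptions made for this example, and the scores are taken as already computed by some step-level attribution method.

```python
from collections import Counter

CATEGORIES = ("first", "intermediate", "final")


def top_step_category(scores):
    """Classify the position of the highest-scoring step in one chain.

    `scores` is a hypothetical list with one attribution score per
    reasoning step, in generation order. Ties go to the earliest step.
    A single-step chain is labelled "first" (an assumption of this sketch).
    """
    if not scores:
        raise ValueError("need at least one step score")
    top = max(range(len(scores)), key=lambda i: scores[i])
    if len(scores) > 1 and top == len(scores) - 1:
        return "final"
    if top == 0:
        return "first"
    return "intermediate"


def category_distribution(all_scores):
    """Return the fraction of chains whose top step falls in each category."""
    counts = Counter(top_step_category(s) for s in all_scores)
    total = sum(counts.values())
    return {c: counts.get(c, 0) / total for c in CATEGORIES}


if __name__ == "__main__":
    # Made-up scores for three chains, purely for demonstration.
    demo = [[0.1, 0.2, 0.7], [0.5, 0.3, 0.2], [0.1, 0.3, 0.2, 0.4]]
    print(category_distribution(demo))
```

Computing this distribution separately for correct and incorrect generations would allow the comparison described in finding (1), under the assumptions stated above.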

Paper Structure

This paper contains 28 sections, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Preliminary evaluation results on the English portion of the MGSM dataset using the conditions described in Section \ref{sec:evaluation-contextcite}.
  • Figure 2: Accuracy results for Qwen Instruct on MGSM across five languages. Generating only a direct answer (NoCoT-Unstruct) results in low accuracy (<10%). Introducing structured CoT (CoT-Struct) dramatically boosts performance, most significantly for English. However, the improvement is minimal for Bengali.
  • Figure 3: Mean token count (left) and number of reasoning steps (right) produced by the model, with standard errors.
  • Figure 4: Distribution of the highest-attributed reasoning step category (First/Preamble, Intermediate, Final) for Qwen Instruct on MGSM, based on ContextCite scores across five languages.
  • Figure 5: Heat maps of baseline French and German attributions.
  • ...and 9 more figures