Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong

TL;DR

This work investigates Program-of-Thought (PoT) prompting for cross-lingual and multilingual reasoning. It decouples reasoning from execution, then evaluates how fine-tuning affects the alignment between questions and reasoning steps and how reasoning quality translates into answer accuracy. It introduces an experimental framework and dataset variants to study question-reasoning (Q-R) alignment (P1) and the effect of reasoning quality on answers (R-A, P2), and finds that PoT fine-tuning substantially improves multilingual performance over CoT-fine-tuned baselines. The study reports a strong correlation between intermediate code quality (ICE-Score) and final answer accuracy, and shows that ICE-Score-guided Soft Self-Consistency can dramatically boost test-time performance, particularly for low-resource languages. Together, these results offer practical guidance on alignment strategies and inference-time improvements when deploying multilingual LLMs on reasoning-heavy tasks where execution is delegated to an external interpreter.

Abstract

Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
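The two ideas in the abstract, delegating execution to an interpreter and using a code-quality score as a test-time heuristic, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `run_program` and `score_weighted_vote`, the convention that each program stores its result in a variable called `answer`, the sample programs, and the numeric scores are all assumptions made for illustration. The scores stand in for a quality rating such as ICE-Score, and the sum-of-scores vote is only one plausible form of score-guided answer selection.

```python
from collections import defaultdict


def run_program(program: str):
    """Execute a generated program and return the value bound to `answer`.

    Returns None if the program raises or never assigns `answer`.
    """
    scope = {}
    try:
        exec(program, scope)  # unsandboxed: suitable for trusted toy inputs only
    except Exception:
        return None
    return scope.get("answer")


def score_weighted_vote(candidates):
    """Pick the answer whose supporting programs have the largest total score.

    `candidates` is a list of (program, score) pairs. The score is a
    hypothetical stand-in for a code-quality rating such as ICE-Score.
    """
    totals = defaultdict(float)
    for program, score in candidates:
        answer = run_program(program)
        if answer is not None:
            totals[answer] += score
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical sampled programs for the question:
# "Tom has 3 bags of 4 apples and eats 2. How many remain?"
samples = [
    ("bags = 3\nper_bag = 4\nanswer = bags * per_bag - 2", 3.5),
    ("answer = 3 + 4 - 2", 1.0),
    ("answer = 3 * 4 - 2", 3.0),
    ("answer = 3 * (4 - 2)", 1.5),
    ("answer = 3 * 4 -", 0.5),  # syntax error, so this sample is discarded
]
print(score_weighted_vote(samples))
```

In this sketch the model only has to produce a correct program; the arithmetic is carried out by the Python interpreter, and samples that fail to execute simply contribute nothing to the vote.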

Paper Structure

This paper contains 28 sections, 8 equations, 7 figures, 16 tables.

Figures (7)

  • Figure 1: Proposed experimental framework under the PoT workflow $Q \rightarrow R \rightarrow A$. P1: Aligning multilingual questions (Q) with reasoning steps (R) through fine-tuning and inline comments. P2: Assessing the correlation between reasoning steps (R) and final answers (A) through code quality and test-time inference.
  • Figure 2: The generation pipeline for GSM8KPoT, in which a PoT answer ($\boldsymbol{R}_i^{\text{en}}$) is synthesized using the Oracle LLM, with additional natural language reasoning ($\boldsymbol{C}_i^{\text{en}}$) provided as guidance.
  • Figure 3: The relationship between code quality and answer accuracy in cross-lingual and multilingual settings. Each point represents a given language, considering a specific system and model combination.
  • Figure 4: Zero-shot PoT prompt template for PoT synthesis, where [Question] serves as a placeholder for the problem statement.
  • Figure 5: Few-shot PoT prompt template for PoT synthesis, with exemplars adapted from palpot.
  • ...and 2 more figures