Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Patomporn Payoungkhamdee, Pume Tuchinda, Jinheon Baek, Samuel Cahyawijaya, Can Udomcharoenchaikit, Potsawee Manakul, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong
TL;DR
This work studies Program-of-Thought (PoT) prompting for cross-lingual and multilingual reasoning. Because PoT decouples reasoning from execution, two questions can be examined separately: how fine-tuning affects question-reasoning (Q-R) alignment (P1), and how reasoning quality translates into answer accuracy, the reasoning-answer (R-A) link (P2). The paper introduces an experimental framework and dataset variants for both questions and finds that PoT fine-tuning substantially improves multilingual performance over CoT fine-tuned baselines. It also reports a strong correlation between intermediate code quality, measured by ICE-Score, and final answer accuracy, and shows that ICE-Score-guided Soft Self-Consistency improves test-time performance, particularly for low-resource languages. These results offer practical guidance on alignment strategies and inference-time improvements when deploying multilingual LLMs on reasoning-heavy tasks whose execution is delegated to an external interpreter.
Abstract
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages because reasoning and execution are entangled. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework that evaluates PoT by separating multilingual reasoning from code execution, in order to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further show a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting the potential of code quality as a heuristic for improving test-time performance.
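To make the separation of reasoning from execution concrete, the following is a minimal sketch, not the authors' implementation. It assumes each generated program stores its result in a variable named `answer`, and that every candidate program already comes with a quality score (a stand-in for ICE-Score, which is not computed here). The function names `run_program` and `soft_self_consistency`, the scores, and the score-weighted voting rule are all illustrative assumptions; the exact aggregation used in the paper is not stated in this summary.

```python
from collections import defaultdict


def run_program(source: str):
    """Execute one generated program and return the value bound to `answer`.

    Assumption: programs are plain arithmetic and signal their result via
    a variable called `answer`. Emptying builtins only limits what the
    snippet can call; it is NOT a security sandbox.
    """
    scope = {}
    try:
        exec(source, {"__builtins__": {}}, scope)
    except Exception:
        # Programs that fail to parse or run contribute no answer.
        return None
    return scope.get("answer")


def soft_self_consistency(candidates):
    """Pick the answer with the largest total quality score.

    `candidates` is a list of (program_source, quality_score) pairs.
    Unlike plain majority voting, each program's vote is weighted by its
    score, so one high-quality program can outweigh several weak ones.
    """
    totals = defaultdict(float)
    for source, score in candidates:
        result = run_program(source)
        if result is not None:
            totals[result] += score
    if not totals:
        return None
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical samples for "3 items at 4 each: total cost?"
    # The scores are made-up placeholders, not real ICE-Score values.
    samples = [
        ("answer = 3 + 4", 0.2),                                 # wrong operator
        ("answer = 4 + 3", 0.3),                                 # wrong operator
        ("price = 4\ncount = 3\nanswer = price * count", 0.9),   # correct
        ("answer = 3 *", 0.5),                                   # does not parse
    ]
    print(soft_self_consistency(samples))
```

In this toy case a plain majority vote over executed answers would pick 7 (two votes against one), while the score-weighted vote picks the answer backed by the higher-quality program. The model only has to produce the program; the arithmetic itself is done by the interpreter.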
