Eliciting Better Multilingual Structured Reasoning from LLMs through Code
Bryan Li, Tamer Alkhouli, Daniele Bonadiman, Nikolaos Pappas, Saab Mansour
TL;DR
This work tackles multilingual complex reasoning by introducing xSTREET, a benchmark spanning 6 languages and 4 tasks to evaluate structured reasoning. It proposes two complementary remedies: (i) train-time indirect supervision by fine-tuning on a translated, multilingual code-comment dataset (Tcc) using LoRA to preserve base capabilities, and (ii) inference-time code-like prompts (Sim) that encode reasoning steps as function calls and multilingual text. Across BLOOMZ, Falcon, and GPT-3, the methods yield consistent improvements on multilingual reasoning tasks, especially ARC, while largely preserving performance on non-complex tasks. The results suggest that using code as a reasoning scaffold, both in the training data and in the prompt format, provides a language-agnostic inductive bias that enhances multilingual structured reasoning in open-source LLMs, though gains on more math-intensive tasks with smaller models remain limited. The work highlights the potential of combining code-based supervision with code-inspired prompts to broaden reasoning capabilities across languages and domains.
Abstract
The development of large language models (LLMs) has shown progress on reasoning, though studies have largely considered either English or simple reasoning tasks. To address this, we introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We then propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners. First, at training time, we augment a code dataset with multilingual comments using machine translation while keeping program code as-is. Second, at inference time, we bridge the gap between training and inference by employing a prompt structure that incorporates step-by-step code primitives to derive new facts and find a solution. Our methods show improved multilingual performance on xSTREET, most notably on the scientific commonsense reasoning subtask. Furthermore, the models show no regression on non-reasoning tasks, demonstrating that our techniques maintain general-purpose abilities.
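To make the train-time idea concrete, the following is a minimal sketch of translating only the comments of a Python snippet while leaving the program code untouched. It is an illustration under stated assumptions, not the authors' pipeline: the function name `translate_comments`, the toy dictionary `TOY_ES`, and the choice of Python's `tokenize` module are all hypothetical stand-ins, and a real setup would call a machine translation system where the dictionary lookup appears.

```python
import io
import tokenize


def translate_comments(source: str, translate) -> str:
    """Return `source` with each `#` comment passed through `translate`.

    Only comment tokens are rewritten; all code tokens stay as-is.
    `translate` is any callable str -> str (hypothetical stand-in for MT).
    """
    lines = source.splitlines(keepends=True)
    # Find comment tokens with the standard tokenizer, so that '#' characters
    # inside string literals are not mistaken for comments.
    comments = [
        tok
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    ]
    for tok in comments:
        row, col = tok.start
        end_col = tok.end[1]
        text = tok.string[1:].strip()  # drop the leading '#'
        if not text:
            continue  # nothing to translate in an empty comment
        line = lines[row - 1]
        # A comment always runs to the end of its line, so at most one
        # comment token exists per line and the offsets remain valid.
        lines[row - 1] = line[:col] + "# " + translate(text) + line[end_col:]
    return "".join(lines)


# Toy English -> Spanish lookup standing in for a real MT system (assumption).
TOY_ES = {"add two numbers": "suma dos números", "result": "resultado"}

snippet = (
    "def add(a, b):\n"
    "    # add two numbers\n"
    "    return a + b  # result\n"
)
translated = translate_comments(snippet, lambda s: TOY_ES.get(s, s))
print(translated)
```

Working at the token level rather than with a regular expression keeps string literals that contain `#` intact, which matters if the goal is to leave program code byte-for-byte unchanged.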
