Eliciting Better Multilingual Structured Reasoning from LLMs through Code

Bryan Li; Tamer Alkhouli; Daniele Bonadiman; Nikolaos Pappas; Saab Mansour

TL;DR

This work tackles multilingual complex reasoning by introducing xSTREET, a benchmark spanning 6 languages and 4 tasks for evaluating structured reasoning. It proposes two complementary remedies: (i) train-time indirect supervision, by fine-tuning on a translated, multilingual code-comment dataset (Tcc) using LoRA to preserve base capabilities, and (ii) inference-time code-like prompts (Sim) that encode reasoning steps as function calls alongside multilingual text. Across BLOOMZ, Falcon, and GPT-3, the methods yield consistent improvements on multilingual reasoning tasks, especially ARC, while largely preserving performance on non-complex tasks. The results suggest that using code as a reasoning scaffold, both in the training data and in the prompt format, provides a language-agnostic inductive bias that improves multilingual structured reasoning in open-source LLMs, although gains on the more math-intensive tasks remain limited for smaller models. The work highlights the potential of combining code-based supervision with code-inspired prompts to broaden reasoning capabilities across languages and domains.
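To make the idea of a code-like prompt concrete, the following is a minimal sketch of how facts and reasoning steps could be rendered as variable assignments and function calls. It is not the paper's actual Sim format, which is not reproduced on this page: the `Step` class, the `render_code_prompt` helper, the `derive` function name, and the `sent1`/`int1` identifiers are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    """One reasoning step: premises (by id) that yield a new fact."""
    premises: List[str]
    conclusion_id: str
    conclusion_text: str


def render_code_prompt(facts: Dict[str, str], steps: List[Step],
                       func_name: str = "derive") -> str:
    """Render facts as string assignments and steps as function calls.

    `func_name` is a hypothetical primitive name, not taken from the paper.
    """
    # Facts keep their natural-language text (any language) as string literals.
    lines = [f"{fid} = {json.dumps(text, ensure_ascii=False)}"
             for fid, text in facts.items()]
    # Each step becomes a call over premise ids, with the new fact as a comment.
    for step in steps:
        args = ", ".join(step.premises)
        lines.append(
            f"{step.conclusion_id} = {func_name}({args})  # {step.conclusion_text}"
        )
    return "\n".join(lines)


facts = {"sent1": "Juan tiene 3 manzanas.", "sent2": "Compra 2 más."}
steps = [Step(["sent1", "sent2"], "int1", "Juan tiene 5 manzanas.")]
prompt = render_code_prompt(facts, steps)
print(prompt)
```

In this sketch the program structure (identifiers and calls) stays the same across languages, and only the string literals and comments change, which is one way to read the "language-agnostic" scaffold described above.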

Abstract

The development of large language models (LLMs) has shown progress on reasoning, though studies have largely considered either English or simple reasoning tasks. To address this, we introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We then propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners. First, at training time, we augment a code dataset with multilingual comments using machine translation while keeping program code as-is. Second, at inference time, we bridge the gap between training and inference by employing a prompt structure that incorporates step-by-step code primitives to derive new facts and find a solution. Our methods show improved multilingual performance on xSTREET, most notably on the scientific commonsense reasoning subtask. Furthermore, the models show no regression on non-reasoning tasks, thus demonstrating that our techniques maintain general-purpose abilities.
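The train-time augmentation described above (translating comments while keeping program code as-is) can be sketched for Python sources with the standard `tokenize` module. This is only an illustration under stated assumptions: `translate_comments` is a hypothetical helper, and `toy_translate` is a dictionary lookup standing in for a real machine translation system; the paper's actual pipeline and source languages are not specified on this page.

```python
import io
import tokenize
from typing import Callable


def translate_comments(source: str, translate: Callable[[str], str]) -> str:
    """Return `source` with every `#` comment translated; code is untouched."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # The tokenizer separates real comments from '#' inside string literals.
    comments = [tok for tok in tokens if tok.type == tokenize.COMMENT]
    for tok in comments:
        row, col = tok.start          # 1-based row, 0-based column
        end_col = tok.end[1]
        text = tok.string[1:].strip() # drop the leading '#'
        line = lines[row - 1]
        lines[row - 1] = line[:col] + "# " + translate(text) + line[end_col:]
    return "".join(lines)


# Stand-in for a machine translation system (assumption for this sketch).
TOY_ES = {"add two numbers": "sumar dos números"}


def toy_translate(text: str) -> str:
    return TOY_ES.get(text, text)


example = (
    "def add(a, b):\n"
    "    # add two numbers\n"
    '    tag = "# not a comment"\n'
    "    return a + b\n"
)
translated = translate_comments(example, toy_translate)
print(translated)
```

Working at the token level rather than with a regular expression keeps string literals that contain `#` intact, so the executable part of each sample is byte-for-byte unchanged.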

Paper Structure (42 sections, 8 figures, 7 tables)

Introduction
Related Work
Code & Reasoning Hypothesis for LLMs
Code Prompts for Complex Reasoning
Multilingual Reasoning for LLMs
STREET Complex Reasoning Benchmark
Source Tasks
Linearized prompt format
Source Code Dataset
Multilingual Complex Reasoning Benchmark: xSTREET
Code with Multilingual Comments as Indirect Supervision for Reasoning
Translated Code Comments Dataset (Tcc)
Train Time: fine-tuning on Tcc
Multilingual Complex Reasoning as a Downstream Task
Multilingual code prompts
...and 27 more sections

Figures (8)

Figure 1: An overview of our methods to improve multilingual structured reasoning. First (top), we create the translated code comments (Tcc) dataset, and use it in a fine-tuning setup. Second (bottom), we use the resulting LLM for inference on reasoning tasks. We find the most success with a code prompt format that bridges the representations between training and inference.
Figure 2: The translation process for an xSTREET entry. We start from an example from STREET (Ribeiro et al., 2022). The reasoning graphs are transferred directly, while the text of each sentence is translated. Note that this shows only one of the 4 tasks (GSM8K) and one of the 5 target languages (Spanish).
Figure 3: Depictions of 3 prompting formats for the xSTREET tasks. For each format, the input is in a grey box, while the expected output is in a white box. Top left: direct. Bottom left: linearized. Right: Sim code prompts (2 languages). In the code prompts, aligned facts are color-coded.
Figure 4: Results on the ARC task of xSTREET, with BLOOMZ-based models. The random baseline is 25%. 'Avg' bars are averaged across the 5 non-English languages. Linearized prompts use lines, while code prompts use dots.
Figure 5: Results on the GSM8K, AQUA_RAT, and AR_LSAT tasks of STREET (left) and xSTREET (right), with GPT-3. For each task, the random baseline is shown with a dotted line. xSTREET results are averaged over 5 languages.
...and 3 more figures
