Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation

Khondoker Ittehadul Islam, Gabriele Sarti

TL;DR

Reveal-Bangla provides a manually translated Bangla subset of the Reveal dataset with annotated multi-step reasoning, enabling cross-lingual evaluation of small language models. The study compares English-centric and Bangla-centric models under two prompting regimes, examining how gold reasoning steps influence predictions and how attribution with ContextCite reveals step-wise importance across languages. The findings show that reasoning context improves performance mainly on non-binary questions and that Bangla reasoning steps receive disproportionate attention, highlighting language-specific needs for effective cross-lingual reasoning. Overall, the work underscores the limitations of directly porting English CoT methods to Bangla and motivates language-aware strategies and more robust attribution analyses for low-resource languages.

Abstract

Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models' predictions, highlighting different trends across models and languages.

Paper Structure

This paper contains 37 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: A row instance of Reveal-Bangla containing the translated Question, Evidence, Reasoning Steps and Answer from Reveal.
  • Figure 2: Accuracy of EngLlama and BenLlama for the gen_ans and w_cot_gen_ans settings on English and Bangla Reveal subsets.
  • Figure 3: EngLlama and BenLlama accuracy on Reveal Binary (top) and Non-Binary (bottom) questions.
  • Figure 4: Importance ratio for EngLlama and BenLlama on w_cot_gen_ans reasoning steps between $-4$ (lowest) and $+4$ (highest).
  • Figure 5: Distribution of step count and token distribution of steps. Notably, the number of words required to describe a step in Bangla is lower than in English.
  • ...and 3 more figures