Teaching LLMs to Abstain across Languages via Multilingual Feedback

Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Orevaoghene Ahia, Shuyue Stella Li, Vidhisha Balachandran, Sunayana Sitaram, Yulia Tsvetkov

TL;DR

This work extends the abstention paradigm to multilingual LLMs by teaching models to abstain through multilingual feedback generated in related languages. The approach addresses the degradation of abstention performance in low-resource languages seen with English-centric methods, and demonstrates gains of up to 9.2% in abstain accuracy on low-resource languages across multiple models and QA datasets. A key insight is that language relatedness and cultural context influence feedback quality and abstention behavior, making abstention a language-specific rather than universal problem. The authors show that using related languages for feedback yields more equitable utility, and that a smaller multilingual model can supervise a larger general-purpose LLM to improve reliability in long-tail languages. These findings highlight the social and linguistic dimensions of trustworthy multilingual NLP and point to future work in culturally aware, multilingual model design.

Abstract

Multilingual LLMs often have knowledge disparities across languages, with larger gaps in under-resourced languages. Teaching LLMs to abstain in the face of knowledge gaps is thus a promising strategy to mitigate hallucinations in multilingual settings. However, previous studies on LLM abstention primarily focus on English; we find that directly applying existing solutions beyond English results in up to 20.5% performance gaps between high- and low-resource languages, potentially due to LLMs' drop in calibration and reasoning beyond a few resource-rich languages. To this end, we propose strategies to enhance LLM abstention by learning from multilingual feedback, where LLMs self-reflect on proposed answers in one language by generating multiple feedback items in related languages: we show that this helps identify the knowledge gaps across diverse languages, cultures, and communities. Extensive experiments demonstrate that our multilingual feedback approach outperforms various strong baselines, achieving up to 9.2% improvement for low-resource languages across three black-box and open models on three datasets, featuring open-book, closed-book, and commonsense QA. Further analysis reveals that multilingual feedback is both an effective and a more equitable abstain strategy to serve diverse language speakers, and that cultural factors have a great impact on language selection and LLM abstention behavior, highlighting future directions for multilingual and multi-cultural reliable language modeling.
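
The feedback loop described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the function name, the four callables (`propose_answer`, `generate_feedback`, `judge`, `related_languages`), and the default `k = 3` are all hypothetical placeholders standing in for LLM calls and a language-relatedness lookup that the abstract does not specify.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def abstain_with_multilingual_feedback(
    question: str,
    question_language: str,
    propose_answer: Callable[[str], str],               # hypothetical: LLM answers the question
    generate_feedback: Callable[[str, str, str], str],  # hypothetical: (question, answer, language) -> feedback
    judge: Callable[[str, str, Sequence[str]], bool],   # hypothetical: True means "keep the answer"
    related_languages: Callable[[str], Sequence[str]],  # hypothetical: language -> related languages
    k: int = 3,                                         # assumed number of feedback items
) -> Tuple[Optional[str], List[str]]:
    """Return (answer, feedback); answer is None when the model abstains."""
    # Step 1: the model proposes an answer to the question.
    answer = propose_answer(question)

    # Step 2: pick up to k languages related to the question's language.
    languages = list(related_languages(question_language))[:k]

    # Step 3: generate one feedback item per related language.
    feedback = [generate_feedback(question, answer, lang) for lang in languages]

    # Step 4: self-reflect on the feedback to decide whether to answer or abstain.
    if judge(question, answer, feedback):
        return answer, feedback
    return None, feedback
```

In practice each callable would wrap a prompt to the same or a different model; how the final abstain decision is derived from the feedback is left abstract here because the abstract does not describe it in detail.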

Paper Structure (44 sections, 2 equations, 9 figures, 16 tables)

Figures (9)

  • Figure 1: Average accuracy of abstention baselines in low- and high-resource languages with Aya-13B. Existing abstain strategies drop by 8.4% on average when applied to QA in low-resource languages.
  • Figure 2: Overview of abstaining via multilingual feedback. An LLM generates feedback on its proposed answer in four language settings to make abstain decisions.
  • Figure 3: GPT-4 evaluation of the role of the $k$ pieces of feedback for a given question: whether they are similar, complementary, conflicting, or unrelated. We also present the abstain accuracy for each feedback role below the role name, showing that conflicting feedback is best for LLM self-reflection, while Multi-related has the most conflicting-feedback scenarios.
  • Figure 4: GPT-4 evaluation of feedback pairs to the same question, comparing Multi-related against other feedback settings to evaluate which produces more relevant and informative feedback.
  • Figure 5: Overlap of abstain decisions made in different languages, where the overlap indicates that LLMs abstain in both/all three of the languages. We find that abstain decisions are only somewhat transferable between relevant and high-resource language clusters.
  • ...and 4 more figures