On the Calibration of Multilingual Question Answering LLMs

Yahan Yang, Soham Dan, Dan Roth, Insup Lee

TL;DR

This work investigates confidence calibration in multilingual question answering (QA) large language models across encoder-only, encoder-decoder, and decoder-only architectures, in in-distribution, out-of-distribution (OOD), and cross-lingual transfer settings. It benchmarks five pre-trained multilingual LLMs (MLLMs) on XQuAD, MLQA, and TyDiQA, and evaluates calibration strategies including temperature scaling, label smoothing, and translation-based data augmentation, plus in-context learning (ICL) for LLaMa2. The study finds that calibration achieved on English does not generalize well to other languages, and that syntactic distance from English and pre-training data size strongly predict calibration performance. Temperature scaling on mixed-language validation data and adding cheap translations during fine-tuning both significantly improve cross-lingual calibration, and ICL also offers gains. These results provide practical guidance for deploying multilingual QA systems and point to future work on broader language coverage and domain settings.

Abstract

Multilingual pre-trained Large Language Models (LLMs) are incredibly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracies on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety of QA tasks. We perform extensive experiments, spanning encoder-only, encoder-decoder, and decoder-only QA models (size varying from 110M to 7B parameters) and diverse languages, including both high- and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. For decoder-only LLMs such as LlaMa2, we additionally find that in-context learning improves confidence calibration on multilingual data. We also conduct several ablation experiments to study the effect of language distances, language corpus size, and model size on calibration, and how multilingual models compare with their monolingual counterparts for diverse tasks and languages. Our experiments suggest that the multilingual QA models are poorly calibrated for languages other than English and incorporating a small set of cheaply translated multilingual samples during fine-tuning/calibration effectively enhances the calibration performance.
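
The TL;DR names temperature scaling as one of the post-hoc calibration methods. As a rough illustration of the idea only, the sketch below fits a single scalar temperature T on held-out validation logits by minimizing the negative log-likelihood, then divides logits by T at test time. The function names, the grid search, and the toy logits are assumptions made for this example, not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= log_softmax(logits, temperature)[label]
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL.

    A grid search is used here for simplicity (an assumption of this sketch);
    gradient-based optimizers are a common alternative.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy validation set: the model is very confident in class 0 on three
# examples and very confident in class 1 on the fourth, but the true
# label is always 0, so it is overconfident overall.
val_logits = [[5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
val_labels = [0, 0, 0, 0]

T = fit_temperature(val_logits, val_labels)
calibrated = [math.exp(lp) for lp in log_softmax([5.0, 0.0], T)]
```

Because T rescales all logits by the same factor, the predicted answer (the argmax) is unchanged; only the confidence assigned to it moves. Fitting T on mixed-language validation data, as the TL;DR describes, would amount to passing validation examples from several languages to `fit_temperature`.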

Paper Structure

This paper contains 37 sections, 3 equations, 11 figures, 23 tables.

Figures (11)

  • Figure 1: Pre-trained multilingual models fine-tuned on English QA are not well calibrated in languages other than English, especially low-resource ones like Bengali. (a) shows the reliability diagram for XLM-R fine-tuned on the English TyDiQA training set and evaluated on the Bengali TyDiQA test set. The large deviation from the diagonal Y=X line indicates that it is not well calibrated. (b), (c), and (d) show that Temperature Scaling (TS), fine-tuning with more English data, and fine-tuning on Bengali TyDiQA training data, respectively, all improve calibration compared to (a), as indicated by the better alignment with the diagonal and the lower ECE score. This indicates that, despite high zero-shot cross-lingual accuracy, zero-shot cross-lingual calibration is poor for LLMs unless dedicated calibration strategies, like TS, are used to improve it.
  • Figure 2: Differences in the output format between extractive and generative QA models.
  • Figure 3: Calibration performance of five different models on the XQuAD dataset. Lower ECE is better. mBART shows higher variance on TH and EL because it did not see these two languages during pre-training.
  • Figure 4: Different colors denote different examples and different shapes denote different languages, e.g., squares are English and circles are German. Thus each example (color) has a corresponding translation in the other languages (shapes). En denotes a subset of the English data and En-Large denotes the full English data with available translations. En-tr denotes the En subset along with its translations in other languages. Mixed denotes a set in which each subset comes from a different language. Note: Each colored shape has the same number of examples, and thus En-Large, En-tr, and Mixed have the same size.
  • Figure 5: When Korean and Swahili examples are appended to the prompt, the model is more confident about the correct prediction and less confident about the wrong prediction.
  • ...and 6 more figures
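
The captions above refer to reliability diagrams and the ECE score. As a minimal sketch of how expected calibration error is commonly computed, the code below groups predictions into equal-width confidence bins and takes the size-weighted average gap between each bin's accuracy and its mean confidence. The choice of 10 bins and the function name are assumptions of this example; the binning used in the paper is not stated in this section.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted probability of the chosen answer, one per example.
    correct:     truthy if the chosen answer was right, one per example.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += 1.0 if y else 0.0
        count[idx] += 1

    ece = 0.0
    for s_conf, s_acc, k in zip(conf_sum, acc_sum, count):
        if k == 0:
            continue  # empty bins contribute nothing
        ece += (k / n) * abs(s_acc / k - s_conf / k)
    return ece


# Four predictions at 90% confidence, three of them correct:
# accuracy 0.75 vs. confidence 0.90 gives a gap of about 0.15.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

The per-bin pairs (mean confidence, accuracy) are the points plotted in a reliability diagram; a perfectly calibrated model has every point on the diagonal and an ECE of zero.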