On the Calibration of Multilingual Question Answering LLMs
Yahan Yang, Soham Dan, Dan Roth, Insup Lee
TL;DR
This work investigates confidence calibration in multilingual question answering (QA) large language models across extractive, generative, and decoder-only architectures, covering in-distribution, out-of-distribution (OOD), and cross-lingual transfer settings. It benchmarks five pre-trained multilingual LLMs (MLLMs) on XQuAD, MLQA, and TyDiQA, and evaluates calibration strategies including temperature scaling, label smoothing, and translation-based data augmentation, plus in-context learning (ICL) for Llama 2. The study finds that calibration on English does not generalize well to other languages, and that syntactic distance from English and pre-training data size strongly predict calibration performance. Temperature scaling on mixed-language validation data and adding cheap translations during fine-tuning both significantly improve cross-lingual calibration, and ICL also offers gains. These results provide practical guidance for deploying multilingual QA systems and point to future work on broader language coverage and domain settings.
Abstract
Multilingual pre-trained Large Language Models (LLMs) are highly effective at Question Answering (QA), a core task in Natural Language Understanding, achieving high accuracy on several multilingual benchmarks. However, little is known about how well their confidences are calibrated. In this paper, we comprehensively benchmark the calibration of several multilingual LLMs (MLLMs) on a variety of QA tasks. We perform extensive experiments spanning encoder-only, encoder-decoder, and decoder-only QA models (ranging in size from 110M to 7B parameters) and diverse languages, including both high- and low-resource ones. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings, and investigate strategies to improve it, including post-hoc methods and regularized fine-tuning. For decoder-only LLMs such as Llama 2, we additionally find that in-context learning improves confidence calibration on multilingual data. We also conduct several ablation experiments to study the effect of language distance, language corpus size, and model size on calibration, and to compare multilingual models with their monolingual counterparts across diverse tasks and languages. Our experiments suggest that multilingual QA models are poorly calibrated for languages other than English, and that incorporating a small set of cheaply translated multilingual samples during fine-tuning or calibration effectively enhances calibration performance.
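
As an illustration of the post-hoc calibration mentioned above, the following is a minimal sketch of temperature scaling, not the authors' implementation. It assumes a validation set of per-example logits and gold labels (which, following the summary above, could be pooled from several languages), fits a single temperature by grid search on negative log-likelihood, and measures miscalibration with expected calibration error (ECE). The choice of ECE as the metric, the grid of candidate temperatures, and all function names are assumptions made for this example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the gold labels at a given temperature."""
    losses = [-log_softmax(row, temperature)[y] for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature that minimizes validation NLL (simple grid search)."""
    if candidates is None:
        # Hypothetical grid: 0.05, 0.10, ..., 10.0
        candidates = [0.05 * k for k in range(1, 201)]
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def confidences_and_correctness(logit_rows, labels, temperature=1.0):
    """Top-class probability and whether the top class matches the gold label."""
    confs, correct = [], []
    for row, y in zip(logit_rows, labels):
        probs = [math.exp(lp) for lp in log_softmax(row, temperature)]
        pred = max(range(len(probs)), key=probs.__getitem__)
        confs.append(probs[pred])
        correct.append(pred == y)
    return confs, correct


if __name__ == "__main__":
    # Toy, overconfident validation set: always predicts class 0, right 3 times out of 4.
    val_logits = [[4.0, 0.0]] * 4
    val_labels = [0, 0, 0, 1]
    t = fit_temperature(val_logits, val_labels)
    before = expected_calibration_error(*confidences_and_correctness(val_logits, val_labels))
    after = expected_calibration_error(*confidences_and_correctness(val_logits, val_labels, t))
    print(f"T={t:.2f}  ECE before={before:.3f}  ECE after={after:.3f}")
```

Because temperature scaling divides every logit by the same positive constant, it leaves the predicted answer unchanged and only adjusts the confidence attached to it.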
