Evaluation Methodology for Large Language Models for Multilingual Document Question and Answer
Adar Kahana, Jaya Susan Mathew, Said Bleik, Jeremy Reynolds, Oren Elisha
TL;DR
This work proposes a methodology to evaluate multilingual QA performance of large language models by integrating translation strategies into the QA pipeline and testing across diverse datasets (XQuAD, SQuAD, ESG, HeQ). Using GPT-4-32K and GPT-3.5-Turbo via Azure OpenAI, the study compares full versus partial translations and English-centric workflows, revealing that English operation often yields the strongest results despite added translation costs. Key findings include substantial accuracy gains with GPT-4, the advantage of partial translation over full translation in localization contexts, and dataset-specific effects due to input formats like PDFs and non-English content such as Hebrew. The work provides practical guidance for multilingual QA deployments, highlighting model/version selection and translation considerations to optimize real-world performance.
Abstract
With the widespread adoption of Large Language Models (LLMs), this paper investigates the multilingual capability of these models. Our preliminary results show that translating the native-language context, question, and answer into a high-resource language produced the best results.
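As a rough illustration of the translate-then-answer workflow described above, the following Python sketch routes a non-English example through a pivot language before querying a model. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `QAExample`, `answer_via_pivot`, `translate`, and `answer` are hypothetical, and the two callables stand in for whatever translation service and LLM endpoint a deployment actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical callable shapes (assumptions, not a real library API):
#   translate(text, source_lang, target_lang) -> translated text
#   answer(context, question) -> model answer
Translator = Callable[[str, str, str], str]
Answerer = Callable[[str, str], str]


@dataclass
class QAExample:
    context: str
    question: str
    lang: str  # language code shared by context and question, e.g. "he"


def answer_via_pivot(
    example: QAExample,
    translate: Translator,
    answer: Answerer,
    pivot: str = "en",
    translate_back: bool = False,
) -> str:
    """Answer a question by first translating its inputs into a pivot language."""
    # Inputs already in the pivot language need no translation step.
    if example.lang == pivot:
        return answer(example.context, example.question)

    # Translate both the context and the question into the pivot language.
    context = translate(example.context, example.lang, pivot)
    question = translate(example.question, example.lang, pivot)
    result = answer(context, question)

    # Optionally return the answer in the user's original language.
    if translate_back:
        result = translate(result, pivot, example.lang)
    return result
```

When scoring against a reference answer, the reference would presumably be translated into the same pivot language with the same `translate` callable so that prediction and reference are compared in one language.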
