Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering
Lorena Calvo-Bartolomé, Valérie Aldana, Karla Cantarero, Alonso Madroñal de Mesa, Jerónimo Arenas-García, Jordan Boyd-Graber
TL;DR
This work tackles the challenge of maintaining factual accuracy and cultural relevance in multilingual question answering. It introduces MIND, a data-level, four-stage pipeline that (1) aligns multilingual content through polylingual topic modeling, (2) generates anchor-based questions, (3) retrieves semantically related evidence across languages by topic, and (4) uses LLMs to generate grounded answers and detect discrepancies. The authors validate MIND on bilingual maternal health content (ROSIE) and test generalization on controlled (FEVER-DPLACE-Q) and cross-domain (WIKI-EN-DE) datasets, showing that it can surface both factual inconsistencies and cultural discrepancies. Key contributions are the ROSIE-MIND dataset and an ablation study that demonstrates the value of topic-based retrieval and cross-language alignment while highlighting the continued need for human-in-the-loop verification of nuanced discrepancies. Overall, the approach offers a practical path toward more culturally aware and consistent multilingual QA systems by surfacing, classifying, and contextualizing cross-language divergences before user interaction.
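The four stages can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Passage` type, the use of cosine similarity over topic weights for retrieval, and the `ask`, `answer`, and `compare` callables (placeholders for LLM calls) are all assumptions made for illustration, and the topic weights are assumed to come from an already trained polylingual topic model.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Tuple


@dataclass
class Passage:
    lang: str
    text: str
    # Stage 1 (assumed done elsewhere): document-topic weights in a topic
    # space shared across languages by a polylingual topic model.
    topics: List[float]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_by_topic(anchor: Passage, corpus: List[Passage],
                      target_lang: str, k: int = 2) -> List[Passage]:
    """Stage 3: the k target-language passages closest to the anchor in topic space."""
    candidates = [p for p in corpus if p.lang == target_lang]
    candidates.sort(key=lambda p: cosine(anchor.topics, p.topics), reverse=True)
    return candidates[:k]


def find_discrepancies(anchor: Passage, corpus: List[Passage], target_lang: str,
                       ask: Callable[[Passage], str],
                       answer: Callable[[str, Passage], str],
                       compare: Callable[[str, str, str], str],
                       k: int = 2) -> List[Tuple[str, str, str]]:
    """Stages 2-4: question from the anchor, answers from each side, then a label."""
    question = ask(anchor)                     # Stage 2 (hypothetical LLM call)
    anchor_answer = answer(question, anchor)   # Stage 4a: grounded answer
    results = []
    for passage in retrieve_by_topic(anchor, corpus, target_lang, k):
        label = compare(question, anchor_answer, answer(question, passage))
        results.append((question, passage.text, label))
    return results


if __name__ == "__main__":
    corpus = [
        Passage("en", "Jaundice is a yellowing of the skin.", [0.9, 0.1]),
        Passage("es", "La ictericia es una coloración amarilla de la piel.", [0.8, 0.2]),
        Passage("es", "La lactancia se recomienda durante seis meses.", [0.1, 0.9]),
    ]
    # Toy stand-ins for the LLM steps; the labels are hypothetical.
    rows = find_discrepancies(
        corpus[0], corpus, "es",
        ask=lambda p: "What is jaundice?",
        answer=lambda q, p: p.text,
        compare=lambda q, a, b: "consistent" if a == b else "needs review",
        k=1,
    )
    for row in rows:
        print(row)
```

Passing the LLM steps in as callables keeps the retrieval logic testable on its own and leaves room for a human reviewer to inspect each labeled pair before it is accepted.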
Abstract
Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as "What is jaundice?", while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., "Who assists in childbirth?") that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.
