Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

Lorena Calvo-Bartolomé, Valérie Aldana, Karla Cantarero, Alonso Madroñal de Mesa, Jerónimo Arenas-García, Jordan Boyd-Graber

TL;DR

This work tackles the challenge of maintaining factual accuracy and cultural relevance in multilingual question answering. It introduces MIND, a data-level, four-stage pipeline that first aligns multilingual content through polylingual topic modeling, then generates anchor-based questions, retrieves semantically related evidence across languages by topic, and finally uses LLMs to generate grounded answers and detect discrepancies. The authors validate MIND on bilingual maternal health content (ROSIE) and generalize to controlled (FEVER-DPLACE-Q) and cross-domain (WIKI-EN-DE) datasets, showing it can surface both factual inconsistencies and cultural discrepancies. A key contribution is the ROSIE-MIND dataset, plus a comprehensive ablation study that demonstrates the value of topic-based retrieval and cross-language alignment while highlighting the ongoing need for human-in-the-loop verification to handle nuanced dissonances. Overall, the approach offers a practical path toward more culturally aware and consistent multilingual QA systems by surfacing, classifying, and contextualizing cross-language divergences prior to user interaction.
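The data flow through the four stages can be illustrated with a minimal sketch. This is not the authors' implementation: the `Passage` type, the function names, the precomputed `topic` field, and the output labels are all hypothetical, and the LLM-backed steps (question generation, answering, discrepancy detection) are replaced by toy callables so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Passage:
    text: str
    lang: str
    topic: int  # hypothetical: dominant topic id assigned by a polylingual topic model


def retrieve_by_topic(anchor: Passage, corpus: List[Passage]) -> List[Passage]:
    """Keep passages in the other language that share the anchor's topic."""
    return [p for p in corpus if p.lang != anchor.lang and p.topic == anchor.topic]


def compare_answers(anchor_answer: str, comparison_answer: Optional[str]) -> str:
    """Toy stand-in for LLM discrepancy detection: normalized string equality."""
    if comparison_answer is None:
        return "no_evidence"
    same = anchor_answer.strip().lower() == comparison_answer.strip().lower()
    return "consistent" if same else "divergent"


def run_pipeline(
    anchor: Passage,
    corpus: List[Passage],
    ask: Callable[[Passage], str],
    answer: Callable[[str, List[Passage]], Optional[str]],
) -> Tuple[str, str]:
    """Generate a question from the anchor, answer it on both sides, compare."""
    question = ask(anchor)                         # anchor-based question
    anchor_answer = answer(question, [anchor])     # answer grounded in the anchor
    evidence = retrieve_by_topic(anchor, corpus)   # topic-based cross-language retrieval
    comparison_answer = answer(question, evidence)
    return question, compare_answers(anchor_answer or "", comparison_answer)


# Toy stand-ins for the LLM calls (assumptions for illustration only).
def toy_ask(passage: Passage) -> str:
    return "What does the passage state?"


def toy_answer(question: str, passages: List[Passage]) -> Optional[str]:
    return passages[0].text if passages else None


# Placeholder passages; the comparison texts are shown already translated.
anchor = Passage("Jaundice is a yellowing of the skin.", "en", topic=0)
corpus = [
    Passage("jaundice is a yellowing of the skin.", "es", topic=0),
    Passage("Breastfeeding schedules vary.", "es", topic=1),
]
question, label = run_pipeline(anchor, corpus, toy_ask, toy_answer)
```

In this sketch only the passage that shares the anchor's topic reaches the comparison step, which is the role the summary attributes to topic-based retrieval. A real system would replace the toy callables with LLM calls and the string comparison with a model-based judgment reviewed by a human.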

Abstract

Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as "What is jaundice?", while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., "Who assists in childbirth?") that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.

Paper Structure

This paper contains 47 sections, 9 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: MIND overview. Passages are aligned using topic modeling, and a topic is selected. An LLM then aids question decomposition, answering, and discrepancy detection.
  • Figure 2: Mean fulfillment rates for question (a) and answer (b) quality by model, based on majority vote from three annotators. Answer results are split by anchor vs. comparison corpora. Models generate questions reliably, but answer quality varies between $C^{(a)}$ and $C^{(c)}$.
  • Figure 3: F1 scores per annotator and number of instances per category per model (for the controlled dataset, labeled synthetic, counts are actual instances). Dashed lines mark the mean across annotators. llama3.3:70b and qwen:32b cover all categories but yield more false positives, while gpt-4o predicts none.
  • Figure 4: Confusion matrices for discrepancy classification across models on the FEVER-DPLACE dataset. All models perform strongly, though qwen:32b and gpt-4o show more confusion between c and nd, while llama3.3:70b tends to mistake cd for c or nd.
  • Figure 11: Instructions for Answer quality evaluation.
  • ...and 5 more figures