An Analysis of Multilingual FActScore

Kim Trong Vu, Michael Krumdick, Varshini Reddy, Franck Dernoncourt, Viet Dac Lai

TL;DR

This work extends the FActScore factuality pipeline to multilingual settings by dissecting its four components, introducing native annotations in three languages, and analyzing their interactions. It finds that atomic fact extraction and factuality scoring both deteriorate for lower-resource languages and that knowledge-source coverage critically limits accuracy, with Wikipedia alone being insufficient for medium- and low-resource languages. The study shows that expanding the knowledge source (Internet access) and using LLMs as knowledge generators or better retrievers substantially improve multilingual factuality estimation, and that finetuning open models can close the gap with larger closed models for extraction. The findings offer practical guidance for building robust multilingual factuality evaluators and underscore the importance of diverse knowledge sources and scalable data creation for open-ended text evaluation.

Abstract

FActScore has gained popularity as a metric to estimate the factuality of long-form texts generated by Large Language Models (LLMs) in English. However, no prior work has studied the behavior of FActScore in other languages. This paper studies the limitations of each component in the four-component pipeline of FActScore in the multilingual setting. We introduce a new dataset for FActScore on texts generated by strong multilingual LLMs. Our evaluation shows that LLMs exhibit distinct behaviors in both fact extraction and fact scoring tasks. No LLM produces consistent and reliable FActScore across languages with varying levels of resources. We also find that the knowledge source plays an important role in the quality of the estimated FActScore. Using Wikipedia as the knowledge source can hinder estimation of the true FActScore of long-form text due to its limited coverage in medium- and low-resource languages. We also incorporate three mitigations to our knowledge source that ultimately improve FActScore estimation across all languages.
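
As a rough illustration of the quantities discussed above, the sketch below computes a FActScore as the fraction of atomic facts judged supported by the knowledge source, and a scoring accuracy as the agreement between a scorer's labels and human labels. The function names `factscore` and `scoring_accuracy` and the example labels are hypothetical and are not taken from the paper; the extraction, retrieval, and scoring steps that would produce the labels are assumed to have run already.

```python
from typing import Sequence


def factscore(supported: Sequence[bool]) -> float:
    """Fraction of atomic facts labeled as supported (hypothetical helper)."""
    if not supported:
        raise ValueError("need at least one atomic fact")
    return sum(supported) / len(supported)


def scoring_accuracy(predicted: Sequence[bool], gold: Sequence[bool]) -> float:
    """Share of atomic facts where the scorer's label matches the human label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one atomic fact")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Made-up labels for four atomic facts from one generated text.
human_labels = [True, True, False, True]
scorer_labels = [True, False, False, True]

human_fs = factscore(human_labels)      # 3 of 4 facts supported
scorer_fs = factscore(scorer_labels)    # 2 of 4 facts supported
accuracy = scoring_accuracy(scorer_labels, human_labels)  # 3 of 4 labels agree
```

A scorer can reach a FActScore close to the human one while still disagreeing on individual facts, which is one reason to look at both quantities together.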

Paper Structure

This paper contains 31 sections, 2 equations, 9 figures, 23 tables.

Figures (9)

  • Figure 1: FActScore (upper) and Scoring Accuracy (lower) predicted by 4 scorers (GPT4, GemP, GPT3.5, Mistral) in comparison with FActScore by human (R2) on texts generated by GPT4 and GemP in native languages.
  • Figure 2: Accuracy of Factuality Scoring task with different knowledge sources. L stands for Local/Domestic, while I stands for International. P stands for Popular and UP stands for UnPopular.
  • Figure 3: Prediction agreement between two variants of facts (in target language and in translated English).
  • Figure 4: FActScore (upper) and Scoring Accuracy (lower) by fact scorers with and without translation in comparison with FActScore by human (R2) on texts generated by GPT4 and GemP. Dashed lines denote that translation is used with the corresponding scorer.
  • Figure 5: FActScore by Mistral and BLOOMZ on translated facts generated by studied subject models from Min et al. (2023) (R1), compared to golden scoring by GPT3.5, as suggested by Min et al. (2023).
  • ...and 4 more figures