"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation

Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin

TL;DR

NoMIRACL introduces a large multilingual benchmark for evaluating robustness of retrieval-augmented generation across 18 languages, using two subsets (non-relevant and relevant) to separately measure hallucination and grounding errors. The dataset relies on human annotations and a zero-shot evaluation protocol with top-k retrieved passages, providing metrics for hallucination rate and error rate to quantify LLM behavior under retrieval noise. Across 11 representative models, GPT-4 generally offers the best balance between avoiding hallucinations and correctly grounding in relevant passages, while many models struggle on one dimension or the other, especially in low-resource languages. The work also analyzes prompting and fine-tuning strategies, highlighting trade-offs and the need for further research to build robust, multilingual RAG systems; the dataset and code are released for community use.

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation across different language families, making it challenging to evaluate LLM robustness against errors in externally retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain only passages judged as non-relevant, whereas queries in the relevant subset include at least one passage judged relevant. We measure relevance assessment using: (i) hallucination rate, measuring the model's tendency to hallucinate an answer when the answer is not present in the passages of the non-relevant subset, and (ii) error rate, measuring the model's failure to recognize relevant passages in the relevant subset. In our work, we observe that most models struggle to balance the two capacities. Models such as LLAMA-2 and Orca-2 exhibit hallucination rates above 88% on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can reach an error rate of up to 74.9% on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting the future work necessary to improve LLM robustness. The NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl.
Paper Structure (25 sections, 10 figures, 9 tables)

Figures (10)

  • Figure 1: LLM robustness evaluation as a binary tree in NoMIRACL. When dealing with queries in the non-relevant subset, the LLM is expected to disregard all noisy passages and refrain from answering ($\mathrm{TN}$). Conversely, for queries in the relevant subset, the LLM should recognize the relevant passage and provide a valid answer ($\mathrm{TP}$).
  • Figure 2: Confusion matrix for robustness evaluation with NoMIRACL. More details are provided in §\ref{sec:robustness}; (Subset) denotes the ground truth in NoMIRACL; (Pred.) denotes the LLM output prediction.
  • Figure 3: An overview of the data construction procedure (for English) involved in NoMIRACL.
  • Figure 4: Vanilla zero-shot prompt template used in our experiments for LLM hallucination evaluation for all 18 languages in NoMIRACL. The instruction is provided in English, similar to Ahuja et al. (2023).
  • Figure 5: Hallucination rate (in %) = $\mathrm{FP}/(\mathrm{FP} + \mathrm{TN})$ on the non-relevant subset ($\mathrm{F}$) in the NoMIRACL test split. The non-relevant subset contains queries with no known answers, i.e., all top-$k$ (where $k=10$) passages are judged by a human annotator as non-relevant. A majority of LLMs (except Mistral) hallucinate on the non-relevant subset. A lower hallucination rate is better. The best model in each category is plotted (see \ref{fig:non-relevant-baseline-results-all} for all models).
  • ...and 5 more figures
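
The confusion matrix of Figure 2 and the hallucination rate of Figure 5 can be illustrated with a short sketch. This is not the released evaluation code: the function names, the (subset, answered) input format, and the toy predictions are hypothetical and chosen only for illustration. The hallucination rate follows the formula in the Figure 5 caption; the error rate is written here as FN/(FN + TP), an assumption based on the abstract's description of failing to recognize relevant passages in the relevant subset.

```python
from collections import Counter


def confusion_counts(examples):
    """Tally TP/FN/FP/TN from (subset, answered) pairs.

    subset   : "relevant" or "non-relevant" (ground-truth subset of the query)
    answered : True if the model produced an answer, False if it abstained
    (This input format is a hypothetical simplification.)
    """
    counts = Counter()
    for subset, answered in examples:
        if subset == "relevant":
            # Answering is correct here; abstaining is an error.
            counts["TP" if answered else "FN"] += 1
        elif subset == "non-relevant":
            # Abstaining is correct here; answering is a hallucination.
            counts["FP" if answered else "TN"] += 1
        else:
            raise ValueError(f"unknown subset: {subset!r}")
    return counts


def hallucination_rate(counts):
    """FP / (FP + TN) in percent, computed on the non-relevant subset."""
    total = counts["FP"] + counts["TN"]
    return 100.0 * counts["FP"] / total if total else float("nan")


def error_rate(counts):
    """FN / (FN + TP) in percent, computed on the relevant subset (assumed form)."""
    total = counts["FN"] + counts["TP"]
    return 100.0 * counts["FN"] / total if total else float("nan")


# Toy predictions (made up for illustration, not results from the paper).
toy_predictions = (
    [("non-relevant", True)] * 3 + [("non-relevant", False)]
    + [("relevant", True)] * 3 + [("relevant", False)]
)

toy_counts = confusion_counts(toy_predictions)
print(f"hallucination rate: {hallucination_rate(toy_counts):.1f}%")
print(f"error rate: {error_rate(toy_counts):.1f}%")
```

Keeping the two rates separate, rather than folding them into one accuracy number, mirrors the two-subset design: a model that always answers would score a 0% error rate but a 100% hallucination rate, and a model that always abstains would show the reverse.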