MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, Amin Ahmad

TL;DR

Mirage-Bench presents a scalable multilingual RAG benchmark that fuses cheap heuristic features with a learned surrogate judge to approximate an arena-based ranking guided by GPT-4o judgments. It evaluates 19 frontier LLMs across 18 languages using a two-stage workflow: (i) deterministic and LLM-measured heuristic evaluation, and (ii) a learned surrogate (random forest) that imitates a Bradley-Terry leaderboard, enabling inexpensive, repeatable rankings. The study reports strong agreement with the GPT-4o ranking (Kendall's $\tau = 0.909$) and shows that proprietary and large open-source models currently dominate, while instruction-tuning data can boost smaller models; the work also provides data and code to spur further development in multilingual RAG. Overall, Mirage-Bench offers a practical path to multilingual RAG evaluation that scales beyond English and reduces the cost of heavy LLM judges while preserving ranking fidelity.

Abstract

Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems using heuristic-based metrics, but these require human preferences as the ground truth for reference. In contrast, arena-based benchmarks, where systems compete against each other, require an expensive large language model (LLM) as a judge for a reliable evaluation. We present a simple, efficient technique to combine the best of both worlds. The idea is to train a surrogate judge that takes heuristic metrics as input and predicts the output of the LLM as a judge. In our work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18 diverse languages on Wikipedia, focused on multilingual answer generation evaluation. It extensively couples both heuristic features and an LLM as a judge for evaluation. We benchmark 19 multilingual LLMs and observe a high correlation (Kendall's tau, $\tau = 0.909$) between our surrogate judge and GPT-4o as a teacher under the Bradley-Terry framework. Our results show that proprietary and large open-source LLMs currently dominate on MIRAGE-Bench. Our code and datasets are made publicly available here: https://github.com/vectara/mirage-bench.
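
To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It fits Bradley-Terry strengths from pairwise win counts with a standard minorization-maximization update (an assumption; the paper's fitting procedure may differ) and computes Kendall's tau between two rankings. The model names, win counts, and the surrogate ranking are invented toy values.

```python
import math
from itertools import combinations


def fit_bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry log-strengths from pairwise win counts.

    wins[(a, b)] is the number of times model a was preferred over model b.
    Uses the classic MM update; this is an illustrative choice only.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        # Rescale so strengths sum to the number of models (fixes the scale),
        # with a small floor so the logarithm below is always defined.
        norm = sum(updated.values())
        strength = {
            m: max(s / norm * len(models), 1e-12) for m, s in updated.items()
        }
    return {m: math.log(s) for m, s in strength.items()}


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings (dicts of item -> rank, no ties)."""
    items = sorted(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy pairwise judgments (hypothetical): 10 comparisons per pair of models.
toy_wins = {
    ("model_a", "model_b"): 8, ("model_b", "model_a"): 2,
    ("model_a", "model_c"): 9, ("model_c", "model_a"): 1,
    ("model_b", "model_c"): 7, ("model_c", "model_b"): 3,
}
coefficients = fit_bradley_terry(toy_wins)
teacher_order = sorted(coefficients, key=coefficients.get, reverse=True)
teacher_rank = {m: r + 1 for r, m in enumerate(teacher_order)}

# A hypothetical surrogate ranking that swaps the two weaker models.
surrogate_rank = {"model_a": 1, "model_b": 3, "model_c": 2}
tau = kendall_tau(teacher_rank, surrogate_rank)
```

In this toy setting the teacher ranking comes from the judge's pairwise outcomes, and `tau` measures how closely a cheaper ranking reproduces it; a value of 1 would mean the two orderings agree on every pair of models.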

Paper Structure

This paper contains 22 sections, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Multilingual naive RAG pipeline in Hindi (hn). In Mirage-Bench, we reuse the oracle retrieval set (query and oracle judged passages) from MIRACL (Zhang et al., 2023) and focus on evaluating the answer generation stage with multilingual LLMs.
  • Figure 2: The Mirage-Bench evaluation flowchart consists of three steps: (i) heuristic-based features evaluating the baseline model response across several dimensions; (ii) exhaustive pairwise comparisons with GPT-4o as a judge on a small subset of queries to train our surrogate judge; and (iii) after training, we use our surrogate judge to output the model ranking on the whole subset of queries, to construct the synthetic RAG arena-based leaderboard.
  • Figure 3: Lollipop plots denoting the average heuristic-based feature scores achieved by LLM baselines for each language in Mirage-Bench. The $x$-axis denotes the 18 languages, whereas the $y$-axis plots every heuristic feature score. Models in the same LLM family are represented in the same color in a lollipop (as multiple circles). A companion figure in the Appendix provides lollipop plots for all eleven heuristic-based features used in our work.
  • Figure 4: Mirage-Bench arena-based leaderboards: (left heatmap) Bradley-Terry model coefficients with GPT-4o as a pairwise judge for a subset of 100 sampled queries; (right heatmap) Synthetic rankings using heuristic-based features and a random forest model as a surrogate judge on all queries. Each highlighted cell denotes the rank of the LLM (lower the better). LLMs are sorted by lowest to highest average rank across all 18 languages.
  • Figure 5: Boxplot with the feature importance value (averaged across 18 languages in Mirage-Bench) observed by the learning to rank (random forest) model.
  • ...and 8 more figures
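
The leaderboard layout described in the Figure 4 caption (a rank per model and language, with models sorted by average rank) can be sketched as follows. This is only an illustration under stated assumptions: the model names, language codes, and scores are toy values, and the helper names `ranks_from_scores` and `sort_by_average_rank` are hypothetical rather than taken from the paper's code.

```python
def ranks_from_scores(scores):
    """Map model -> rank (1 = best) from model -> score (higher is better)."""
    ordered = sorted(scores, key=lambda m: scores[m], reverse=True)
    return {m: i + 1 for i, m in enumerate(ordered)}


def sort_by_average_rank(scores_by_language):
    """Return (model, average rank) pairs, lowest average rank first."""
    ranks_per_model = {}
    for scores in scores_by_language.values():
        for model, rank in ranks_from_scores(scores).items():
            ranks_per_model.setdefault(model, []).append(rank)
    averages = {m: sum(r) / len(r) for m, r in ranks_per_model.items()}
    return sorted(averages.items(), key=lambda kv: kv[1])


# Toy per-language scores (hypothetical), e.g. Bradley-Terry coefficients.
toy_scores = {
    "en": {"model_x": 1.2, "model_y": 0.4, "model_z": -0.9},
    "hi": {"model_x": 0.3, "model_y": 0.8, "model_z": -1.0},
    "sw": {"model_x": 0.9, "model_y": 0.1, "model_z": -0.5},
}
leaderboard = sort_by_average_rank(toy_scores)
```

Averaging ranks rather than raw scores keeps languages on a common scale, which matches the caption's description of cells holding ranks where lower is better.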