MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems

Nandan Thakur; Suleman Kazi; Ge Luo; Jimmy Lin; Amin Ahmad

TL;DR

MIRAGE-Bench is a scalable multilingual RAG benchmark that combines cheap heuristic features with a learned surrogate judge to approximate an arena-based ranking derived from GPT-4o judgments. It covers 18 languages and 19 multilingual LLMs using a two-stage workflow: (i) heuristic evaluation with deterministic and LLM-measured features, and (ii) a learned surrogate (random forest) that imitates a Bradley-Terry leaderboard, enabling inexpensive, repeatable rankings. The surrogate's ranking agrees closely with GPT-4o's (Kendall $\tau = 0.909$), and the results show that proprietary and large open-source models currently dominate, while instruction-tuned data can boost smaller models. Data and code are released to spur further development in multilingual RAG. Overall, MIRAGE-Bench offers a practical path to multilingual RAG evaluation that scales beyond English and reduces the cost of heavy LLM judges while preserving ranking fidelity.

Abstract

Traditional retrieval-augmented generation (RAG) benchmarks evaluate systems using heuristic-based metrics, but these require human preferences as the ground truth for reference. In contrast, arena-based benchmarks, where systems compete against each other, require an expensive large language model (LLM) as a judge for a reliable evaluation. We present a simple, efficient technique that combines the best of both worlds. The idea is to train a surrogate judge that takes heuristic metrics as input and predicts the output of the LLM as a judge. In our work, we develop MIRAGE-Bench, a synthetic arena-based RAG benchmark for 18 diverse languages on Wikipedia, focused on multilingual answer generation evaluation. It extensively couples heuristic features with an LLM as a judge for evaluation. We benchmark 19 multilingual LLMs and observe a high correlation (Kendall Tau ($\tau$) = 0.909) between our surrogate judge and GPT-4o as a teacher using the Bradley-Terry framework. Our results show that proprietary and large open-source LLMs currently dominate on MIRAGE-Bench. Our code and datasets are publicly available at https://github.com/vectara/mirage-bench.
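
To make the comparison in the abstract concrete, the following is a minimal sketch of how two arena leaderboards could be compared: fit Bradley-Terry strengths from pairwise win counts, then measure rank agreement with Kendall's tau. It is not the authors' implementation. The function names (`bradley_terry`, `kendall_tau`), the iterative fitting procedure, and both win matrices are assumptions chosen for illustration, and the numbers are made up. The surrogate-judge training step itself is omitted here.

```python
from itertools import combinations


def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times system i beat system j.
    Uses the standard minorization-maximization update (an assumption of
    this sketch, not necessarily what the paper uses).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]  # strengths sum to 1
    return p


def kendall_tau(a, b):
    """Kendall tau-a between two score lists (assumes no tied scores)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical win counts for four systems (10 comparisons per pair).
# "teacher" stands in for LLM-judge decisions, "surrogate" for the
# decisions predicted by a learned surrogate judge.
teacher_wins = [
    [0, 7, 8, 9],
    [3, 0, 6, 8],
    [2, 4, 0, 7],
    [1, 2, 3, 0],
]
surrogate_wins = [
    [0, 6, 8, 9],
    [4, 0, 4, 7],
    [2, 6, 0, 8],
    [1, 3, 2, 0],
]

teacher_scores = bradley_terry(teacher_wins)
surrogate_scores = bradley_terry(surrogate_wins)
tau = kendall_tau(teacher_scores, surrogate_scores)
print(f"Kendall tau between leaderboards: {tau:.3f}")
```

In this toy setup the two leaderboards disagree only on the order of the two middle systems, so one of the six system pairs is discordant. A tau close to 1, such as the 0.909 reported above, indicates that the two rankings order nearly all system pairs the same way.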
