
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation

Zhaopeng Feng, Jiayuan Su, Jiamei Zheng, Jiahan Ren, Yan Zhang, Jian Wu, Hongwei Wang, Zuozhu Liu

TL;DR

This paper introduces M-MAD, an MT evaluation framework that replaces monolithic LLM judgments with a multi-agent debating system driven by Multidimensional Quality Metrics (MQM). M-MAD decouples MQM into four evaluation dimensions, runs a dedicated debate for each dimension, and then synthesizes the results into a final judgment. With this design it achieves superior segment-level and competitive system-level performance against both LLM-based and automatic metrics, even with a suboptimal model. Extensive ablations show that dimension partition, accurate error-span predictions, and structured debating strategies underpin the improvements. The work demonstrates the potential of LLM-driven, collaborative reasoning for robust translation evaluation and sets a foundation for future heterogeneous-agent exploration.

Abstract

Recent advancements in large language models (LLMs) have given rise to the LLM-as-a-judge paradigm, showcasing their potential to deliver human-like judgments. However, in the field of machine translation (MT) evaluation, current LLM-as-a-judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our findings demonstrate that M-MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine-grained assessments; (2) employing multi-agent debates to harness the collaborative reasoning capabilities of LLMs; and (3) synthesizing dimension-specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M-MAD not only outperforms all existing LLM-as-a-judge methods but also competes with state-of-the-art reference-based automatic metrics, even when powered by a suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective for the LLM-as-a-judge paradigm. Our code and data are publicly available at https://github.com/SU-JIAYUAN/M-MAD.
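The three stages named in the abstract (dimension decoupling, per-dimension debate, final synthesis) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the dimension names, the `Agent` callable signature, the consensus rule, the intersection fallback, and the rule-based `judge` that stands in for an LLM judge. The stub agents replace real model calls so that the control flow can be run as is.

```python
from typing import Callable, Dict, List

# Assumed dimension names; the abstract only states that MQM is split into dimensions.
DIMENSIONS = ["accuracy", "fluency", "style", "terminology"]

# Transcript layout: rounds -> agents -> error annotations.
Transcript = List[List[List[str]]]
# Hypothetical agent interface: (dimension, source, translation, transcript) -> errors.
Agent = Callable[[str, str, str, Transcript], List[str]]


def debate(dimension: str, source: str, translation: str,
           agents: List[Agent], max_rounds: int = 3) -> List[str]:
    """Run one debate group for a single dimension and return its error list."""
    if not agents:
        raise ValueError("a debate needs at least one agent")
    transcript: Transcript = []
    for _ in range(max_rounds):
        # Every agent sees the transcript of all earlier rounds.
        answers = [sorted(agent(dimension, source, translation, transcript))
                   for agent in agents]
        transcript.append(answers)
        if all(answer == answers[0] for answer in answers):
            return answers[0]  # consensus reached, stop early
    # Assumed fallback: keep only errors every agent reported in the last round.
    common = set(transcript[-1][0])
    for answer in transcript[-1][1:]:
        common &= set(answer)
    return sorted(common)


def judge(per_dimension: Dict[str, List[str]]) -> List[str]:
    """Merge dimension-level results into one de-duplicated, tagged error list."""
    merged: List[str] = []
    for dimension in DIMENSIONS:
        for error in per_dimension.get(dimension, []):
            tagged = f"{dimension}/{error}"
            if tagged not in merged:
                merged.append(tagged)
    return merged


def evaluate(source: str, translation: str,
             agents: List[Agent], max_rounds: int = 3) -> List[str]:
    """Stage 1: split by dimension. Stage 2: debate. Stage 3: synthesize."""
    per_dimension = {d: debate(d, source, translation, agents, max_rounds)
                     for d in DIMENSIONS}
    return judge(per_dimension)


# Stub agents standing in for LLM calls.
def strict_agent(dimension, source, translation, transcript):
    return ["major: mistranslation"] if dimension == "accuracy" else []


def persuadable_agent(dimension, source, translation, transcript):
    # Adopts the first agent's answer from the previous round, if there was one.
    return list(transcript[-1][0]) if transcript else []


result = evaluate("source text", "candidate translation",
                  [strict_agent, persuadable_agent])
print(result)
```

In this toy run the two stubs disagree on accuracy in the first round and agree in the second, so the debate stops early. With real models, each agent call would wrap a prompt restricted to one dimension, and the judge would itself be a model call rather than a merge rule.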

Paper Structure (20 sections, 1 equation, 17 figures, 13 tables)

Figures (17)

  • Figure 1: Comparison between Single-Agent (SA) MT Evaluation and Multidimensional Multi-Agent Debate (M-MAD). SA: A single agent evaluates translations using a coupled MQM template (e.g., GEMBA; Kocmi and Federmann, 2023). M-MAD: a) decouple the MQM paradigm into four evaluation dimensions; b) multi-agent groups debate within each dimension; c) a judge synthesizes viewpoints from the debate groups into the final evaluation. We view M-MAD as analogous to the structure of "a neural network in natural language form": each stage functions like a layer, each agent serves as a neuron, and their interactions act as hidden states.
  • Figure 2: MQM score distribution for WMT 23 Metrics Shared Task ZH-EN set. HQ: High-quality translations with MQM score $= 0$. MQ: Medium-quality translations with $-5 <$ MQM score $< 0$. LQ: Low-quality translations with MQM score $\leq -5$.
  • Figure 3: Performance with increased multi-agent debating rounds on WMT 23 ZH-EN.
  • Figure 4: The general prompt for GEMBA-MQM, with the "(user)" and "(assistant)" section repeated for each few-shot example.
  • Figure 5: Three examples used for GEMBA-MQM.
  • ...and 12 more figures
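
The quality tiers defined in the Figure 2 caption can be written as a small bucketing function. This is only a sketch of the thresholds as stated there; the function name `quality_bucket`, the sample scores, and the choice to reject positive scores (assuming MQM scores are penalties and therefore never above zero) are illustrative assumptions.

```python
from collections import Counter


def quality_bucket(mqm_score: float) -> str:
    """Map an MQM score to the HQ / MQ / LQ tiers from the Figure 2 caption."""
    if mqm_score > 0:
        # Assumption: MQM scores are penalties, so positive values are invalid.
        raise ValueError("MQM scores are expected to be <= 0")
    if mqm_score == 0:
        return "HQ"  # high quality: no penalty at all
    if mqm_score > -5:
        return "MQ"  # medium quality: -5 < score < 0
    return "LQ"      # low quality: score <= -5


# Made-up scores, used only to show how a distribution could be tallied.
sample_scores = [0, -1, -4.9, -5, -25, 0]
distribution = Counter(quality_bucket(score) for score in sample_scores)
print(distribution)
```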