BatchGEMBA: Token-Efficient Machine Translation Evaluation with Batched Prompting and Prompt Compression

Daniil Larionov, Steffen Eger

TL;DR

BatchGEMBA-MQM extends GEMBA-MQM by enabling batched prompting to reduce token overhead in MT evaluation and introduces a batching-aware prompt compressor to preserve evaluation fidelity. Across multiple LLMs and batch sizes, batching achieves 2–4x token savings, while compression adds an additional 13–15% reduction and mitigates quality degradation, with GPT-4o retaining over 90% of baseline quality at batch size 4 when compressed. The approach demonstrates model-dependent robustness to batching, with some models experiencing notable declines and others staying near their single-example performance. The work offers a practical, scalable path toward efficient, prompt-based MT evaluation and provides code and trained models to support future research.

Abstract

Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, resulting in significant token overhead and computational inefficiencies. In this work, we introduce BatchGEMBA-MQM, a framework that integrates batched prompting with the GEMBA-MQM metric for machine translation evaluation. Our approach aggregates multiple translation examples into a single prompt, reducing token usage by a factor of 2 to 4 (depending on the batch size) relative to single-example prompting. Furthermore, we propose a batching-aware prompt compression model that achieves an additional token reduction of 13-15% on average while also helping to mitigate batching-induced quality degradation. Evaluations across several LLMs (GPT-4o, GPT-4o-mini, Mistral Small, Phi4, and CommandR7B) and varying batch sizes reveal that while batching generally affects quality negatively (though sometimes not substantially), prompt compression does not degrade quality further and in some cases recovers lost quality. For instance, GPT-4o retains over 90% of its baseline performance at a batch size of 4 when compression is applied, compared to a 44.6% drop without compression. We plan to release our code and trained models at https://github.com/NL2G/batchgemba to support future research in this domain.
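To make the source of the batching savings concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical, heavily shortened instruction string (`INSTRUCTIONS`), a hypothetical `Example` record, and a crude whitespace token count standing in for a real tokenizer. It only illustrates that the shared instruction block is paid once per prompt rather than once per example; it does not reproduce the paper's measured 2-4x reduction or its compression model.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One source/translation pair to be evaluated (hypothetical record type)."""
    source: str
    translation: str


# Hypothetical stand-in for the evaluation instructions. The real GEMBA-MQM
# prompt is considerably longer, so the shared overhead is larger in practice.
INSTRUCTIONS = (
    "Identify translation errors in each example and classify them by "
    "MQM category and severity (critical, major, minor)."
)


def count_tokens(text: str) -> int:
    """Crude whitespace token count; an assumption replacing a real tokenizer."""
    return len(text.split())


def build_prompt(batch: list[Example]) -> str:
    """Place the shared instructions once, followed by every example in the batch."""
    parts = [INSTRUCTIONS]
    for i, ex in enumerate(batch, start=1):
        parts.append(
            f"Example {i}\nSource: {ex.source}\nTranslation: {ex.translation}"
        )
    return "\n\n".join(parts)


def total_prompt_tokens(examples: list[Example], batch_size: int) -> int:
    """Sum prompt tokens over all prompts needed to cover the examples."""
    total = 0
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        total += count_tokens(build_prompt(batch))
    return total
```

Calling `total_prompt_tokens(examples, 1)` and `total_prompt_tokens(examples, 4)` on the same list shows the difference: under these assumptions the two totals differ exactly by the instruction tokens that are no longer repeated. Output tokens and any few-shot demonstrations are ignored in this sketch.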

Paper Structure

This paper contains 19 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overall Structure of BatchGEMBA-MQM prompt.
  • Figure 2: Relative quality degradation, normalized to each model's performance at batch size $1$. A '-' in the model name indicates evaluation without compressed prompts; a '+' indicates compressed prompts.