Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation

Yirong Sun, Dawei Zhu, Yanjun Chen, Erjia Xiao, Xinghao Chen, Xiaoyu Shen

TL;DR

The paper investigates the document-level translation capabilities of instruction-tuned LLMs by prompting models to translate entire documents in one pass (DOC) and comparing the results against a chunked, sentence-based baseline (ST[$k$]). It finds that DOC improves document coherence and fluency without document-level fine-tuning, yet BLEU-based metrics often misrepresent these gains and sometimes favor the sentence-level outputs. To address this, the authors propose a GPT-4-based LLM-as-a-judge evaluation along four dimensions: Fluency, Content Errors, Lexical Cohesion, and Grammatical Cohesion. They report strong alignment between this evaluation and human judgments (≈95% agreement), which underlines the limitations of BLEU for docMT. The work argues for context-aware, interpretable evaluation in document-level MT and shows that instruction-tuned LLMs can leverage document context effectively, while acknowledging limitations related to language directions, model scale, and context length.
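
To make the DOC versus ST[$k$] contrast concrete, the following is a minimal sketch of how the two kinds of translation input could be built from a pre-segmented document. It assumes that $k$ is the number of consecutive sentences per chunk, which this summary does not state explicitly; the function name `build_inputs` and the convention `k=0` for DOC are hypothetical and not taken from the paper or its repository.

```python
from typing import List


def build_inputs(sentences: List[str], k: int = 0) -> List[str]:
    """Return the source strings that would be sent to the translator.

    k == 0 -> DOC: the whole document as a single input.
    k >= 1 -> ST[k]: consecutive chunks of k sentences each
              (assumed meaning of k; the last chunk may be shorter).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    if not sentences:
        return []
    if k == 0:
        # DOC: one pass over the entire document.
        return [" ".join(sentences)]
    # ST[k]: each chunk is translated independently of the others.
    return [" ".join(sentences[i:i + k]) for i in range(0, len(sentences), k)]


if __name__ == "__main__":
    doc = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
    print(build_inputs(doc))       # a single DOC input
    print(build_inputs(doc, k=3))  # ST3-style chunks
```

Under ST[$k$] each chunk is translated without access to the rest of the document, which is the context that DOC retains.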

Abstract

Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT). Unlike prior approaches that require specialized techniques, we evaluate LLMs by directly prompting them to translate entire documents in a single pass. Our results show that this method improves translation quality compared to translating sentences separately, even without document-level fine-tuning. However, this advantage is not reflected in BLEU scores, which often favor sentence-based translations. We propose using the LLM-as-a-judge paradigm for evaluation, where GPT-4 is used to assess document coherence, accuracy, and fluency in a more nuanced way than n-gram-based metrics. Overall, our work demonstrates that instruction-tuned LLMs can effectively leverage document context for translation. However, we caution against using BLEU scores for evaluating docMT, as they often provide misleading outcomes, failing to capture the quality of document-level translation. Code and the outputs from GPT4-as-a-judge are available at https://github.com/EIT-NLP/BLEUless_DocMT

Paper Structure

This paper contains 22 sections, 1 equation, 10 figures, 8 tables.

Figures (10)

  • Figure 1: PCC heatmaps among AvgBLEU, Fluency, CE (Content Errors), LE (Lexical Cohesion Errors), and GE (Grammatical Cohesion Errors) for Vicuna-7B under the DOC evaluation type in the en-zh translation direction.
  • Figure 2: Comparison of Vicuna-7B and Vicuna-7B-16K translations under the ST3 and DOC evaluation types in the en-zh translation direction.
  • Figure 3: PCC heatmaps among AvgBLEU, Fluency, CE, LE, and GE for Vicuna-7B, Vicuna-13B, and Mistral-7B under the ST3 and DOC evaluation types in the en-zh translation direction.
  • Figure 4: PCC heatmaps among AvgBLEU, Fluency, CE, LE, and GE for Vicuna-7B, Vicuna-13B, and Mistral-7B under the ST3 and DOC evaluation types in the zh-en translation direction.
  • Figure 5: PCC heatmaps among AvgBLEU, Fluency, CE, LE, and GE for Vicuna-7B, Vicuna-13B, and Mistral-7B under the ST3 and DOC evaluation types in the de-en translation direction.
  • ...and 5 more figures
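
The heatmaps listed above report pairwise PCC values between metrics. Assuming PCC denotes the Pearson correlation coefficient computed over per-document scores (this page does not spell it out), one cell of such a heatmap could be computed roughly as follows. The function name `pearson` and the sample numbers are purely illustrative placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder per-document scores, invented for illustration only.
    avg_bleu = [21.0, 18.5, 25.2, 19.9]
    fluency = [4.0, 4.5, 3.5, 4.0]
    print(round(pearson(avg_bleu, fluency), 3))
```

A value near zero or below for a pair such as AvgBLEU and Fluency would indicate that BLEU does not track that judged dimension, which is the kind of mismatch the paper's argument rests on.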