Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral

António Farinhas, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, André F. T. Martins

TL;DR

This paper tackles the high computational cost of large neural MT models with a cascaded translation system that defaults to a lightweight model and defers harder instances to a larger one. The deferral rule relies on existing quality estimation (QE) metrics, giving a simple, budget-aware mechanism that requires no task-specific deferral training. On WMT24, small-to-large cascades with QE-based deferral match the large model’s quality while deferring only 30% to 50% of inputs, a finding supported by both automatic metrics and human evaluation. The approach reduces compute cost while preserving translation quality, and it suggests a path toward applying QE-based deferral to other NLP tasks and to multi-stage cascades as QE signals improve.

Abstract

Larger models often outperform smaller ones but come with high computational costs. Cascading offers a potential solution. By default, it uses smaller models and defers only some instances to larger, more powerful models. However, designing effective deferral rules remains a challenge. In this paper, we propose a simple yet effective approach for machine translation, using existing quality estimation (QE) metrics as deferral rules. We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it for a small fraction (30% to 50%) of the examples, significantly reducing computational costs. We validate this approach through both automatic and human evaluation.

Paper Structure

This paper contains 25 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Cascaded translation system with QE-based deferral. A small model translates a batch of source sentences, and a relatively lightweight QE model scores the hypotheses. Sources with the lowest-scoring translations are deferred to a larger model. The extent of deferral is determined by a predefined compute budget.
  • Figure 2: Translation quality of cascading combining Tower-v2 7B and Tower-v2 70B according to MetricX, as the inference computation budget varies. Horizontal lines show the performance of each model alone.
  • Figure 3: Translation quality of cascaded systems with deferral based on wmt22-cometkiwi-da. Large model: Tower-v2 70B. Small models: Tower-v2 7B (L), Tower-v2 7B (top); EuroLLM 1.7B, EuroLLM 9B (bottom).
  • Figure 4: Translation quality of a cascaded system combining Tower-v2 7B and Tower-v2 70B according to human scores (on a scale from 0 to 100), as the inference computation budget varies. Systems in the shaded area are not significantly different from Tower-v2 70B according to the paired-permutation test with $p=0.01$.
  • Figure 5: Translation quality of a cascaded system combining Tower-v2 7B and Tower-v2 70B (in green) vs. QE reranking with hypotheses generated by Tower-v2 7B (in orange), measured with MetricX, as $X$ varies. Horizontal lines show the performance of the smaller and larger models alone.
  • ...and 3 more figures
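
The cascade described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `cascade_translate`, its callable parameters (`small_translate`, `large_translate`, `qe_score`), and the `deferral_fraction` argument are hypothetical names introduced here. It assumes that a higher QE score means a better translation and that the compute budget is expressed as the fraction of sources that may be sent to the larger model.

```python
from typing import Callable, List, Sequence


def cascade_translate(
    sources: Sequence[str],
    small_translate: Callable[[Sequence[str]], List[str]],
    large_translate: Callable[[Sequence[str]], List[str]],
    qe_score: Callable[[Sequence[str], Sequence[str]], List[float]],
    deferral_fraction: float,
) -> List[str]:
    """Translate with the small model, then defer the lowest-QE sources.

    All callables are placeholders (assumptions); in practice they would
    wrap real translation and QE models. Higher QE score = better is assumed.
    """
    if not 0.0 <= deferral_fraction <= 1.0:
        raise ValueError("deferral_fraction must be in [0, 1]")

    # Step 1: the small model translates the whole batch.
    hypotheses = list(small_translate(sources))

    # Step 2: the QE model scores each (source, hypothesis) pair.
    scores = qe_score(sources, hypotheses)

    # Step 3: the budget fixes how many sources may be deferred.
    n_defer = int(deferral_fraction * len(sources))

    # Step 4: pick the lowest-scoring hypotheses (sorted() is stable,
    # so ties are broken by input order).
    ranked = sorted(range(len(sources)), key=lambda i: scores[i])
    deferred = sorted(ranked[:n_defer])

    # Step 5: retranslate only the deferred sources with the large model.
    if deferred:
        large_outputs = large_translate([sources[i] for i in deferred])
        for i, output in zip(deferred, large_outputs):
            hypotheses[i] = output
    return hypotheses


# Toy stand-ins for the models, used only to show the control flow.
toy_scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.4}
result = cascade_translate(
    ["a", "b", "c", "d"],
    small_translate=lambda xs: [f"small({x})" for x in xs],
    large_translate=lambda xs: [f"large({x})" for x in xs],
    qe_score=lambda srcs, hyps: [toy_scores[s] for s in srcs],
    deferral_fraction=0.5,
)
print(result)  # ['small(a)', 'large(b)', 'small(c)', 'large(d)']
```

With a deferral fraction of 0.5, the two sources with the lowest toy QE scores (`b` and `d`) go to the large model and the rest keep the small model's output. Selecting by rank within a batch, rather than by a fixed score threshold, keeps the number of large-model calls exactly within the budget, which matches the batch-level description in the caption.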