Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
António Farinhas, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, André F. T. Martins
TL;DR
This paper tackles the high computational cost of large neural MT models with a cascaded translation system that uses a lightweight model by default and defers harder instances to a larger model. The deferral rule relies on existing quality estimation (QE) metrics, giving a simple, budget-aware mechanism that requires no task-specific deferral training. On WMT24, small-to-large cascades with QE-based deferral match the large model’s quality while deferring only 30% to 50% of inputs, a finding supported by both automatic metrics and human evaluation. The approach reduces compute cost while preserving translation quality, and it could extend to other NLP tasks and to multi-stage cascades as QE signals improve.
Abstract
Larger models often outperform smaller ones but come with high computational costs. Cascading offers a potential solution. By default, it uses smaller models and defers only some instances to larger, more powerful models. However, designing effective deferral rules remains a challenge. In this paper, we propose a simple yet effective approach for machine translation, using existing quality estimation (QE) metrics as deferral rules. We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it for a small fraction (30% to 50%) of the examples, significantly reducing computational costs. We validate this approach through both automatic and human evaluation.
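The deferral mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cascade_translate`, the callables `small_translate`, `large_translate`, and `qe_score`, and the rule "defer the lowest-scoring fraction of a batch" are all assumptions chosen for illustration. In practice the QE callable would wrap a reference-free QE metric that scores a (source, translation) pair, with higher scores meaning better estimated quality.

```python
import math
from typing import Callable, List, Sequence


def cascade_translate(
    sources: Sequence[str],
    small_translate: Callable[[str], str],   # hypothetical: cheap default model
    large_translate: Callable[[str], str],   # hypothetical: expensive fallback model
    qe_score: Callable[[str, str], float],   # hypothetical: QE(source, translation), higher is better
    budget: float = 0.3,                     # fraction of inputs allowed to reach the large model
) -> List[str]:
    """Translate with the small model, then defer the lowest-QE outputs to the large model."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be in [0, 1]")

    # Step 1: every input is translated by the small model.
    drafts = [small_translate(src) for src in sources]

    # Step 2: score each draft without a reference translation.
    scores = [qe_score(src, hyp) for src, hyp in zip(sources, drafts)]

    # Step 3: pick the indices with the lowest QE scores, up to the budget.
    n_defer = math.floor(budget * len(sources))
    ranked = sorted(range(len(sources)), key=lambda i: scores[i])
    deferred = set(ranked[:n_defer])

    # Step 4: re-translate only the deferred inputs with the large model.
    return [
        large_translate(src) if i in deferred else drafts[i]
        for i, src in enumerate(sources)
    ]
```

Under this sketch the small model and the QE metric run on every input, so the saving depends on both being cheap relative to the large model. A fixed score threshold could replace the batch-level ranking when inputs arrive one at a time; that variant is likewise an assumption, not something the abstract specifies.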
