Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs

Joseph Attieh, Timothee Mickus, Anne-Laure Ligozat, Aurélie Névéol, Jörg Tiedemann

TL;DR

Knowledge distillation for machine translation often focuses on translation quality while neglecting environmental costs. This work applies MLCA to decompose emissions across teacher training, distillation, and inference, and to compare KD variants under explicit service volumes. It finds that upfront distillation costs dominate at low usage, while per-token inference costs dominate at scale, with word-level KD generally offering more favorable footprint-quality trade-offs than sequence-level KD. The study provides practical, transparent guidance for selecting KD methods and reporting environmental impact, enabling greener deployment of MT systems.

Abstract

Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
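The accounting described in the abstract (operational emissions plus amortized hardware production, summed over life-cycle phases) can be sketched as follows. This is a minimal illustration and not the MLCA tool's actual interface: the names `Phase`, `operational_kg`, `embodied_kg`, and `life_cycle_footprint`, the linear amortization of production emissions over device lifetime, and every numeric value are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One life-cycle phase (e.g. teacher training, distillation, inference)."""
    name: str
    energy_kwh: float    # energy drawn during the phase
    device_hours: float  # accelerator-hours consumed by the phase


def operational_kg(energy_kwh: float, grid_kg_per_kwh: float) -> float:
    """Operational emissions: energy used times grid carbon intensity."""
    return energy_kwh * grid_kg_per_kwh


def embodied_kg(device_hours: float, production_kg: float,
                lifetime_hours: float) -> float:
    """Share of hardware production emissions attributed to this usage.

    Assumption: production emissions are spread linearly over the
    device's service lifetime.
    """
    return production_kg * device_hours / lifetime_hours


def life_cycle_footprint(phases, grid_kg_per_kwh, production_kg, lifetime_hours):
    """Return per-phase footprints (kgCO2e) and their total."""
    per_phase = {
        p.name: operational_kg(p.energy_kwh, grid_kg_per_kwh)
        + embodied_kg(p.device_hours, production_kg, lifetime_hours)
        for p in phases
    }
    return per_phase, sum(per_phase.values())


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    phases = [
        Phase("teacher_training", energy_kwh=400.0, device_hours=800.0),
        Phase("distillation", energy_kwh=150.0, device_hours=300.0),
        Phase("inference", energy_kwh=50.0, device_hours=100.0),
    ]
    per_phase, total = life_cycle_footprint(
        phases, grid_kg_per_kwh=0.1, production_kg=150.0, lifetime_hours=30000.0
    )
    for name, kg in per_phase.items():
        print(f"{name}: {kg:.2f} kgCO2e")
    print(f"total: {total:.2f} kgCO2e")
```

Keeping the phases separate, rather than reporting a single total, is what makes it possible to see which phase dominates at a given deployment volume.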

Paper Structure (29 sections, 4 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: System boundary of the MLCA highlighted by the red dashed line.
  • Figure 2: Total footprint (kgCO$_2$e) for all setups for a fixed served volume $X$. Non-teacher models are ordered by their COMET score on the FLORES+ test set (scores not shown for simplicity).
  • Figure 3: Global Pareto frontiers across different student model sizes, accounting for total carbon footprint to produce a model for each method. Shaded regions show COMET CIs computed via paired bootstrap resampling over documents (N=1000, 95% confidence); the dashed line marks the teacher COMET.
  • Figure 4: Amortization of distillation cost vs. inference volume $X$. Each curve shows total life cycle emissions $I(X)=I_{\text{prod}} + X\cdot c_{\text{infer}}$ for deploying the teacher (black), a No-KD student (gray), and students (one curve per model on the Pareto frontier). Markers denote break-even points where students become less costly.
  • Figure 5: Size (in millions of sentence pairs) of the main parallel-corpus sources used for English-Icelandic.
  • ...and 1 more figures
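
The amortization relation in the Figure 4 caption, $I(X)=I_{\text{prod}} + X\cdot c_{\text{infer}}$, implies a break-even volume at which a distilled student becomes less costly than a reference system. The sketch below solves for that volume under the linear model. The function names and all numbers are hypothetical, and it assumes that a student's $I_{\text{prod}}$ already includes whatever upfront costs (teacher training, distillation) are attributed to it.

```python
from typing import Optional


def total_emissions(i_prod: float, c_infer: float, x: float) -> float:
    """Linear life-cycle model: I(X) = I_prod + X * c_infer."""
    return i_prod + x * c_infer


def break_even_volume(i_prod_ref: float, c_ref: float,
                      i_prod_student: float, c_student: float) -> Optional[float]:
    """Volume X at which the student's total matches the reference's.

    Returns 0.0 if the student is already no more costly at X = 0, and
    None if the student never catches up (higher upfront cost without a
    lower per-unit inference cost).
    """
    extra_upfront = i_prod_student - i_prod_ref
    if extra_upfront <= 0:
        return 0.0
    saving_per_unit = c_ref - c_student
    if saving_per_unit <= 0:
        return None
    return extra_upfront / saving_per_unit


if __name__ == "__main__":
    # Hypothetical values (kgCO2e upfront, kgCO2e per served unit).
    teacher = {"i_prod": 100.0, "c_infer": 0.5}
    student = {"i_prod": 150.0, "c_infer": 0.25}

    x_star = break_even_volume(
        teacher["i_prod"], teacher["c_infer"],
        student["i_prod"], student["c_infer"],
    )
    print("break-even volume:", x_star)
    if x_star is not None:
        print("teacher total at break-even:", total_emissions(**teacher, x=x_star))
        print("student total at break-even:", total_emissions(**student, x=x_star))
```

Below the break-even volume the extra upfront cost is not yet recovered, so the reference system has the smaller total; above it, the student's lower per-unit inference cost outweighs that overhead.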