Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs
Joseph Attieh, Timothee Mickus, Anne-Laure Ligozat, Aurélie Névéol, Jörg Tiedemann
TL;DR
Work on knowledge distillation (KD) for machine translation (MT) often focuses on translation quality while neglecting environmental costs. This work applies machine learning life cycle assessment (MLCA) to decompose emissions across teacher training, distillation, and inference, and to compare KD variants under explicit service volumes. It finds that upfront distillation costs dominate at low usage, while per-token inference costs dominate at scale, with word-level KD generally offering more favorable footprint-quality trade-offs than sequence-level KD. The study provides practical, transparent guidance for selecting KD methods and reporting environmental impact, enabling greener deployment of MT systems.
Abstract
Knowledge distillation (KD) is a technique for compressing a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational cost of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
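The usage threshold in finding (ii) can be illustrated with a minimal sketch. It assumes a simple linear cost model (a fixed upfront footprint plus a constant per-token inference footprint), which is a simplification for illustration and not the MLCA tool or the authors' protocol. The function names and all numeric values below are hypothetical placeholders, not results from the study.

```python
def total_footprint(upfront_kg, per_token_kg, tokens):
    """Upfront emissions plus per-token inference emissions, in kg CO2e."""
    return upfront_kg + per_token_kg * tokens


def break_even_tokens(distill_kg, teacher_per_token_kg, student_per_token_kg):
    """Tokens served before the student's inference savings repay distillation.

    Returns None when the student is not cheaper per token than the teacher,
    in which case distillation never pays off under this linear model.
    """
    saving = teacher_per_token_kg - student_per_token_kg
    if saving <= 0:
        return None
    return distill_kg / saving


# Hypothetical placeholder values (kg CO2e); NOT figures from the study.
TEACHER_TRAIN_KG = 100.0       # paid whether or not we distill
DISTILL_KG = 20.0              # extra upfront cost of running KD
TEACHER_PER_TOKEN_KG = 4e-6    # inference footprint of the teacher
STUDENT_PER_TOKEN_KG = 1e-6    # inference footprint of the student

threshold = break_even_tokens(DISTILL_KG, TEACHER_PER_TOKEN_KG, STUDENT_PER_TOKEN_KG)
print(f"break-even volume: {threshold:.3e} tokens")

# Compare the two deployment options at a low and a high service volume.
for tokens in (1e6, 1e8):
    teacher_only = total_footprint(TEACHER_TRAIN_KG, TEACHER_PER_TOKEN_KG, tokens)
    with_kd = total_footprint(TEACHER_TRAIN_KG + DISTILL_KG, STUDENT_PER_TOKEN_KG, tokens)
    print(f"{tokens:.0e} tokens: teacher={teacher_only:.1f} kg, student={with_kd:.1f} kg")
```

Under these assumptions, the teacher training cost appears on both sides and cancels, so the threshold depends only on the distillation overhead and the per-token saving. Below the threshold the distilled student has the larger total footprint; above it, the student is the lower-emission option.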
