The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact

Dhaathri Vijay, Anandaswarup Vadapalli

TL;DR

The paper examines how distillation and quantization affect translation quality and efficiency, motivated by the high computational and environmental costs of large language models. It compares full, distilled, and quantized NLLB-200 variants using BLEU on the Flores+ benchmark and human judgments across French, Hindi, and Kannada. Distillation reduces inference time by 71–78% and carbon emissions by 63–65% with only modest BLEU losses, and aggressive INT4 quantization preserves accuracy and fluency in several cases, though low-resource Kannada degrades more sharply. The work calls for evaluation frameworks that weigh efficiency and environmental impact alongside accuracy when deploying NLP models.

Abstract

The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis revealed that the full 3.3B FP32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (approximately 0.007–0.008 kg CO2 per run). The distilled 600M FP32 model reduced inference time by 71–78% and carbon emissions by 63–65% compared with the full model, with only minimal reductions in BLEU scores. Human evaluations further showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with differences between models generally minor. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that integrate efficiency and sustainability alongside accuracy as central dimensions of progress in NLP.

Paper Structure

This paper contains 10 sections, 3 tables.