On Importance of Pruning and Distillation for Efficient Low Resource NLP
Aishwarya Mirashi, Purva Lingayat, Srushti Sonavane, Tejas Padhiyar, Raviraj Joshi, Geetanjali Kale
TL;DR
This work tackles the resource inefficiency of large transformer models in low-resource NLP by optimizing Marathi models. It applies Block Movement Pruning, Knowledge Distillation, and Mixed Precision to the Marathi baseline marathi-topic-all-doc-v2 on the MahaNews dataset, evaluating pruning levels of 25%, 50%, and 75% and their combinations with distillation. The study demonstrates that a 25% pruning level combined with knowledge distillation delivers the best time-accuracy trade-off, achieving up to 2.56x speedups with near-baseline accuracy and reduced model size, while also analyzing environmental impact via CodeCarbon. These results offer a practical path toward greener, faster NLP for low-resource languages and highlight the importance of efficiency-aware model design for real-world deployment.
Abstract
The rise of large transformer models has revolutionized Natural Language Processing, leading to significant advances in tasks like text classification. However, this progress demands substantial computational resources, and training time and cost escalate as models grow. Efforts have been made to downsize and accelerate English models (e.g., DistilBERT, MobileBERT), yet research in this area is scarce for low-resource languages. In this study, we explore the case of the low-resource Indic language Marathi. Leveraging the marathi-topic-all-doc-v2 model from L3Cube as our baseline, we implement optimization techniques to reduce computation time and memory usage. Our focus is on enhancing the efficiency of Marathi transformer models while maintaining top-tier accuracy. Using the MahaNews document classification dataset, we apply Block Movement Pruning, Knowledge Distillation, and Mixed Precision methods individually and in combination to boost efficiency. We demonstrate the importance of strategic pruning levels in achieving desired efficiency gains. Furthermore, we analyze the balance between efficiency improvements and environmental impact, highlighting how optimized model architectures can contribute to a more sustainable computational ecosystem. Implementing these techniques on a single GPU system, we determine that the optimal configuration is 25% pruning + knowledge distillation. This approach yielded a 2.56x speedup in computation time while maintaining baseline accuracy levels.
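For readers unfamiliar with knowledge distillation, the following is a minimal sketch of one common formulation of the training loss: a weighted blend of hard-label cross-entropy and a temperature-softened KL divergence between teacher and student outputs. It is illustrative only. The abstract does not specify the loss used in this work, and the function names, the `temperature` and `alpha` parameters, and their default values are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with temperature-scaled KL(teacher || student).

    `temperature` and `alpha` are hypothetical hyperparameters chosen for
    illustration; they are not values reported in the source.
    """
    # Hard-label term: cross-entropy of the student against the true class.
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[label])

    # Soft-label term: KL divergence between softened teacher and student outputs.
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft) if t > 0)

    # Scaling by T^2 keeps the soft-term gradient magnitude comparable across temperatures.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft
```

In a setup like the one described above, the unpruned baseline would play the teacher and the pruned model the student. When the student reproduces the teacher's logits exactly, the soft term vanishes and only the hard-label term remains.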
