Exploring Variability in Fine-Tuned Models for Text Classification with DistilBERT
Giuliano Lorenzoni, Ivens Portugal, Paulo Alencar, Donald Cowan
TL;DR
The paper investigates how hyperparameter interactions shape performance when fine-tuning DistilBERT for text classification, focusing on learning rate, batch size, and number of epochs. It uses two polynomial regression frameworks, one absolute (modeling the metric values themselves) and one relative (modeling differences from a baseline model), to analyze accuracy, F1-score, and loss across 55 DistilBERT variants from Hugging Face. Batch size has a significant effect on accuracy and F1-score in the absolute analysis, learning rate drives incremental changes in the relative analysis, and the interaction between epochs and batch size is crucial for optimizing F1-score. The work advocates adaptive, metric-aware fine-tuning frameworks that account for non-linear hyperparameter effects and cross-metric trade-offs, with implications for other NLP tasks, computer vision, and broader LLM tuning practice.
Abstract
This study evaluates fine-tuning strategies for text classification using the DistilBERT model, specifically the distilbert-base-uncased-finetuned-sst-2-english variant. Through structured experiments, we examine the influence of hyperparameters such as learning rate, batch size, and epochs on accuracy, F1-score, and loss. Polynomial regression analyses capture foundational and incremental impacts of these hyperparameters, focusing on fine-tuning adjustments relative to a baseline model. Results reveal variability in metrics due to hyperparameter configurations, showing trade-offs among performance metrics. For example, a higher learning rate reduces loss in the relative analysis (p=0.027) but makes accuracy improvements harder to achieve. Meanwhile, batch size significantly impacts accuracy and F1-score in the absolute regression (p=0.028 and p=0.005) but has limited influence on loss optimization (p=0.170). The interaction between epochs and batch size maximizes F1-score (p=0.001), underscoring the importance of hyperparameter interplay. These findings highlight the need for fine-tuning strategies that address non-linear hyperparameter interactions to balance performance across metrics. Such variability and metric trade-offs are relevant for tasks beyond text classification, including other NLP tasks and computer vision. This analysis informs fine-tuning strategies for large language models and promotes adaptive designs for broader model applicability.
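As a rough illustration of the kind of analysis the abstract describes, the sketch below fits a degree-2 polynomial regression with pairwise interaction terms to learning rate, batch size, and epochs, once against a metric itself ("absolute") and once against its difference from a baseline ("relative"). It is not the authors' code. The data are synthetic, and the hyperparameter grids, the baseline value, the standardization step, and the names `design_matrix` and `fit_least_squares` are all assumptions made for the example. Only the count of 55 variants comes from the text.

```python
import numpy as np


def design_matrix(lr, batch_size, epochs):
    """Degree-2 polynomial features: intercept, linear, squared, pairwise interactions."""
    x = np.column_stack([lr, batch_size, epochs]).astype(float)
    # Standardize each hyperparameter so coefficients are on comparable scales
    # (an assumption of this sketch, not something stated in the source).
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    a, b, c = x.T
    return np.column_stack(
        [np.ones(len(a)), a, b, c, a * a, b * b, c * c, a * b, a * c, b * c]
    )


def fit_least_squares(X, y):
    """Ordinary least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Synthetic stand-in for 55 fine-tuned variants (hypothetical values throughout).
rng = np.random.default_rng(0)
n = 55
lr = rng.choice([1e-5, 2e-5, 3e-5, 5e-5], size=n)
batch_size = rng.choice([8, 16, 32, 64], size=n)
epochs = rng.choice([1, 2, 3, 4, 5], size=n)

X = design_matrix(lr, batch_size, epochs)

# Fake accuracy with a batch-size effect and an epochs x batch-size interaction.
accuracy = 0.90 + 0.01 * X[:, 2] + 0.005 * X[:, 9] + rng.normal(0.0, 0.005, size=n)
baseline_accuracy = 0.91  # hypothetical baseline-model accuracy

# "Absolute" analysis: model the metric itself.
coef_abs = fit_least_squares(X, accuracy)
# "Relative" analysis: model the change with respect to the baseline.
coef_rel = fit_least_squares(X, accuracy - baseline_accuracy)

names = ["1", "lr", "bs", "ep", "lr^2", "bs^2", "ep^2", "lr*bs", "lr*ep", "bs*ep"]
for name, ca, cr in zip(names, coef_abs, coef_rel):
    print(f"{name:>6}  absolute={ca:+.4f}  relative={cr:+.4f}")
```

Because the baseline here is a single constant, the two fits differ only in the intercept. A relative analysis that also expresses the hyperparameters as differences from the baseline configuration would need a different design matrix. The p-values quoted in the abstract require a fit that reports coefficient standard errors; a statistics package such as statsmodels could supply them, which this NumPy-only sketch does not attempt.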
