Feeding LLM Annotations to BERT Classifiers at Your Own Risk
Yucheng Lu, Kazimier Smith
TL;DR
This work assesses the reliability of using LLM-generated labels to fine-tune encoder-only text classifiers such as RoBERTa. It shows that synthetic labels introduce non-random degradation, unstable predictions, and premature learning plateaus, with the effects worsening on complex and imbalanced tasks. The authors formalize the issue with a KL-divergence framework that decomposes the error into an irreducible approximation term $KL(P\|P_S)$ and an estimation term, which explains why simple mitigations struggle. They evaluate two lightweight remedies, entropy-based filtering and consistency ensembles, and find partial improvements but no robust solution, underscoring the need for caution in high-stakes applications and for more principled data-quality controls. Overall, the results imply that synthetic labeling can be cost-effective for simple tasks but risks substantial reliability and stability problems in real-world text classification.
Abstract
Using LLM-generated labels to fine-tune smaller encoder-only models for text classification has gained popularity in various settings. While this approach may be justified in simple and low-stakes applications, we conduct an empirical analysis to demonstrate how the perennial curse of training on synthetic data manifests itself in this specific setup. Compared to models trained on gold labels, we observe not only the expected degradation in accuracy and F1 score, but also increased instability across training runs and premature performance plateaus. These findings cast doubt on the reliability of such approaches in real-world applications. We contextualize the observed phenomena through the lens of error propagation and offer several practical mitigation strategies, including entropy-based filtering and ensemble techniques. Although these heuristics offer partial relief, they do not fully resolve the inherent risks of propagating non-random errors from LLM annotations to smaller classifiers, underscoring the need for caution when applying this workflow in high-stakes text classification tasks.
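To make the entropy-based filtering idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each example has been annotated several times by an LLM (for instance, across repeated samples or prompts), measures the Shannon entropy of those votes, and keeps only examples whose votes are sufficiently concentrated. The function names `label_entropy` and `filter_by_entropy`, the vote format, and the threshold value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def label_entropy(votes):
    """Shannon entropy (in nats) of the empirical distribution of LLM votes.

    Assumes `votes` is a non-empty list of hashable labels.
    """
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def filter_by_entropy(samples, max_entropy=0.5):
    """Keep examples whose LLM votes are confident enough.

    `samples` is a list of (text, votes) pairs. An example is kept when the
    entropy of its votes is at most `max_entropy` (a hypothetical threshold),
    and it is assigned the majority label among its votes.
    """
    kept = []
    for text, votes in samples:
        if label_entropy(votes) <= max_entropy:
            majority = Counter(votes).most_common(1)[0][0]
            kept.append((text, majority))
    return kept


if __name__ == "__main__":
    samples = [
        ("great product", ["pos", "pos", "pos"]),       # unanimous votes
        ("it was fine", ["pos", "neg", "pos", "neg"]),  # evenly split votes
        ("not for me", ["neg", "neg", "neg", "pos"]),   # mostly agreeing votes
    ]
    print(filter_by_entropy(samples))
```

A filter of this kind discards examples on which the annotator is visibly uncertain. It cannot detect errors the LLM makes consistently, which is in line with the abstract's point that such heuristics offer only partial relief against non-random annotation errors.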
