LlamBERT: Large-scale low-cost data annotation in NLP
Bálint Csanády, Lajos Muzsai, Péter Vedres, Zoltán Nádasdy, András Lukács
TL;DR
This work tackles the high cost of large-scale annotation with LlamBERT, a hybrid framework: Llama 2 labels a small subset of an unlabeled corpus, compact transformer encoders such as BERT and RoBERTa are fine-tuned on those labels, and the fine-tuned model then annotates the full corpus. In experiments on IMDb and UMLS, LlamBERT comes close to baseline accuracy at a substantially lower labeling cost, and combining LLM-labeled data with gold-standard labels yields state-of-the-art results on IMDb. The results suggest practical benefits for large-scale NLP tasks where labeling cost is the bottleneck; future work may explore parameter-efficient fine-tuning (PEFT) techniques to further improve data efficiency.
Abstract
Large Language Models (LLMs), such as GPT-4 and Llama 2, show remarkable proficiency in a wide range of natural language processing (NLP) tasks. Despite their effectiveness, the high costs associated with their use pose a challenge. We present LlamBERT, a hybrid approach that leverages LLMs to annotate a small subset of large, unlabeled databases and uses the results for fine-tuning transformer encoders like BERT and RoBERTa. This strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Metathesaurus. Our results indicate that the LlamBERT approach slightly compromises on accuracy while offering much greater cost-effectiveness.
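The data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `llambert_pipeline`, its parameters, and both toy stand-ins are hypothetical. A keyword rule plays the role of the LLM annotator (Llama 2 in the paper) and a word-count scorer plays the role of the fine-tuned encoder (BERT or RoBERTa in the paper), so that the example runs without any model downloads.

```python
import random
from collections import Counter


def llambert_pipeline(corpus, llm_label, train_small_model, subset_size, seed=0):
    """Label a random subset with an expensive annotator, fit a cheap model
    on those labels, then use the cheap model to annotate the whole corpus."""
    rng = random.Random(seed)
    subset = rng.sample(corpus, min(subset_size, len(corpus)))
    # Expensive step: one annotator call per subset item only.
    silver = [(text, llm_label(text)) for text in subset]
    # Cheap step: fit the small model on the LLM-produced labels.
    predict = train_small_model(silver)
    # Cheap step: run the small model over the full corpus.
    return [predict(text) for text in corpus]


def toy_llm_label(text):
    """Hypothetical stand-in for an LLM prompt returning 1 (positive) or 0."""
    return 1 if ("good" in text or "great" in text) else 0


def toy_train(labeled):
    """Hypothetical stand-in for fine-tuning: per-class word counts."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labeled:
        counts[label].update(text.split())

    def predict(text):
        # Positive if the words were seen more often with label 1 than 0.
        score = sum(counts[1][w] - counts[0][w] for w in text.split())
        return 1 if score > 0 else 0

    return predict
```

In a realistic setting, `llm_label` would wrap a prompted LLM call and `train_small_model` would wrap an encoder fine-tuning loop. The point of the structure is that the number of expensive annotator calls is bounded by `subset_size`, not by the size of the corpus.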
