LlamBERT: Large-scale low-cost data annotation in NLP

Bálint Csanády; Lajos Muzsai; Péter Vedres; Zoltán Nádasdy; András Lukács

TL;DR

LlamBERT addresses the high cost of large-scale annotation with a hybrid pipeline: Llama 2 labels a manageable subset of an unlabeled corpus, compact transformer encoders such as BERT and RoBERTa are fine-tuned on those labels, and the fine-tuned model is then applied to the full corpus. In experiments on IMDb and UMLS, LlamBERT achieves near-baseline accuracy at substantially lower labeling cost, and combining LLM-labeled data with gold-standard supervision yields state-of-the-art results on IMDb. The results suggest that LLM labeling followed by supervised fine-tuning of smaller encoders can scale annotation for NLP tasks where labeling cost is a bottleneck. Future work will explore PEFT techniques to further improve data efficiency.

Abstract

Large Language Models (LLMs), such as GPT-4 and Llama 2, show remarkable proficiency in a wide range of natural language processing (NLP) tasks. Despite their effectiveness, the high costs associated with their use pose a challenge. We present LlamBERT, a hybrid approach that leverages LLMs to annotate a small subset of large, unlabeled databases and uses the results for fine-tuning transformer encoders like BERT and RoBERTa. This strategy is evaluated on two diverse datasets: the IMDb review dataset and the UMLS Meta-Thesaurus. Our results indicate that the LlamBERT approach slightly compromises on accuracy while offering much greater cost-effectiveness.
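
The three stages described above (LLM annotation of a subset, fine-tuning a smaller model on those labels, and applying that model to the whole corpus) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `llambert_pipeline`, `toy_llm_label`, and `toy_fine_tune` are hypothetical names, the "LLM" is a keyword rule standing in for a Llama 2 prompt, and the "encoder" is a word-vote classifier standing in for a fine-tuned BERT or RoBERTa.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Label = int
Labeler = Callable[[str], Label]


def llambert_pipeline(
    corpus: Sequence[str],
    llm_label: Labeler,
    fine_tune: Callable[[List[Tuple[str, Label]]], Labeler],
    subset_size: int,
    seed: int = 0,
) -> List[Label]:
    """Label a random subset with the expensive LLM, fit a cheap model on
    those labels, then annotate the whole corpus with the cheap model."""
    rng = random.Random(seed)
    subset = rng.sample(list(corpus), min(subset_size, len(corpus)))
    # The costly LLM is called only here, once per subset item.
    llm_labeled = [(text, llm_label(text)) for text in subset]
    small_model = fine_tune(llm_labeled)
    return [small_model(text) for text in corpus]


def toy_llm_label(text: str) -> Label:
    """Hypothetical stand-in for prompting Llama 2 (1 = positive, 0 = negative)."""
    return 1 if "good" in text or "great" in text else 0


def toy_fine_tune(examples: List[Tuple[str, Label]]) -> Labeler:
    """Hypothetical stand-in for fine-tuning an encoder: per-word label votes."""
    votes = {}
    for text, label in examples:
        for word in text.split():
            votes.setdefault(word, Counter())[label] += 1

    def predict(text: str) -> Label:
        total = Counter()
        for word in text.split():
            total.update(votes.get(word, {}))
        # Default to the negative class when no known word appears.
        return total.most_common(1)[0][0] if total else 0

    return predict


if __name__ == "__main__":
    reviews = ["good movie", "great film", "bad movie", "awful film"] * 5
    labels = llambert_pipeline(reviews, toy_llm_label, toy_fine_tune, subset_size=8)
    print(len(labels), "documents labeled using 8 LLM calls")
```

The point of this structure is that the number of LLM calls depends on `subset_size` rather than on the corpus size. In a real setting the two stand-ins would be replaced by an actual LLM prompt and an actual fine-tuning routine.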
