LLMs as Data Annotators: How Close Are We to Human Performance

Muhammad Uzair Ul Haq, Davide Rigoni, Alessandro Sperduti

TL;DR

The paper tackles the data annotation bottleneck in NLP by evaluating LLM-based NER annotation across zero-shot, in-context learning (ICL), and retrieval-augmented generation (RAG) settings. It systematically compares models of approximately 7B and 70B parameters, using two embedding methods and four diverse datasets, to measure how close LLMs come to human performance. Results show that retrieval-augmented approaches generally outperform baselines and ICL, with large models and OpenAI embeddings achieving near-human performance on structured datasets such as CoNLL-2003, while challenging datasets like SKILLSPAN reveal remaining gaps. The work provides actionable guidance on LLM selection, embedding selection, and context strategy, and it highlights the value of more demanding benchmarks and advanced retrieval techniques for scaling data annotation pipelines.

Abstract

In NLP, fine-tuning LLMs is effective for various applications but requires high-quality annotated data. However, manual annotation of data is labor-intensive, time-consuming, and costly. Therefore, LLMs are increasingly used to automate the process, often employing in-context learning (ICL) in which some examples related to the task are given in the prompt for better performance. However, manually selecting context examples can lead to inefficiencies and suboptimal model performance. This paper presents comprehensive experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task. The evaluation encompasses models with approximately $7$B and $70$B parameters, including both proprietary and non-proprietary models. Furthermore, leveraging the success of Retrieval-Augmented Generation (RAG), it also considers a method that addresses the limitations of ICL by automatically retrieving contextual examples, thereby enhancing performance. The results highlight the importance of selecting the appropriate LLM and embedding model, understanding the trade-offs between LLM sizes and desired performance, and the necessity to direct research efforts towards more challenging datasets.
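To make the retrieval idea in the abstract concrete, the following is a minimal sketch of how context examples might be selected automatically and placed in a prompt. It is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `retrieve`, `build_prompt`, the example pool, and the prompt layout are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real pipeline would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, pool: list[tuple[str, str]], k: int) -> list[tuple[str, str]]:
    """Return the k annotated examples whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(task: str, examples: list[tuple[str, str]], sentence: str) -> str:
    """Assemble task description, retrieved examples, and the sentence to annotate."""
    lines = [task, ""]
    for text, labels in examples:
        lines += [f"Sentence: {text}", f"Entities: {labels}", ""]
    lines += [f"Sentence: {sentence}", "Entities:"]
    return "\n".join(lines)


# Hypothetical pool of human-annotated examples (illustrative data only).
pool = [
    ("Alice works at Acme in Paris", "Alice=PER; Acme=ORG; Paris=LOC"),
    ("The match ended in a draw", "none"),
    ("Bob moved to Berlin", "Bob=PER; Berlin=LOC"),
]
query = "Carol works at Initech in Rome"
top = retrieve(query, pool, k=2)
prompt = build_prompt("Label the named entities.", top, query)
```

In contrast to manually chosen ICL examples, the context here changes per input sentence, which is the property the abstract attributes to the retrieval-based method.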

Paper Structure

This paper contains 39 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Workflow of the proposed approach. $\mathcal{D}_{train}$ denotes the training data, $\mathcal{X}$ denotes the few human-annotated examples, and $\mathcal{T}$ denotes the training instances to be annotated by the LLM. For each entry $\mathcal{T}_i \in \mathcal{T}$, we extract context examples $\mathcal{M}$ from a vector store using a retriever module. Then, given an input sentence, the final prompt to the LLM consists of the task description, the context examples in $\mathcal{M}$, and the input sentence.
  • Figure 2: Heatmaps of the $F_1$ scores across four datasets. The color scale represents performance, with red indicating higher scores, up to human-level performance, and blue indicating lower scores, down to the lowest-performing model.
  • Figure 3: $F_1$ scores for different context sizes ($25$, $50$, and $75$) and sample spaces ($10$% and $20$%) for the RAG and ICL approaches on the SKILLSPAN dataset, using the gpt-4o-mini model. The plot indicates that with a smaller sample size, the RAG approach performs comparably to ICL.
  • Figure 4: Critical Difference diagram of average score ranks. Models connected by a horizontal line show no statistically significant difference. Models with lower ranks perform better than those with higher ranks.
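
The $F_1$ scores referenced in the figure captions above can be illustrated with a short sketch. The exact scoring protocol is not stated in this section, so the code below assumes exact-match, entity-level scoring, where a predicted entity counts as correct only if its start, end, and type all match a gold entity. The function name `span_f1` and the sample spans are hypothetical.

```python
def span_f1(gold: set[tuple[int, int, str]], pred: set[tuple[int, int, str]]) -> float:
    """Entity-level F1 under exact match (assumption: not necessarily the paper's protocol)."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # fraction of predictions that are correct
    recall = true_positives / len(gold)     # fraction of gold entities that were found
    return 2 * precision * recall / (precision + recall)


# Illustrative spans as (start_token, end_token, entity_type); not data from the paper.
gold = {(0, 1, "PER"), (3, 4, "ORG"), (5, 6, "LOC")}
pred = {(0, 1, "PER"), (3, 4, "LOC")}  # second span has the right boundaries but the wrong type
score = span_f1(gold, pred)
```

With one correct prediction out of two, and one of three gold entities found, precision is $1/2$ and recall is $1/3$, so $F_1 = 2 \cdot \frac{1}{2} \cdot \frac{1}{3} / (\frac{1}{2} + \frac{1}{3}) = 0.4$.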