Are Expert-Level Language Models Expert-Level Annotators?

Yu-Min Tseng; Wei-Lin Chen; Chung-Chi Chen; Hsin-Hsi Chen

TL;DR

This work presents the first systematic evaluation of LLMs as expert-level data annotators, investigating comprehensive approaches across three highly specialized domains and discussing practical suggestions from a cost-effectiveness perspective.

Abstract

Data annotation refers to the labeling or tagging of textual data with relevant information. A large body of work has reported positive results on leveraging LLMs as an alternative to human annotators. However, existing studies focus on classic NLP tasks, and how well LLMs perform as data annotators in domains requiring expert knowledge remains underexplored. In this work, we investigate comprehensive approaches across three highly specialized domains and discuss practical suggestions from a cost-effectiveness perspective. To the best of our knowledge, we present the first systematic evaluation of LLMs as expert-level data annotators.

Paper Structure (23 sections, 18 figures, 2 tables)

This paper contains 23 sections, 18 figures, 2 tables.

Introduction
Datasets
Finance
Biomedicine
Law
LLMs as Expert Annotators
Methods
Vanilla
CoT
Self-Consistency
Self-Refine
Results
Multi-Agent Annotation
Methods
Majority Vote
...and 8 more sections

Figures (18)

Figure 1: The degree of expert-level performance reached by state-of-the-art (SOTA) LLMs. For MMLU, we report model scores from the HELM website (Liang et al., 2023) divided by the human-expert score (89.8) from Hendrycks et al. (2020).
Figure 2: The performance comparison of different single-LLM settings ($S$) and multi-agent frameworks ($M$) across three domains. For the two single-agent settings, numbers on the figure represent the average performance of the three single LLMs (GPT-4o, Gemini-1.5-Pro, and Claude-3-Opus), and red bars indicate the range of performance. An asterisk ($^*$) indicates statistical significance (p-value < 0.05).
Figure 3: An illustration of the cost-effectiveness relationship of various setups. The x-axis represents the cost per instance in USD, and the y-axis represents the accuracy in percentage. Note that the x-axis is reversed relative to the usual orientation, with higher costs on the left and lower costs on the right. The upper-right corner of the figure therefore indicates a better trade-off, combining lower cost and higher accuracy.
Figure 4: Marginal performance of each LLM during the multi-agent peer-discussion process. The performance in Round 0 indicates the LLMs' initial annotation performance, while the performance in Round 1 and Round 2 indicates the LLMs' annotation performance after one and two rounds of discussion, respectively.
Figure 5: The annotation guideline of the REFinD dataset.
...and 13 more figures
