Wisdom of Instruction-Tuned Language Model Crowds. Exploring Model Label Variation

Flor Miriam Plaza-del-Arco, Debora Nozza, Dirk Hovy

TL;DR

This study treats instruction-tuned LLMs as annotators for subjective text classification tasks across multiple languages, examining how model specialization and label disagreement can be exploited via aggregation. Using zero-shot and few-shot prompting across five tasks and two datasets, the authors show that aggregating labels with majority voting or Bayesian MACE often surpasses any individual LLM, though still lags behind supervised models trained on labeled data. Few-shot prompting provides little consistent gain and introduces high task-dependent variance, while no information-theoretic seed selection reliably improves performance. The findings suggest cost-effective benefits from aggregating diverse LLMs, but emphasize that human annotation or supervised learning remains superior for accuracy, with ethical considerations shaping the adoption of LLM-based annotation in practice.

Abstract

Large Language Models (LLMs) exhibit remarkable text classification capabilities, excelling in zero- and few-shot learning (ZSL and FSL) scenarios. However, since they are trained on different datasets, their performance varies widely across tasks and models. Recent studies emphasize the importance of considering human label variation in data annotation, but whether this label variation also applies to LLMs remains unexplored. Given this likely model specialization, we ask: Do aggregate LLM labels improve over individual models (as for human annotators)? We evaluate four recent instruction-tuned LLMs as annotators on five subjective tasks across four languages. We use ZSL and FSL setups and label aggregation methods from human annotation. Aggregations are indeed substantially better than any individual model, benefiting from specialization in diverse tasks or languages. Surprisingly, FSL does not surpass ZSL, as it depends on the quality of the selected examples, and there seems to be no good information-theoretical strategy to select those. We find that no LLM method rivals even simple supervised models. We also discuss the tradeoffs in accuracy, cost, and moral/ethical considerations between LLM and human annotation.
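
To make the aggregation idea concrete, here is a minimal sketch of majority voting over labels produced by several LLM annotators. It is an illustration under stated assumptions, not the authors' implementation: the model names, the labels, and the alphabetical tie-break rule are all hypothetical, and the Bayesian MACE alternative is not shown.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label among the annotators for one item."""
    counts = Counter(labels)
    best = max(counts.values())
    # Tie-break: pick the alphabetically first label. This rule is an
    # arbitrary assumption for determinism, not something from the paper.
    return min(label for label, n in counts.items() if n == best)


def aggregate(annotations):
    """Aggregate per-model label lists (aligned by item index) item by item."""
    per_item = zip(*annotations.values())
    return [majority_vote(item_labels) for item_labels in per_item]


# Hypothetical predictions from four models on three items.
annotations = {
    "model_a": ["positive", "negative", "negative"],
    "model_b": ["positive", "positive", "negative"],
    "model_c": ["negative", "positive", "negative"],
    "model_d": ["positive", "negative", "positive"],
}

aggregated = aggregate(annotations)
print(aggregated)  # ['positive', 'negative', 'negative']
```

The second item is a 2-2 tie, so its label depends entirely on the tie-break rule; with an even number of annotators, that choice can matter in practice.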

Paper Structure (20 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Instructions used to prompt the instruction-tuned LLMs for each classification task.
  • Figure 2: ZSL vs. FSL Macro-F1 scores on English Trustpilot tasks. FSL sample selection strategies: Low Entropy ($\downarrow$ E), Max Entropy ($\uparrow$ E), and Random (Rand). All FSL methods show much greater variance than ZSL.