LLM Confidence Evaluation Measures in Zero-Shot CSS Classification

David Farr, Iain Cruickshank, Nico Manzonelli, Nicholas Clark, Kate Starbird, Jevin West

TL;DR

This paper proposes an uncertainty quantification (UQ) performance measure tailored for data annotation tasks and a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs.

Abstract

Assessing classification confidence is critical for leveraging large language models (LLMs) in automated labeling tasks, especially in the sensitive domains presented by Computational Social Science (CSS) tasks. In this paper, we make three key contributions: (1) we propose an uncertainty quantification (UQ) performance measure tailored for data annotation tasks, (2) we compare, for the first time, five different UQ strategies across three distinct LLMs and CSS data annotation tasks, and (3) we introduce a novel UQ aggregation strategy that effectively identifies low-confidence LLM annotations and disproportionately uncovers data incorrectly labeled by the LLMs. Our results demonstrate that our proposed UQ aggregation strategy improves upon existing methods and can be used to significantly improve human-in-the-loop data annotation processes.

Paper Structure

This paper contains 15 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Percent of incorrect data annotations identified as a function of the amount of data sampled, for stance detection via Flan UL2. Approximately half of all incorrect data annotations can be found by checking only the bottom 20% of data, as ranked by our confidence ensemble method. The curve also illustrates why AUC is a natural measure of uncertainty quantification performance when measuring by the percent of false labels detected.
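The curve described in Figure 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `error_discovery_curve` and `curve_auc`, the toy confidence scores, and the choice of trapezoidal integration are all assumptions made for illustration. The sketch ranks annotations from least to most confident, tracks the fraction of incorrect labels recovered as more data is reviewed, and summarizes the curve by its area.

```python
import numpy as np


def error_discovery_curve(confidence, is_incorrect):
    """Fraction of incorrect labels found vs. fraction of data reviewed.

    Items are reviewed in order of increasing confidence (least confident
    first). Both names and this exact formulation are illustrative
    assumptions, not taken from the paper.
    """
    confidence = np.asarray(confidence, dtype=float)
    is_incorrect = np.asarray(is_incorrect, dtype=bool)
    total_incorrect = int(is_incorrect.sum())
    if total_incorrect == 0:
        raise ValueError("no incorrect labels; the curve is undefined")

    # Stable sort so ties keep their original order.
    order = np.argsort(confidence, kind="stable")
    found = np.cumsum(is_incorrect[order])

    n = len(confidence)
    # Prepend the origin: reviewing 0% of the data finds 0% of the errors.
    frac_sampled = np.concatenate([[0.0], np.arange(1, n + 1) / n])
    frac_found = np.concatenate([[0.0], found / total_incorrect])
    return frac_sampled, frac_found


def curve_auc(x, y):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))


# Toy data (hypothetical): 10 annotations, 3 of them incorrect.
toy_confidence = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
toy_incorrect = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

sampled, found = error_discovery_curve(toy_confidence, toy_incorrect)
auc = curve_auc(sampled, found)
```

Under this formulation, a confidence score that carries no information about correctness would trace the diagonal on average (area near 0.5), while a score that concentrates incorrect labels among the least confident items pushes the area toward 1. That is one way to read why an area-based summary suits a human-in-the-loop review budget.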