Granting GPT-4 License and Opportunity: Enhancing Accuracy and Confidence Estimation for Few-Shot Event Detection

Steven Fincke; Adrien Bibal; Elizabeth Boschee

TL;DR

This work addresses confidence estimation in few-shot event detection using large language models by introducing License to Speculate and Opportunity (L&O) prompting. L&O expands prompts to elicit guesses, explanations, and a 1–5 confidence rating from GPT-4, without model fine-tuning or access to internal statistics. The approach yields usable confidence measures and improves F1 on select BETTER ontology topics, achieving ROC AUC up to $0.759$ and demonstrating the value of explanations for calibration. The results suggest that explicitly enabling speculation and justification in prompts can make LLM-based annotation pipelines more reliable and scalable for ontology development and silver-data generation.

Abstract

Large Language Models (LLMs) such as GPT-4 have shown enough promise in the few-shot learning context to suggest their use in generating "silver" data and refining new ontologies through iterative application and review. Such workflows become more effective with reliable confidence estimation. Unfortunately, confidence estimation is a documented weakness of models such as GPT-4, and established methods to compensate require significant additional complexity and computation. The present effort explores methods for effective confidence estimation with GPT-4, using few-shot learning for event detection in the BETTER ontology as a vehicle. The key innovation is expanding the prompt and task presented to GPT-4 to provide License to speculate when unsure and Opportunity to quantify and explain its uncertainty (L&O). This approach improves accuracy and provides usable confidence measures (0.759 AUC) with no additional machinery.
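To make the AUC figure concrete, the sketch below shows one way a ROC AUC can be computed from 1–5 confidence ratings paired with correctness judgments. It is an illustration only, not the paper's scoring code: the function names `roc_auc` and `roc_points` are hypothetical, the toy ratings are invented, and it assumes each output is simply marked correct or incorrect.

```python
from collections import Counter


def roc_auc(confidences, correct):
    """ROC AUC via pairwise comparison (Mann-Whitney); ties count as half."""
    pos = Counter(c for c, ok in zip(confidences, correct) if ok)
    neg = Counter(c for c, ok in zip(confidences, correct) if not ok)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one correct and one incorrect output")
    wins = 0.0
    for c_pos, k_pos in pos.items():
        for c_neg, k_neg in neg.items():
            if c_pos > c_neg:
                wins += k_pos * k_neg
            elif c_pos == c_neg:
                wins += 0.5 * k_pos * k_neg
    return wins / (n_pos * n_neg)


def roc_points(confidences, correct, levels=(5, 4, 3, 2, 1)):
    """One (FPR, TPR, count) point per threshold 'confidence >= level'.

    count is the number of outputs rated exactly at that level.
    """
    n_pos = sum(1 for ok in correct if ok)
    n_neg = len(correct) - n_pos
    per_level = Counter(confidences)
    points = []
    for level in levels:
        tp = sum(1 for c, ok in zip(confidences, correct) if ok and c >= level)
        fp = sum(1 for c, ok in zip(confidences, correct) if not ok and c >= level)
        points.append((fp / n_neg, tp / n_pos, per_level[level]))
    return points


# Invented toy data: a 1-5 rating and a correctness flag for each output.
confidences = [5, 5, 4, 4, 3, 3, 2, 1]
correct = [True, True, True, False, True, False, False, False]

print(roc_auc(confidences, correct))  # 0.875
for fpr, tpr, count in roc_points(confidences, correct):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f} outputs_at_level={count}")
```

Because the ratings take only five values, the ROC curve has at most five thresholds. Counting ties as half a win makes the pairwise value equal to the trapezoidal area under those points.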

Paper Structure (11 sections, 3 figures, 3 tables)

Introduction
Related Work
Data and Task
System
Task and prompting details
Scoring
Ablation studies
Discussion
Future Work
Conclusion
Limitations

Figures (3)

Figure 1: Sample prompt and output for Disease-Kills within the Disease topic.
Figure 2: AUC plot for three topics with diameter proportional to the number of outputs at the specified confidence level.
Figure 3: AUC plot for individual confidence levels for our full system and ablation variants, with diameter proportional to the number of outputs at the specified confidence level.