Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications

Fadel M. Megahed, Ying-Ju Chen, L. Allison Jones-Farmer, Younghwa Lee, Jiawei Brooke Wang, Inez M. Zwetsloot

TL;DR

The paper tackles the absence of standardized reliability assessment for LLM-driven binary text classification by introducing a four-phase, psychometrically grounded framework (planning, data collection, reliability analysis, validity analysis). It demonstrates the framework with a financial sentiment case study across 14 LLMs, showing high intra- and inter-model reliability, with smaller models matching or exceeding larger ones in consistency and cost-efficiency, yet limited predictive validity for actual market movements. Key contributions include explicit sample-size planning for reliability metrics, a robust approach to handling invalid outputs, and practical guidance for model selection and annotation pipelines. The work provides open-source resources to foster reproducibility and encourages task-specific evaluation over model size when deploying LLMs for structured classification tasks.

Abstract

This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
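The intra-rater quantities mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `agreement_summary` and the toy data are hypothetical, and it assumes every replicate returned a valid label (the paper's handling of invalid outputs is not reproduced here). It computes the share of items with perfect agreement across replicates, the pairwise percent agreement (PA), the Brennan-Prediger coefficient (BP), and Gwet's AC1 using their standard multi-rater definitions.

```python
from collections import Counter

def agreement_summary(ratings, categories=("positive", "negative")):
    """Summarize agreement for `ratings`, a list of items where each item
    is the list of labels produced by the replicates (raters).

    Assumes (for illustration only) that every label is one of `categories`
    and that each item has at least two ratings.
    """
    q = len(categories)
    n = len(ratings)
    perfect = 0          # items on which all replicates agree
    pairwise_sum = 0.0   # running sum of per-item pairwise agreement
    share_sum = Counter()  # running sum of per-item category shares

    for item in ratings:
        r = len(item)
        counts = Counter(item)
        if len(counts) == 1:
            perfect += 1
        # Fraction of rater pairs that agree on this item.
        pairwise_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        for k in categories:
            share_sum[k] += counts[k] / r

    pa = pairwise_sum / n
    # Brennan-Prediger: chance agreement is fixed at 1/q.
    bp = (pa - 1 / q) / (1 - 1 / q)
    # Gwet's AC1: chance agreement from the average category shares.
    pi = {k: share_sum[k] / n for k in categories}
    pe_ac1 = sum(p * (1 - p) for p in pi.values()) / (q - 1)
    ac1 = (pa - pe_ac1) / (1 - pe_ac1)

    return {"perfect_agreement": perfect / n, "pa": pa, "bp": bp, "ac1": ac1}

# Toy data: four articles, five replicates each (hypothetical values).
toy = [
    ["positive"] * 5,
    ["negative"] * 5,
    ["positive"] * 4 + ["negative"],
    ["positive"] * 5,
]
summary = agreement_summary(toy)
```

On this toy input, three of the four items show perfect agreement, which is the kind of per-item statistic behind the 90-98% range reported above.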

Paper Structure

This paper contains 28 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: Our framework for evaluating LLM consistency in binary text annotation tasks.
  • Figure 2: Minimum sample sizes needed for the Gwet AC1, Brennan-Prediger coefficient (BP), and simple percent agreement (PA) for 5, 7, and 9 raters. The red value denotes the conservative choice, i.e., the maximum sample size across the three metrics.
  • Figure 3: Selected proprietary and open-weight LLMs. Within each LLM developer, the top row indicates the more capable models. The cost is per million output tokens (MTok). We provide each model's cost and features according to its provider's description.
  • Figure 4: Comparison of LLM inference times and generated sentiment distributions. An "invalid" label is assigned when the model output cannot be mapped to the required "positive" or "negative" categories, for example when the model responds with terms like "neutral" or "unsure," or when the output cannot be parsed into either label.
  • Figure 5: The distribution of the intra-LLM reliability coefficient estimates. AC1 and Fleiss' Kappa cannot be computed when all five raters (replicates) provide NA (i.e., "invalid") labels for a given news article. Hence, models with only invalid labels for certain items have no values for these metrics in the dot-plot with error bars and the coefficient estimate tables (coefficient names are abbreviated for space considerations).
  • ...and 2 more figures
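
The "invalid" label described in the Figure 4 caption can be sketched as a small mapping step. The following Python is only an illustration under stated assumptions: the helper names `map_label` and `invalid_rate` are hypothetical, and the rule used here (accept an output only if it mentions exactly one of the two required labels) is an assumed parsing rule, not necessarily the one used in the paper.

```python
import re

VALID_LABELS = ("positive", "negative")

def map_label(raw_output):
    """Map a raw model response to "positive", "negative", or "invalid".

    Assumed rule: the response is valid only if it mentions exactly one
    of the two required labels (case-insensitive, whole words).
    """
    if raw_output is None:
        return "invalid"
    tokens = re.findall(r"[a-z]+", raw_output.lower())
    found = {t for t in tokens if t in VALID_LABELS}
    if len(found) == 1:
        return found.pop()
    # Covers "neutral", "unsure", empty strings, and ambiguous answers.
    return "invalid"

def invalid_rate(raw_outputs):
    """Share of responses that could not be mapped to a required label."""
    labels = [map_label(o) for o in raw_outputs]
    return labels.count("invalid") / len(labels)

# Hypothetical raw responses from one replicate.
responses = ["Positive.", " NEGATIVE\n", "neutral", "positive or negative", ""]
mapped = [map_label(r) for r in responses]
rate = invalid_rate(responses)
```

Keeping "invalid" as an explicit third outcome, rather than dropping such responses, lets the invalid rate be reported per model alongside the agreement metrics.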