Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications
Fadel M. Megahed, Ying-Ju Chen, L. Allison Jones-Farmer, Younghwa Lee, Jiawei Brooke Wang, Inez M. Zwetsloot
TL;DR
The paper addresses the absence of standardized reliability assessment for LLM-driven binary text classification by introducing a four-phase, psychometrically grounded framework (planning, data collection, reliability analysis, validity analysis). It demonstrates the framework with a financial sentiment case study across 14 LLMs. The models show high intra- and inter-model reliability, and smaller models match or exceed larger ones in consistency and cost-efficiency, yet all models perform at chance when predicting actual market movements. Key contributions include explicit sample-size planning for reliability metrics, a robust approach to handling invalid outputs, and practical guidance for model selection and annotation pipelines. The work provides open-source resources to foster reproducibility and encourages choosing models through task-specific evaluation rather than by model size when deploying LLMs for structured classification tasks.
Abstract
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification, addressing the lack of established reliability assessment methods. Adapting psychometric principles, we determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability. Our case study examines financial news sentiment classification across 14 LLMs (including claude-3-7-sonnet, gpt-4o, deepseek-r1, gemma3, llama3.2, phi4, and command-r-plus), with five replicates per model on 1,350 articles. Models demonstrated high intra-rater consistency, achieving perfect agreement on 90-98% of examples, with minimal differences between expensive and economical models from the same families. When validated against StockNewsAPI labels, models achieved strong performance (accuracy 0.76-0.88), with smaller models like gemma3:1B, llama3.2:3B, and claude-3-5-haiku outperforming larger counterparts. All models performed at chance when predicting actual market movements, indicating task constraints rather than model limitations. Our framework provides systematic guidance for LLM selection, sample size planning, and reliability assessment, enabling organizations to optimize resources for classification tasks.
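The intra-rater figures quoted in the abstract (perfect agreement on 90-98% of examples across five replicates) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The label set, the `normalize` helper, the toy data, and the choice to count any item with an invalid reply as non-agreeing are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Assumed binary label set; the paper's exact label strings may differ.
VALID_LABELS = {"positive", "negative"}


def normalize(raw: str) -> Optional[str]:
    """Map a raw model reply to a valid label, or None if it is invalid."""
    label = raw.strip().lower()
    return label if label in VALID_LABELS else None


def invalid_rate(replicates: Sequence[Sequence[str]]) -> float:
    """Fraction of all replies that do not map to a valid label."""
    total = sum(len(item) for item in replicates)
    invalid = sum(normalize(r) is None for item in replicates for r in item)
    return invalid / total if total else 0.0


def perfect_agreement_rate(replicates: Sequence[Sequence[str]]) -> float:
    """Share of items whose replicates are all valid and all identical.

    Assumption: an item with any invalid reply counts as non-agreeing.
    """
    if not replicates:
        return 0.0
    perfect = 0
    for item in replicates:
        labels = [normalize(r) for r in item]
        if None not in labels and len(set(labels)) == 1:
            perfect += 1
    return perfect / len(replicates)


# Toy data: 4 articles, 5 replicate replies each from one hypothetical model.
example = [
    ["positive"] * 5,                                             # perfect agreement
    ["negative", " Negative ", "NEGATIVE", "negative", "negative"],  # perfect after cleanup
    ["positive", "positive", "positive", "positive", "negative"],    # one disagreement
    ["positive", "positive", "neutral", "positive", "positive"],     # one invalid reply
]

print(f"perfect agreement: {perfect_agreement_rate(example):.2f}")
print(f"invalid replies:   {invalid_rate(example):.2f}")
```

Reporting the invalid-reply rate next to the agreement rate keeps the two failure modes separate: a model can be consistent yet often off-format, or well-formatted yet unstable across replicates.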
