Automated Quality Assessment for LLM-Based Complex Qualitative Coding: A Confidence-Diversity Framework

Zhilong Zhao, Yindi Liu

TL;DR

This work tackles the scalability bottleneck in AI-assisted qualitative coding by introducing a confidence-diversity framework that combines external entropy ($H_{ext}$) and risk-based confidence ($\bar{c}$) to assess quality without ground-truth validation. Validated across legal reasoning, hyperpartisan political analysis, and medical transcription classification, the approach shows that the two signals predict coding accuracy, that domain-specific weight optimization improves on single-signal baselines, and that optimized weights transfer across domains. An intelligent triage system based on cost-benefit optimization further reduces manual verification by 44.6% while sustaining quality. Together, these results offer a principled, scalable, domain-agnostic quality assurance mechanism that supports larger and more diverse qualitative analyses without sacrificing rigor.

Abstract

Computational social science lacks a scalable and reliable mechanism to assure quality for AI-assisted qualitative coding when tasks demand domain expertise and long-text reasoning, and traditional double-coding is prohibitively costly at scale. We develop and validate a dual-signal quality assessment framework that combines model confidence with inter-model consensus (external entropy) and evaluate it across legal reasoning (390 Supreme Court cases), political analysis (645 hyperpartisan articles), and medical classification (1,000 clinical transcripts). External entropy is consistently negatively associated with accuracy (r = -0.179 to -0.273, p < 0.001), while confidence is positively associated in two domains (r = 0.104 to 0.429). Weight optimization improves over single-signal baselines by 6.6-113.7% and transfers across domains (100% success), and an intelligent triage protocol reduces manual verification effort by 44.6% while maintaining quality. The framework offers a principled, domain-agnostic quality assurance mechanism that scales qualitative coding without extensive double-coding, provides actionable guidance for sampling and verification, and enables larger and more diverse corpora to be analyzed with maintained rigor.
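
To make the dual-signal idea concrete, the following is a minimal sketch of how inter-model disagreement and mean confidence could be combined into one score. It is an illustration under stated assumptions, not the authors' implementation: external entropy is taken here to be the Shannon entropy of the labels assigned by several models, it is normalized by the log of the number of models, and the names `external_entropy`, `quality_score`, `w_conf`, and `w_ent`, along with the default weights, are hypothetical.

```python
import math
from collections import Counter


def external_entropy(labels):
    """Shannon entropy (in bits) of the labels assigned by different models.

    Assumption: this is one plausible reading of "external entropy";
    the paper's exact definition is not reproduced here.
    """
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def quality_score(labels, confidences, w_conf=0.5, w_ent=0.5):
    """Weighted combination of mean confidence and inter-model agreement.

    Higher is better. The weights are hypothetical placeholders; the
    paper tunes weights per domain.
    """
    n_models = len(labels)
    # Maximum entropy occurs when every model picks a different label.
    max_h = math.log2(n_models) if n_models > 1 else 1.0
    h_norm = external_entropy(labels) / max_h
    c_bar = sum(confidences) / len(confidences)
    return w_conf * c_bar + w_ent * (1.0 - h_norm)


# Three models agree vs. three models split on one item.
agree = quality_score(["A", "A", "A"], [0.9, 0.9, 0.9])
split = quality_score(["A", "B", "C"], [0.9, 0.9, 0.9])
print(agree > split)
```

In a triage setting, items with low scores would be the natural candidates for manual verification, while high-scoring items could be accepted with lighter sampling; the thresholds for that split are a separate design choice not shown here.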

Paper Structure

This paper contains 28 sections, 7 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Weight Optimization Analysis and Cross-Domain Validation. Panel A: Cross-domain strategy performance comparison. Panel B: Optimal weights comparison across datasets. Panel C: Weight ratio analysis by domain. Panel D: 3D optimization surface for SCOTUS.
  • Figure 2: Enhanced Intelligent Quality Triage System Analysis (RQ4). Panel A: Tier performance comparison across domains. Panel B: Cost-effectiveness analysis across verification scenarios. Panel C: Statistical validation with 95% confidence intervals. Panel D: Cross-domain performance summary with quantitative averages.