Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

Neha Sharma, Navneet Agarwal, Kairit Sirts

TL;DR

This work tackles the inherent subjectivity of cognitive distortion detection by using multiple independent LLM runs to generate internally consistent annotations, demonstrating that recurrence across runs yields reliable labels and improves downstream task performance. It introduces a dataset-agnostic evaluation framework based on Cohen's kappa to enable fair cross-dataset comparisons, addressing variability in dataset size and label distributions. The key findings show that GPT-4-derived annotations achieve high inter-run reliability (Fleiss's Kappa around $0.78$) and that models trained on these labels outperform those trained on human gold labels across multiple tasks, with a principled $\kappa_{F1}$ metric enabling cross-study comparability. Together, these contributions offer a scalable workflow for reliable annotation in subjective NLP tasks and a principled evaluation approach for heterogeneous datasets.

Abstract

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
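As an illustration of how agreement across repeated LLM runs can be quantified, the sketch below computes Fleiss's kappa while treating each independent run as one "rater". It is a minimal example of the standard formula, not the authors' code: the function name `fleiss_kappa`, the input layout (one list of labels per text, one label per run), and the toy labels are assumptions made here for illustration.

```python
from collections import Counter
import math


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of labels (one per run).

    Hypothetical helper: assumes every item was labeled by the same number
    of runs (at least two) and that each run assigns exactly one label.
    """
    n_items = len(ratings)
    n_runs = len(ratings[0])
    if n_runs < 2 or any(len(item) != n_runs for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: for each item, the share of run pairs that agree.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_runs) / (n_runs * (n_runs - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall label proportions.
    total = n_items * n_runs
    p_expected = sum(
        (sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories
    )

    if math.isclose(p_expected, 1.0):
        # Only one label was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (invented for illustration): 4 texts, 3 runs each.
# FT = Fortune Telling, ND = No Distortion, ER = Emotional Reasoning.
toy_runs = [
    ["FT", "FT", "FT"],
    ["ND", "ND", "ER"],
    ["ER", "ER", "ER"],
    ["ND", "ND", "ND"],
]
kappa = fleiss_kappa(toy_runs)  # 35/47, roughly 0.745
```

On this toy input, three of the four texts receive identical labels in every run, and the single disagreement lowers kappa to about 0.745. Values near 1 indicate that the runs label texts almost identically after correcting for chance.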

Paper Structure

This paper contains 35 sections, 17 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Example illustrating CD annotations for one user input across different GPT configurations, prompt types, and runs. (Here: pers. = Personalization, FT = Fortune Telling, ND = No Distortion, ER = Emotional Reasoning)
  • Figure 2: Distribution of maximum label repetitions across all configurations.