It's Difficult to be Neutral -- Human and LLM-based Sentiment Annotation of Patient Comments
Petter Mæhlum, David Samuel, Rebecka Maria Norman, Elma Jelin, Øyvind Andresen Bjertnæs, Lilja Øvrelid, Erik Velldal
TL;DR
This work addresses the challenge of extracting sentiment from Norwegian patient feedback by building a manually annotated in-domain corpus and evaluating large language models as substitutes for human annotators. It defines a four-class sentiment scheme (positive, negative, mixed, neutral) applied at both comment and sentence level, and measures human inter-annotator agreement to establish a quality baseline. The authors test two open-source Norwegian LLMs, ChatNorT5 and NorMistral, classifying sentiment with likelihood-based scoring and carefully designed prompts in zero- and few-shot settings. The models perform strongly on binary sentiment but struggle persistently with neutral and mixed cases. LLMs can thus match or approach human performance on simple binary tasks and offer a privacy-preserving, scalable option, but human annotation remains essential for reliable full-spectrum sentiment analysis of sensitive health data. The study also highlights prompt sensitivity, differences between models, and privacy considerations, all of which matter when deploying sentiment analysis in healthcare and when improving LLM-based annotation. Open-source, locally runnable models emerge as a prudent choice for mitigating privacy and bias concerns while enabling iterative, domain-adaptive annotation workflows.
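As a rough illustration of the likelihood-based scoring mentioned above, the following minimal sketch picks the label whose text the model considers most likely as a continuation of the prompt. It is not the authors' implementation: the function names, the `token_log_probs` callback, and the toy scores are hypothetical, and the choice to sum token log-probabilities without length normalization is an assumption made for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

# The four classes of the annotation scheme described above.
LABELS = ("positive", "negative", "mixed", "neutral")

# Hypothetical scorer type: given a prompt and a candidate continuation,
# return the log-probability of each continuation token under the model.
TokenScorer = Callable[[str, str], Sequence[float]]


def classify_by_likelihood(
    prompt: str, labels: Sequence[str], token_log_probs: TokenScorer
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate label and return the most likely one.

    Assumption: a label's score is the sum of its token log-probabilities.
    Other choices (e.g. averaging per token) are equally plausible.
    """
    scores = {label: sum(token_log_probs(prompt, label)) for label in labels}
    best = max(scores, key=scores.get)
    return best, scores


def toy_token_log_probs(prompt: str, continuation: str) -> Sequence[float]:
    """Stand-in for a real model; the numbers are invented for the demo."""
    table = {
        "positive": [-0.2, -0.1],
        "negative": [-1.5],
        "mixed": [-2.0, -0.5],
        "neutral": [-1.0],
    }
    return table[continuation]


if __name__ == "__main__":
    prompt = "Comment: The staff were kind and helpful.\nSentiment:"
    label, scores = classify_by_likelihood(prompt, LABELS, toy_token_log_probs)
    print(label, scores)
```

In practice, `toy_token_log_probs` would be replaced by a call to a locally hosted model that returns token-level log-probabilities for the label text given the prompt. Restricting the decision to a fixed label set in this way avoids having to parse free-form generated text.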
Abstract
Sentiment analysis is an important tool for aggregating patient voices, in order to provide targeted improvements in healthcare services. A prerequisite for this is the availability of in-domain data annotated for sentiment. This article documents an effort to add sentiment annotations to free-text comments in patient surveys collected by the Norwegian Institute of Public Health (NIPH). However, annotation can be a time-consuming and resource-intensive process, particularly when it requires domain expertise. We therefore also evaluate a possible alternative to human annotation, using large language models (LLMs) as annotators. We perform an extensive evaluation of the approach for two openly available pretrained LLMs for Norwegian, experimenting with different configurations of prompts and in-context learning, comparing their performance to human annotators. We find that even for zero-shot runs, models perform well above the baseline for binary sentiment, but still cannot compete with human annotators on the full dataset.
