Hotter and Colder: A New Approach to Annotating Sentiment, Emotions, and Bias in Icelandic Blog Comments

Steinunn Rut Friðriksdóttir, Dan Saattrup Nielsen, Hafsteinn Einarsson

TL;DR

Hotter and Colder presents a two-phase annotation workflow that combines GPT-4o mini silver labeling with targeted human gold labeling to create a high-quality, 25-task dataset of Icelandic blog comments. The approach efficiently identifies rare harmful and nuanced behaviors across ~800,000 comments, yielding 12,232 uniquely annotated items with 19,301 total annotations, and evaluates AI–human agreement using Krippendorff’s $\alpha$ and Cohen’s $\kappa$. Results show strong intra-task reliability for some attributes (e.g., disgust, sympathy) but substantial variance for nuanced concepts like mansplaining and sarcasm, highlighting cultural-context challenges. The work provides a valuable resource and annotation platform for content moderation in low-resource languages, with implications for multi-task learning, ethical alignment, and environmentally aware model development in Icelandic contexts.
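As an illustration of the agreement measure mentioned above, the following is a minimal sketch of unweighted Cohen's $\kappa$ between two label sequences, for example AI labels and human labels on the same comments. The function name and the toy labels are hypothetical, and the sketch treats labels as nominal categories; it is not taken from the paper and may differ from how the authors computed agreement.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences.

    Labels are treated as nominal categories (an assumption of this sketch;
    ordinal Likert data is often scored with a weighted variant instead).
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")

    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up binary labels (not data from the paper).
ai_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
kappa = cohens_kappa(ai_labels, human_labels)
```

In the toy example, the raters agree on 3 of 4 items (observed agreement 0.75) while chance agreement is 0.5, so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$.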

Abstract

This paper presents Hotter and Colder, a dataset designed to analyze various types of online behavior in Icelandic blog comments. Building on previous work, we used GPT-4o mini to annotate approximately 800,000 comments for 25 tasks, including sentiment analysis, emotion detection, hate speech, and group generalizations. Each comment was automatically labeled on a 5-point Likert scale. In a second annotation stage, comments with high or low probabilities of containing each examined behavior were subjected to manual revision. By leveraging crowdworkers to refine these automatically labeled comments, we ensure the quality and accuracy of our dataset, resulting in 12,232 uniquely annotated comments and 19,301 annotations. Hotter and Colder provides an essential resource for advancing research in content moderation and automatically detecting harmful online behaviors in Icelandic.
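The second-stage selection described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify how "high or low probability" is derived from the silver labels, so the sketch simply treats the extreme Likert scores (1 and 5) as the two ends, and the function name, thresholds, and per-bucket cap are all hypothetical.

```python
from collections import defaultdict


def select_for_review(silver_labels, low=1, high=5, per_bucket=2):
    """Pick comments at the extremes of the silver labels for human review.

    silver_labels: iterable of (comment_id, task, likert_score) tuples.
    low / high:    hypothetical cut-offs on the 5-point scale (assumption).
    per_bucket:    hypothetical cap on comments kept per task and extreme.
    """
    buckets = defaultdict(lambda: {"low": [], "high": []})
    for comment_id, task, score in silver_labels:
        if score <= low:
            buckets[task]["low"].append(comment_id)
        elif score >= high:
            buckets[task]["high"].append(comment_id)
        # Mid-range scores are skipped in this sketch.

    # Keep at most `per_bucket` comments per task and extreme, in input order.
    return {
        task: {side: ids[:per_bucket] for side, ids in sides.items()}
        for task, sides in buckets.items()
    }


# Made-up silver labels for illustration only.
silver = [
    ("c1", "sarcasm", 5),
    ("c2", "sarcasm", 1),
    ("c3", "sarcasm", 3),
    ("c4", "sarcasm", 5),
    ("c5", "sarcasm", 5),
    ("c6", "politeness", 1),
]
review_queue = select_for_review(silver, per_bucket=2)
```

Sampling from both ends gives annotators likely positives and likely negatives for each task, which is one way to surface rare behaviors without manually labeling the full corpus.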

Paper Structure

This paper contains 26 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Key components of the annotation platform: (left) The landing page introducing the project and its importance; (middle) The task overview dashboard displaying user progress and available annotation tasks; (right) An example of a specific annotation task (politeness assessment) showing the comment to be annotated, contextual information, and annotation options.
  • Figure 2: Distribution of AI labels on tasks that were rated from 1 to 5 on a Likert scale.
  • Figure 3: Distribution of AI labels for the sentiment analysis task.