Investigating the Nature of Disagreements on Mid-Scale Ratings: A Case Study on the Abstractness-Concreteness Continuum

Urban Knupleš, Diego Frassinelli, Sabine Schulte im Walde

TL;DR

The paper asks why words in the middle of the concreteness scale receive inconsistent ratings, which matters because such rating norms are widely used, including in NLP. It combines multi-modal feature analyses (sense perception, emotion, frequency, ambiguity, and free associations) with supervised classification to distinguish mid-scale from extreme words, and applies k-means clustering to per-participant rating distributions to uncover systematic disagreement patterns. The findings indicate that mid-scale words are genuine intermediates with distinct perceptual and emotional signatures, and that similar mean ratings in the mid-range mask diverse underlying disagreement patterns. Practically, the work suggests filtering or fine-tuning mid-scale targets to improve their reliability in language-modeling and annotation tasks, and it treats disagreement as informative rather than purely noisy.

Abstract

Humans tend to strongly agree on ratings on a scale for extreme cases (e.g., a CAT is judged as very concrete), but judgements on mid-scale words exhibit more disagreement. Yet, collected rating norms are heavily exploited across disciplines. Our study focuses on concreteness ratings and (i) implements correlations and supervised classification to identify salient multi-modal characteristics of mid-scale words, and (ii) applies a hard clustering to identify patterns of systematic disagreement across raters. Our results suggest either fine-tuning or filtering mid-scale target words before utilising them.
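
The hard clustering in step (ii) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a 5-point rating scale, represents each word by the proportion of raters choosing each scale point, and runs plain Lloyd-style k-means with a deterministic farthest-first initialisation. The words, their ratings, and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter

SCALE = (1, 2, 3, 4, 5)  # assumption: ratings are given on a 5-point scale


def rating_distribution(ratings):
    """Proportion of raters who chose each scale point for one word."""
    counts = Counter(ratings)
    return [counts[point] / len(ratings) for point in SCALE]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(vectors, k):
    """Deterministic initialisation: start at the first vector, then keep
    adding the vector farthest from all centroids chosen so far."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        nxt = max(vectors,
                  key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_first(vectors, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical per-rater ratings; every word has a mean rating of 3.
toy_ratings = {
    "word_a": [1, 1, 1, 1, 5, 5, 5, 5],        # raters split between extremes
    "word_b": [1, 1, 1, 2, 4, 5, 5, 5],
    "word_c": [3, 3, 3, 3, 3, 3, 2, 4],        # raters agree on the middle
    "word_d": [3, 3, 3, 3, 2, 2, 4, 4],
    "word_e": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],  # ratings spread over the scale
    "word_f": [1, 1, 2, 3, 3, 4, 5, 5],
}

vectors = [rating_distribution(r) for r in toy_ratings.values()]
labels, centroids = kmeans(vectors, k=3)
for word, label in zip(toy_ratings, labels):
    print(word, label)
```

In this toy data all six words share the same mean rating, yet the clustering separates the polarised, the agreeing, and the spread-out rating profiles, which is the kind of structure a mean score alone would hide.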

Paper Structure

This paper contains 19 sections, 13 figures, and 8 tables.

Figures (13)

  • Figure 1: Croissant plots -- Mean concreteness scores and standard deviations of ratings in Brysbaert et al. (2014).
  • Figure 2: Mean noun ratings and standard deviations overlaid with the respective sense perception scores.
  • Figure 3: Results of classifications across characteristics and mid-scale/extreme experiments. The dotted and horizontal line patterns indicate the amount of abstract and concrete nouns correctly classified.
  • Figure 4: SHAP values -- Importance of each feature for the output of the binary$_{mid/concrete}$ model (on the left) and the binary$_{mid/abstract}$ model (on the right). Extreme nouns are coded as negative, mid-scale nouns as positive.
  • Figure 5: $k$-Means clustering ($k=3$) of 500 mid-scale nouns based on original individual per-participant rating distributions. Cluster sizes are 170, 163, and 167. The heatmap shows the rating distributions of the centroid vectors.
  • ...and 8 more figures