Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations

Xu Han, Felix Yu, Joao Sedoc, Benjamin Van Durme

TL;DR

The paper addresses the costly yet crucial task of obtaining robust scalar rankings over large item sets for learning-to-rank (LTR) applications. It introduces Iterated Best-Worst Scaling (IBWS) to produce high-quality relative annotations and evaluates direct scalar methods, identifying a simple slider protocol as a cost-efficient alternative that closely matches IBWS. By training RoBERTa-based LTR models on slider-derived annotations, the study demonstrates effective ranking in sentiment analysis and dialogue evaluation, with strong correlations to IBWS ground truth and reduced annotation effort. The findings offer an empirically validated approach to building annotation pipelines and ranking systems for NLP tasks such as sentiment analysis and dialogue evaluation.

Abstract

Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements. For example, "what percent positive or negative is this product review?" When sample sizes are small, prior work has advocated for methods such as Best-Worst Scaling (BWS) as being more robust than direct ordinal annotation ("Likert scales"). Here we first introduce IBWS, which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as a best-desired outcome, we evaluate various direct assessment methods to determine which is both cost-efficient and correlates best with a large-scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.
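
To make the Best-Worst Scaling idea concrete, the following is a minimal sketch of the commonly used counting procedure for turning best/worst judgments into scalar scores: an item's score is the number of times it was picked as best, minus the number of times it was picked as worst, divided by the number of times it was shown. This is an illustration only and is not the paper's IBWS algorithm, whose iteration details are not given here; the function name `bws_scores` and the toy annotations are hypothetical.

```python
from collections import defaultdict


def bws_scores(annotations):
    """Score items from Best-Worst Scaling judgments by simple counting.

    `annotations` is an iterable of (items, best, worst) tuples, where
    `items` is the tuple shown to the annotator and `best` / `worst`
    are the two items the annotator selected from it.
    Returns a dict mapping each item to a score in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    # (times chosen best - times chosen worst) / times shown
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Hypothetical toy data: three annotators judge the same 4-tuple of reviews.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
    (("a", "b", "c", "d"), "b", "d"),
]
scores = bws_scores(annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here `ranking` orders the items from most to least positive. Because each judgment yields only relative information within a tuple, many tuples per item are typically needed for stable scores, which is consistent with the abstract's point that BWS-style annotation becomes expensive at scale.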

Paper Structure

This paper contains 34 sections, 1 equation, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Direct assessment protocols for sentiment.
  • Figure 2: BWS protocol on Amazon review sentiment.
  • Figure 3: Vert-drag BWS interface.
  • Figure 4: An illustration of IBWS algorithm.
  • Figure 5: Likert Style, dual-question Protocols.
  • ...and 7 more figures