Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations

Xu Han; Felix Yu; Joao Sedoc; Benjamin Van Durme

TL;DR

Obtaining robust scalar rankings over large item sets is costly but crucial for learning-to-rank (LTR) applications. The paper introduces Iterated Best-Worst Scaling (IBWS) to produce high-quality relative annotations, then evaluates direct scalar annotation methods against it and identifies a simple slider protocol as a cost-efficient alternative that closely matches IBWS. RoBERTa-based LTR models trained on slider-derived annotations rank effectively in sentiment analysis and dialogue evaluation, correlating strongly with the IBWS ground truth while reducing annotation effort. The findings offer an empirically validated approach to building reliable annotation pipelines and ranking systems for NLP tasks.

Abstract

Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements. For example, "what percent positive or negative is this product review?" When sample sizes are small, prior work has advocated for methods such as Best-Worst Scaling (BWS) as being more robust than direct ordinal annotation ("Likert scales"). Here we first introduce IBWS, which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as a best-desired outcome, we evaluate various direct assessment methods to determine which is both cost-efficient and correlates best with a large-scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.
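
The comparison the abstract describes, scoring items from best-worst judgments and then checking how well a direct rating method correlates with that ranking, can be illustrated with a minimal sketch. This is not the paper's IBWS algorithm: it uses the common BWS counting score (times chosen best minus times chosen worst, divided by appearances) and a hand-rolled Spearman correlation. The function names, the toy tuples, and the slider values are all hypothetical and exist only for illustration.

```python
from collections import Counter

def bws_scores(annotations):
    """Counting-based BWS score per item: (#best - #worst) / #appearances.

    annotations: iterable of (items, best, worst), one per annotated tuple.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical toy data: three annotators judge the same 4-item tuple.
tuples = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
    (("a", "b", "c", "d"), "b", "d"),
]
scores = bws_scores(tuples)

# Hypothetical mean slider ratings (0-100) for the same items.
slider = {"a": 90, "b": 70, "c": 35, "d": 10}

items = sorted(scores)
rho = spearman([scores[i] for i in items], [slider[i] for i in items])
print(scores, rho)
```

In this toy case the slider ratings order the items exactly as the BWS scores do, so the rank correlation is at its maximum; real annotations would typically agree less cleanly.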

Paper Structure (34 sections, 1 equation, 12 figures, 4 tables, 1 algorithm)

Introduction
Background
Direct Assessment
Pairwise Ranking
Best-Worst Scaling (BWS)
Methods
Iterated Best-Worst Scaling
Two-Column BWS Interface
Vertical-Drag BWS Interface
Learning-to-Rank Model
Pair Group Strategy
Experiments
Data
Collecting Annotations
Direct Assessment
...and 19 more sections

Figures (12)

Figure 1: Direct assessment protocols for sentiment.
Figure 2: BWS protocol on Amazon review sentiment.
Figure 3: Vert-drag BWS interface.
Figure 4: An illustration of IBWS algorithm.
Figure 5: Likert Style, dual-question Protocols.
...and 7 more figures
