Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations
Xu Han, Felix Yu, Joao Sedoc, Benjamin Van Durme
TL;DR
Obtaining robust scalar rankings over large item sets is costly but crucial for learning-to-rank (LTR) applications. The paper introduces Iterated Best-Worst Scaling (IBWS) to produce high-quality relative annotations, then evaluates direct scalar annotation methods against it and identifies a simple slider protocol as a cost-efficient alternative whose results closely match IBWS. RoBERTa-based LTR models trained on slider-derived annotations rank items effectively in sentiment analysis and dialogue evaluation, correlating strongly with the IBWS ground truth while reducing annotation effort. The result is an empirically validated, scalable approach to building annotation pipelines for ranking systems in NLP.
Abstract
Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements. For example, "what percent positive or negative is this product review?" When sample sizes are small, prior work has advocated for methods such as Best-Worst Scaling (BWS) as being more robust than direct ordinal annotation ("Likert scales"). Here we first introduce Iterated Best-Worst Scaling (IBWS), which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as a best-desired outcome, we evaluate various direct assessment methods to determine which is both cost-efficient and correlates best with a large-scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.
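For readers unfamiliar with Best-Worst Scaling, the sketch below shows the standard counting procedure commonly used to turn BWS judgments into scalar scores: each item's score is the number of times it was chosen as best, minus the number of times it was chosen as worst, divided by the number of times it was shown. This illustrates plain BWS only; it is not the paper's IBWS procedure, whose iteration scheme is not described in this abstract. The function name `bws_scores` and the tuple layout of the annotations are assumptions made for illustration.

```python
from collections import defaultdict


def bws_scores(annotations):
    """Compute counting-based BWS scores in the range [-1, 1].

    `annotations` is an iterable of (items, best, worst) tuples, where
    `items` is the tuple shown to the annotator and `best` / `worst`
    are the annotator's picks from that tuple. (Hypothetical layout.)
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, b, w in annotations:
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    # score = (#best - #worst) / #appearances
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Toy example: three 4-tuples over five hypothetical review IDs.
toy_annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "e"),
    (("b", "c", "d", "e"), "b", "d"),
]
toy_scores = bws_scores(toy_annotations)
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
```

Sorting items by these scores gives a ranking; with few annotations per item the scores are coarse, which is one reason larger annotation budgets are needed for fine-grained rankings.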
