OrdRankBen: A Novel Ranking Benchmark for Ordinal Relevance in NLP
Yan Wang, Lingfei Qian, Xueqing Peng, Jimin Huang, Dongji Feng
TL;DR
OrdRankBen introduces an ordinal relevance benchmark for NLP ranking to address the limitations of binary and continuous labels in capturing fine-grained ranking distinctions. It formalizes an ordinal ranking task, constructs two MSMARCO-derived datasets with distinct label distributions, and evaluates nine models spanning ranking-based LMs, general LLMs, and ranking-focused LLMs. The results indicate that ordinal-aware evaluation discriminates ranking quality more sharply. General LLMs such as GPT-mini and LInstruct perform strongly, and $nDCG$ scores are sensitive to the cutoff, which highlights the practical value of ordinal labels for real-world ranking tasks. The work releases OrdRankBen on GitHub and offers a framework for robust, fine-grained assessment of ranking systems in NLP applications.
Abstract
The evaluation of ranking tasks remains a significant challenge in natural language processing (NLP), particularly due to the lack of direct labels for results in real-world scenarios. Benchmark datasets play a crucial role by providing standardized testbeds that ensure fair comparisons, enhance reproducibility, and enable progress tracking, which facilitates rigorous assessment and continuous improvement of ranking models. Existing NLP ranking benchmarks typically use binary relevance labels or continuous relevance scores and neglect ordinal relevance scores. However, binary labels oversimplify relevance distinctions, while continuous scores lack a clear ordinal structure, making it difficult to capture nuanced ranking differences. To address these challenges, we introduce OrdRankBen, a novel benchmark designed to capture multi-granularity relevance distinctions. Unlike conventional benchmarks, OrdRankBen incorporates structured ordinal labels, enabling more precise ranking evaluations. Given the absence of suitable datasets for ordinal relevance ranking in NLP, we construct two datasets with distinct ordinal label distributions. We further evaluate models of three types on these datasets: ranking-based language models, general large language models, and ranking-focused large language models. Experimental results show that ordinal relevance modeling provides a more precise evaluation of ranking models, improving their ability to distinguish multi-granularity differences among ranked items, which is crucial for tasks that demand fine-grained relevance differentiation.
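To illustrate why binary labels can hide differences that ordinal labels expose, the following is a minimal sketch of $nDCG@k$ computed over graded relevance. The 0 to 3 label scale, the exponential gain $2^{rel} - 1$, the two example rankings, and the helper names are all assumptions made for illustration; they are not taken from the OrdRankBen datasets or evaluation code.

```python
import math


def dcg_at_k(labels, k):
    """Discounted cumulative gain over the top-k labels.

    Uses the common exponential gain 2^rel - 1 and a log2 position discount
    (an assumed choice; other gain functions exist).
    """
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(labels[:k])
    )


def ndcg_at_k(labels, k):
    """nDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def binarize(labels):
    """Collapse ordinal labels to binary: any label above 0 counts as relevant."""
    return [1 if rel > 0 else 0 for rel in labels]


# Hypothetical ordinal labels (0 = irrelevant ... 3 = highly relevant) for the
# passages returned by two systems, listed in ranked order.
ranking_a = [3, 2, 1, 0, 0]  # most relevant passage first
ranking_b = [1, 2, 3, 0, 0]  # most relevant passage third

for k in (1, 3, 5):
    print(
        f"k={k}",
        f"binary A={ndcg_at_k(binarize(ranking_a), k):.3f}",
        f"binary B={ndcg_at_k(binarize(ranking_b), k):.3f}",
        f"ordinal A={ndcg_at_k(ranking_a, k):.3f}",
        f"ordinal B={ndcg_at_k(ranking_b, k):.3f}",
    )
```

Under the binarized labels both rankings look identical, because every relevant passage is treated the same. With ordinal labels, the ranking that places the most relevant passage third scores lower, and the size of the gap depends on the cutoff $k$.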
