Unsupervised Contrast-Consistent Ranking with Language Models

Niklas Stoehr; Pengxiang Cheng; Jing Wang; Daniel Preotiuc-Pietro; Rajarshi Bhowmik

TL;DR

The paper tackles the problem of reliably extracting ranking knowledge from language models without supervision, showing that prompting alone can yield inconsistent rankings. It extends the unsupervised Contrast-Consistent Search (CCS) framework to Contrast-Consistent Ranking (CCR), proposing Pairwise CCR, Pointwise CCR, and Listwise CCR with corresponding loss formulations (e.g., MarginCCR, TripletCCR, OrdRegCCR). Across multiple encoder/decoder models and six ranking datasets, CCR probing often outperforms prompting for smaller models and matches prompting performance for larger models, while offering greater control and interpretability. The work demonstrates that unsupervised probing can yield robust, direction-invariant rankings and provides a foundation for more reliable in-context ranking applications in NLP systems.
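The kind of inconsistency mentioned above can be made concrete with a small check over pairwise judgments. The following is a minimal sketch, not the paper's evaluation code: `prefers(a, b)` is a hypothetical stand-in for any pairwise comparator (for example, a prompted model asked whether item `a` ranks above item `b`), and the item names and scores are made up for illustration.

```python
from itertools import combinations

def direction_inconsistencies(prefers, items):
    """Count pairs where swapping the presentation order does not flip the answer."""
    count = 0
    for a, b in combinations(items, 2):
        # A consistent comparator gives opposite answers for (a, b) and (b, a).
        if prefers(a, b) == prefers(b, a):
            count += 1
    return count

def cyclic_triples(prefers, items):
    """Count triples whose pairwise judgments form a cycle (a > b > c > a)."""
    count = 0
    for a, b, c in combinations(items, 3):
        forward = prefers(a, b) and prefers(b, c) and prefers(c, a)
        backward = prefers(a, c) and prefers(c, b) and prefers(b, a)
        if forward or backward:
            count += 1
    return count

# Hypothetical items with made-up scores, used only to build toy comparators.
scores = {"A": 3.0, "B": 2.0, "C": 1.0}
items = list(scores)

def consistent(a, b):
    # Answers purely from the underlying scores.
    return scores[a] > scores[b]

def position_biased(a, b):
    # Always prefers whichever item is shown first.
    return True
```

A comparator that answers from a fixed score yields zero violations on both checks, whereas one that always favors the first-presented item fails every pair. A ranking can only be read off unambiguously when both counts are zero.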

Abstract

Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. For instance, they may have parametric knowledge about the ordering of countries by size or may be able to rank product reviews by sentiment. We compare pairwise, pointwise and listwise prompting techniques to elicit a language model's ranking knowledge. However, we find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce. This motivates us to explore an alternative approach that is inspired by an unsupervised probing method called Contrast-Consistent Search (CCS). The idea is to train a probe guided by a logical constraint: a language model's representation of a statement and its negation must be mapped to contrastive true-false poles consistently across multiple statements. We hypothesize that similar constraints apply to ranking tasks where all items are related via consistent, pairwise or listwise comparisons. To this end, we extend the binary CCS method to Contrast-Consistent Ranking (CCR) by adapting existing ranking methods such as the Max-Margin Loss, Triplet Loss and an Ordinal Regression objective. Across different models and datasets, our results confirm that CCR probing performs better or, at least, on a par with prompting.
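The logical constraint behind CCS can be sketched as a loss with a consistency term and a confidence term. This is a minimal illustration that assumes the commonly cited CCS formulation with a linear probe followed by a sigmoid; the abstract itself does not spell out the equation, and the function names, feature vectors, and parameter values below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe(weights, bias, features):
    """Linear probe with a sigmoid, mapping a representation to a value in (0, 1)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def ccs_loss(weights, bias, pos_features, neg_features):
    """Average CCS-style loss over paired statement/negation representations.

    Assumed form: consistency pushes p(x+) toward 1 - p(x-);
    confidence discourages the degenerate solution p(x+) = p(x-) = 0.5.
    """
    total = 0.0
    for x_pos, x_neg in zip(pos_features, neg_features):
        p_pos = probe(weights, bias, x_pos)
        p_neg = probe(weights, bias, x_neg)
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pos_features)

# Toy one-dimensional "representations" of a statement and its negation.
pos = [[10.0]]
neg = [[-10.0]]

uninformative = ccs_loss([0.0], 0.0, pos, neg)  # probe outputs 0.5 for everything
separating = ccs_loss([1.0], 0.0, pos, neg)     # probe separates the two poles
```

An uninformative probe that outputs 0.5 everywhere satisfies consistency but is penalized by the confidence term, while a probe that sends the statement and its negation to opposite poles drives both terms toward zero. No labels are involved in either term, which is what makes the objective unsupervised.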

Paper Structure (30 sections, 8 equations, 6 figures, 4 tables)

Introduction
Prompting for Rankings
Pairwise Prompting.
Pointwise Prompting.
Listwise Prompting.
Unsupervised Probing for Rankings
Contrast-Consistent Search (CCS).
From Yes-No Questions to Rankings.
Pairwise CCR Probing
Pointwise CCR Probing
Listwise CCR Probing
Experimental Design
Language Models
Ranking Task Datasets
Fact-based Ranking Tasks.
...and 15 more sections

Figures (6)

Figure 1: We study pairwise, pointwise, and listwise prompting and probing for unsupervised ranking.
Figure 2: We translate the two aspects of consistency and confidence from the binary CCS objective to an ordinal multi-class setting resulting in OrdRegCCR.
Figure 3: Pairwise and listwise results of the prompting and CCR probing methods for the DeBERTa, GPT-2 and MPT-7B models, averaged over all fact-based and context-based learning datasets. Results show mean and standard deviation over 5 runs. We find that CCR probing often outperforms prompting for the same-size model. Among the CCR probing methods, TripletCCR is the best-performing. Orange bars represent ceilings of a supervised probe trained and tested on the same ranking task. As model size increases (MPT-7B), prompting performance improves.
Figure 4: CCR probing offers interpretability benefits such as the post-hoc analysis of the probe's parameters. The gray scale hue of the individual dots represents the ground truth ranking of the respective items.
Figure 5: Mean ranking results and standard deviation for all methods and datasets over 5 runs.
...and 1 more figures
