Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models

Raphael Tang; Xinyu Zhang; Xueguang Ma; Jimmy Lin; Ferhan Ture

TL;DR

This work tackles positional bias in LLM-driven listwise ranking by introducing permutation self-consistency (PSC): it generates multiple rankings by randomly shuffling the input list while keeping the instruction fixed, then aggregates them into a central ranking that minimizes the total Kendall tau distance to the samples. The authors prove that the central ranking is consistent under noisy observations and validate PSC on sorting and passage reranking tasks across multiple models, reporting gains over conventional inference of up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B). The improvements are robust, particularly for smaller models, and PSC consistently outperforms alternative aggregation methods such as reciprocal rank fusion. Because the shuffled prompts are independent of one another, the method is parallelizable, offering a principled, order-invariant decoding strategy for listwise ranking with black-box LLMs.
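
The aggregation step can be stated compactly. The following is a sketch of the standard definitions, not a quotation from the paper: the symbols $d_\kappa$ (Kendall tau distance) and $\bar{\sigma}$ (central ranking) are notation assumed here, with $\sigma(a)$ denoting the rank position of item $a$ and $\hat{\sigma}_1, \dots, \hat{\sigma}_m$ the rankings sampled from the shuffled prompts.

$$
d_\kappa(\sigma, \tau) = \left|\{(a, b) : a < b,\ (\sigma(a) - \sigma(b))(\tau(a) - \tau(b)) < 0\}\right|
$$

$$
\bar{\sigma} = \operatorname*{arg\,min}_{\sigma} \sum_{i=1}^{m} d_\kappa(\hat{\sigma}_i, \sigma)
$$

In words, $d_\kappa$ counts the item pairs that two rankings order differently, and $\bar{\sigma}$ is the ranking with the fewest such disagreements summed over all $m$ samples.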

Abstract

Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking. To address this, we propose permutation self-consistency, a form of self-consistency over ranking list outputs of black-box LLMs. Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias. First, given some input prompt, we repeatedly shuffle the list in the prompt and pass it through the LLM while holding the instructions the same. Next, we aggregate the resulting sample of rankings by computing the central ranking closest in distance to all of them, marginalizing out prompt order biases in the process. Theoretically, we prove the robustness of our method, showing convergence to the true ranking in the presence of random perturbations. Empirically, on five list-ranking datasets in sorting and passage reranking, our approach improves scores from conventional inference by up to 7-18% for GPT-3.5 and 8-16% for LLaMA v2 (70B), surpassing the previous state of the art in passage reranking. Our code is at https://github.com/castorini/perm-sc.
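
The two steps described above (shuffle and query, then aggregate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see their repository for that): all function names are hypothetical, `rank_fn` stands in for a black-box LLM call, `lost_in_the_middle_ranker` is a toy simulation of positional bias, and the central ranking is found by exhaustive search, which is factorial in the list length and therefore only workable for short lists.

```python
import itertools
import random
from typing import Callable, List, Sequence


def kendall_tau_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the item pairs that rankings a and b place in opposite order."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in itertools.combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def central_ranking(rankings: Sequence[Sequence[int]]) -> List[int]:
    """Return the ranking minimizing total Kendall tau distance to the samples.

    Assumption: exhaustive search over all permutations, chosen for clarity.
    """
    items = sorted(rankings[0])
    best = min(
        itertools.permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )
    return list(best)


def permutation_self_consistency(
    items: Sequence[int],
    rank_fn: Callable[[List[int]], List[int]],
    m: int = 20,
    seed: int = 0,
) -> List[int]:
    """Shuffle the input m times, rank each shuffle, and aggregate the outputs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(m):
        shuffled = list(items)
        rng.shuffle(shuffled)  # only the list order changes between calls
        samples.append(list(rank_fn(shuffled)))
    return central_ranking(samples)


def lost_in_the_middle_ranker(shuffled: List[int]) -> List[int]:
    """Toy stand-in for an LLM: sorts correctly, except that the item in the
    middle of the input is overlooked and pushed to the end of the output."""
    middle = shuffled[len(shuffled) // 2]
    rest = sorted(x for x in shuffled if x != middle)
    return rest + [middle]


if __name__ == "__main__":
    items = [3, 1, 4, 0, 2]
    print("single run:", lost_in_the_middle_ranker(list(items)))
    print("aggregated:", permutation_self_consistency(items, lost_in_the_middle_ranker))
```

Because each shuffle places a different item in the "lost" position, the individual rankings tend to make different mistakes, and the pairwise majority across samples tends to favor the correct order. The `m` calls are independent, so in practice they could be issued in parallel.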

Paper Structure (23 sections, 4 theorems, 13 equations, 8 figures, 9 tables)

Introduction
Our Approach
Preliminaries
Permutation Self-Consistency
Theoretical Guarantees
Experiments
Sorting Tasks
Passage Reranking Task
Sensitivity Analyses
Hyperparameter Studies
Rank Aggregation Comparison
Related Work and Future Directions
Conclusions
Proofs of Propositions
Detailed Experimental Setup
...and 8 more sections

Key Result

Proposition 2.1

Let there be a true ranking $\sigma$ and a sequence of i.i.d. uniformly noisy rankings $\hat{\boldsymbol\sigma} := \{\hat{\sigma}_i\}_{i=1}^m$. Suppose each noisy ranking $\hat{\sigma}_k$ has a uniformly random, nonempty concordant subset $S'_k$ with $\sigma$, and the remaining rank elements not in

Figures (8)

Figure 1: The conventional decoding process for listwise ranking with input prompt (a), language model (c), and output ranking (d). The grey item (b) is "lost in the middle" by the LLM, resulting in its misranking (e).
Figure 2: Our permutation self-consistency process. With the instruction fixed, we shuffle the input list for prompts (a), producing outputs with different mistakes. We aggregate (b) these output rankings into one (c).
Figure 3: The distribution of sorting task scores from twenty individual runs plotted against our PSC score. Our PSC outperforms the best of any individual run.
Figure 4: Distribution of "reversions" after reranking. Blues are below the observed dataset average and reds above the average. For two input list positions $i \in [1, 20]$ and $j \in (i, 20]$, $i$ indexes the rows and $j$ the columns. For example, the cell at $(1, 2)$ is the reversion of the first two input items across the dataset. Note that highly saturated colors indicate over- and under-reversion relative to other pairs in the dataset rather than in the absolute sense.
Figure 5: Quality for all datasets for various aggregate sizes and temperatures. For output rankings, we use $m=20$ as our frame of reference; for temperature, $0.0$. In the subfigure captions, $\rho$ denotes Spearman's rank correlation.
...and 3 more figures

Theorems & Definitions (7)

Definition 2.1
Proposition 2.1
Proposition 2.2
Proposition A.1 (restatement of Proposition 2.1)
Proof of Proposition A.1
Proposition A.2 (restatement of Proposition 2.2)
Proof of Proposition A.2
