Beyond Kemeny Medians: Consensus Ranking Distributions Definition, Properties and Statistical Learning

Stephan Clémençon, Ekhine Irurozki

TL;DR

This work addresses the problem of summarizing a ranking distribution on the symmetric group $\mathfrak{S}_n$ with more than a single median. It introduces Consensus Ranking Distributions (CRD), sparse mixtures of Dirac masses centered at local medians, and measures distortion in a Wasserstein (mass transportation) framework with the Kendall $\tau$ distance as cost, which makes it possible to approximate multimodal ranking distributions. The COAST algorithm is a top-down, tree-based method that learns a CRD by recursively partitioning $\mathfrak{S}_n$ via pairwise comparisons and computing a local median in each cell, and it comes with statistical guarantees. Experiments on Mallows mixtures, anomaly detection tasks, and real data such as the Sushi dataset indicate that CRDs can reveal structure and reduce distortion relative to a global median, giving an interpretable summary of ranking data with applications in recommendation and search systems.

Abstract

In this article we develop a new method for summarizing a ranking distribution, i.e. a probability distribution on the symmetric group $\mathfrak{S}_n$, beyond the classical theory of consensus and Kemeny medians. Based on the notion of local ranking median, we introduce the concept of consensus ranking distribution (CRD), a sparse mixture model of Dirac masses on $\mathfrak{S}_n$, in order to approximate a ranking distribution with small distortion from a mass transportation perspective. We prove that by choosing the popular Kendall $\tau$ distance as the cost function, the optimal distortion can be expressed as a function of pairwise probabilities, paving the way for the development of efficient learning methods that do not suffer from the lack of vector space structure on $\mathfrak{S}_n$. In particular, we propose a top-down tree-structured statistical algorithm that allows for the progressive refinement of a CRD based on ranking data, from the Dirac mass at a Kemeny median at the root of the tree to the empirical ranking data distribution itself at the end of the tree's exhaustive growth. In addition to the theoretical arguments developed, the relevance of the algorithm is empirically supported by various numerical experiments.
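To make the role of pairwise probabilities concrete, here is a minimal Python sketch (not the authors' code). It relies on a standard identity: for a fixed ranking $\sigma$, the expected Kendall $\tau$ distance to a random ranking $\Sigma$ depends only on $p_{i,j} = P(\Sigma(i) < \Sigma(j))$. The function names, the encoding of a ranking as a tuple where `sigma[i]` is the rank of item `i`, and the brute-force search over all permutations are assumptions made for illustration; the search is only feasible for very small $n$.

```python
from itertools import combinations, permutations


def kendall_tau(sigma, pi):
    """Count the item pairs that sigma and pi rank in opposite order."""
    return sum(
        1
        for i, j in combinations(range(len(sigma)), 2)
        if (sigma[i] - sigma[j]) * (pi[i] - pi[j]) < 0
    )


def pairwise_probabilities(rankings, weights):
    """p[i, j] = probability that item i is ranked before item j (i < j)."""
    n = len(rankings[0])
    return {
        (i, j): sum(w for r, w in zip(rankings, weights) if r[i] < r[j])
        for i, j in combinations(range(n), 2)
    }


def expected_cost(sigma, p):
    """Expected Kendall tau distance to sigma, using pairwise probabilities only."""
    # If sigma puts i before j, a disagreement occurs with probability 1 - p[i, j];
    # otherwise it occurs with probability p[i, j].
    return sum(
        (1 - p_ij) if sigma[i] < sigma[j] else p_ij
        for (i, j), p_ij in p.items()
    )


def kemeny_median(p, n):
    """Brute-force Kemeny median (illustration only, O(n!) candidates)."""
    return min(permutations(range(n)), key=lambda s: expected_cost(s, p))


# Toy distribution on rankings of 3 items (hypothetical data).
rankings = [(0, 1, 2), (1, 0, 2), (2, 1, 0)]
weights = [0.6, 0.3, 0.1]
p = pairwise_probabilities(rankings, weights)
median = kemeny_median(p, 3)
```

The point of the sketch is that `expected_cost` never touches the rankings themselves, only the table `p`, which is why no vector space structure on $\mathfrak{S}_n$ is needed to evaluate a candidate median.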

Paper Structure (23 sections, 6 theorems, 64 equations, 15 figures, 1 table)

This paper contains 23 sections, 6 theorems, 64 equations, 15 figures, 1 table.

Key Result

Proposition 1

(Distortion bound) Let $\mathcal{P}$ be any partition of $\mathfrak{S}_n$ such that $P(\mathcal{C})>0$ for all $\mathcal{C}\in \mathcal{P}$. We have: [equation missing in source], where we set $V'(\mathcal{C})=V'_{P_{\mathcal{C}}}$.
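As an illustration of how a partition yields a CRD, the following Python sketch splits a toy distribution on one pairwise comparison, places a Dirac mass at a local median in each cell, and computes the transport cost of sending each cell's mass to its local median. That cost is an upper bound on the Kendall $\tau$ Wasserstein distortion because it comes from one particular coupling; it is not the statement of Proposition 1, whose displayed inequality is not available here. The names `local_median` and `crd_from_split`, the single split, and the brute-force median search are assumptions for illustration, not the COAST procedure itself.

```python
from collections import defaultdict
from itertools import combinations, permutations


def kendall_tau(sigma, pi):
    """Count the item pairs that sigma and pi rank in opposite order."""
    return sum(
        1
        for i, j in combinations(range(len(sigma)), 2)
        if (sigma[i] - sigma[j]) * (pi[i] - pi[j]) < 0
    )


def local_median(rankings, weights):
    """Brute-force minimizer of the weighted Kendall tau cost over one cell."""
    n = len(rankings[0])

    def cost(s):
        return sum(w * kendall_tau(r, s) for r, w in zip(rankings, weights))

    best = min(permutations(range(n)), key=cost)
    return best, cost(best)


def crd_from_split(rankings, weights, i, j):
    """Split on whether item i is ranked before item j; one atom per cell."""
    cells = defaultdict(list)
    for r, w in zip(rankings, weights):
        cells[r[i] < r[j]].append((r, w))
    atoms, total_cost = [], 0.0
    for members in cells.values():
        rs = [r for r, _ in members]
        ws = [w for _, w in members]
        # Weights are left unnormalized, so cost equals P(C) * E[d | C].
        median, cost = local_median(rs, ws)
        atoms.append((sum(ws), median))
        total_cost += cost
    return atoms, total_cost


# Toy bimodal distribution on rankings of 3 items (hypothetical data).
rankings = [(0, 1, 2), (0, 2, 1), (2, 1, 0), (2, 0, 1)]
weights = [0.4, 0.1, 0.4, 0.1]
_, global_cost = local_median(rankings, weights)          # single Dirac mass
atoms, split_cost = crd_from_split(rankings, weights, 0, 1)  # two-atom CRD
```

On this toy example a single median incurs an expected cost of 1.5, while the two-atom mixture obtained from one split incurs 0.2, which is the kind of distortion reduction the summary above refers to.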

Figures (15)

  • Figure 1: Pseudo-code for the COAST algorithm.
  • Figure 2: Results for the Local Depth and Anomaly Detection (a) and Mixtures of Mallows models (b, c, d).
  • Figure 3: DD-plots on local vs. global depth: Mallows model $n=10, k=4$
  • Figure 4: DD-plots on local vs. global depth: Mallows model $n=10, k=8$
  • Figure 5: DD-plots on local vs. global depth: Plackett-Luce for $n=10$, $k=4$ and $n=10$, $k=8$.
  • ...and 10 more figures

Theorems & Definitions (18)

  • Definition 1
  • Remark 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 1
  • proof
  • Remark 2
  • Proposition 2
  • Remark 3
  • ...and 8 more