Optimal level set estimation for non-parametric tournament and crowdsourcing problems

Maximilian Graf; Alexandra Carpentier; Nicolas Verzelen

TL;DR

The paper addresses optimal level-set estimation in non-parametric tournament and crowdsourcing settings where the data matrix $M$ is bi-isotonic up to row/column permutations. It introduces SoHLoB, a polynomial-time algorithm that localizes large entries via a hierarchical ranking framework built on envelopes and multiple noisy views, achieving minimax-optimal rates for the classification loss and permutation loss up to polylog factors. A key contribution is showing minimax lower bounds that match the algorithmic guarantees, and extending the approach to multiple thresholds and finite-valued matrices, thereby indicating no computational gap in these regimes. The work also connects to noisy-sorting literature and provides a detailed algorithmic and analytical blueprint (Envelope, ScanAndUpdate, hierarchical sorting tree) for efficient level-set recovery with practical impact on allocating workers to questions in crowdsourcing and ranking players in tournaments.

Abstract

Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions. In this paper, we assume that both the experts and the questions can be ordered, namely that the matrix $M$ containing the probability that expert $i$ answers question $j$ correctly is bi-isotonic up to a permutation of its rows and columns. When $n=d$, this also encompasses the strongly stochastic transitive (SST) model from the tournament literature. Here, we focus on the relevant problem of distinguishing small entries of $M$ from large entries of $M$, which is key in crowdsourcing for efficient allocation of workers to questions. More precisely, we aim at recovering one (or several) level sets $p$ of the matrix up to a precision $h$, namely recovering, respectively, the sets of positions $(i,j)$ in $M$ such that $M_{ij}>p+h$ and $M_{ij}<p-h$. We consider, as a loss measure, the number of misclassified entries. As our main result, we construct an efficient polynomial-time algorithm that turns out to be minimax optimal for this classification problem. This contrasts sharply with the existing literature on the SST model where, for the stronger reconstruction loss, statistical-computational gaps have been conjectured. More generally, this sheds light on the nature of statistical-computational gaps for permutation models.
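The loss described in the abstract can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names `is_bi_isotonic` and `classification_loss`, the 0/1 label encoding, and the convention that entries decrease along rows and columns are choices made here, not notation from the paper. Entries with $p-h\le M_{ij}\le p+h$ are never counted as errors.

```python
import numpy as np

def is_bi_isotonic(M):
    """True if entries are non-increasing down each column and along each row.

    Orientation is an assumption of this sketch (large values in the top left).
    """
    M = np.asarray(M, dtype=float)
    return bool(np.all(np.diff(M, axis=0) <= 0) and np.all(np.diff(M, axis=1) <= 0))

def classification_loss(M, labels, p, h):
    """Count misclassified entries for level p and precision h.

    labels[i, j] = 1 claims that M[i, j] is large, 0 claims that it is small.
    An entry is an error if M > p + h but labelled 0, or M < p - h but labelled 1.
    """
    M = np.asarray(M, dtype=float)
    labels = np.asarray(labels)
    high = M > p + h
    low = M < p - h
    return int(np.sum(high & (labels == 0)) + np.sum(low & (labels == 1)))

# Toy 3 x 3 example (hypothetical numbers), already sorted, i.e. identity permutations.
M = np.array([[0.9, 0.8, 0.5],
              [0.8, 0.5, 0.2],
              [0.5, 0.2, 0.1]])
p, h = 0.5, 0.2
labels = np.array([[1, 1, 1],
                   [1, 0, 0],
                   [1, 1, 0]])

print(is_bi_isotonic(M))                     # the toy matrix satisfies the shape constraint
print(classification_loss(M, labels, p, h))  # only entry (2, 1) is wrongly labelled as large
```

In the setting of the paper the rows and columns are additionally permuted and only noisy, partial observations are available, so the labels would come from an estimator rather than from $M$ itself.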

Paper Structure (66 sections, 24 theorems, 245 equations, 5 figures, 5 algorithms)

Introduction
Localizing large entries of $M$
Our contribution
Organization and notation
Preliminaries
Problem formulation
Permutation loss and reduction
Main results
Minimax lower bound
Permutation and classification matrix estimation
Description of the ranking algorithm
Intuition behind $\texttt{SoHLoB}$
Active set $Q^*(E)$ of questions for a set $E$ of experts.
Intuition behind the rate $\sigma^2 (n\vee d)/(\lambda_0 h^2)$.
Definition of the estimators and preliminaries
...and 51 more sections

Key Result

Theorem 3.1

There exist universal constants $c,c',c''>0$, such that the following holds for any $\sigma>0$, $\lambda_0$, $p\in [0,1]$, and $h\in (0,\min(p,1-p))$, and $n$, $d$ such that $n\vee d\geq 2$. If $\lambda_0 h^2\leq c\sigma^2$, then

Figures (5)

Figure 1: Illustration of two bi-isotonic matrices $M\in\mathbb{C}_{\mathrm{Biso}}(\mathrm{id}_{\lbrack n\rbrack},\mathrm{id}_{\lbrack d\rbrack})$ so that $\pi=\mathrm{id}_{\lbrack n\rbrack}$ and $\eta= \mathrm{id}_{\lbrack d\rbrack}$. These matrices take two values $p+h$ (red) and $p-h$ (blue).
Figure 2: Illustration of some bi-isotonic matrix $M\in \mathbb{C}_{\mathrm{Biso}}(\mathrm{id}_{\lbrack n\rbrack},\mathrm{id}_{\lbrack d\rbrack})$ so that $\pi=\mathrm{id}_{\lbrack n\rbrack}$ and $\eta= \mathrm{id}_{\lbrack d\rbrack}$. The two purple curves divide the matrix into three areas: values that are at least $p+h$ (light red background); values that are at most $p-h$ (light blue background); and values that are between $p-h$ and $p+h$ (violet background). We construct sets $Q'$, first using Envelope (\ref{Subfig:Envelopes}) and then in ScanAndUpdate (\ref{Subfig:Scan}).
Figure 3: Illustration of a sorting tree with $K=3$ iterations. The red nodes correspond to active sets, on which we apply Algorithm GraphTrisect, the trisection. The blue nodes correspond to passive sets, which are carried over unchanged into all further levels of the tree.
Figure 4: Illustration of a part of $M_{E,Q}$. The dotted curves separate two areas of the matrix: the area where the entries are at least $p+h$ (top left) and the area where they are at most $p-h$ (bottom right). The curve in between separates values that are at least $p$ from values that are smaller than $p$. Question $\overline{j}$ is the "last" question for which the median expert $\overline{i}$ has value at least $p$. Our algorithm relies on two detection steps. First, we use $\tilde{Y}^{(a)}$ and the corresponding column averages $\overline{y}_{E}(j)$ to detect areas of interest left and right of $\overline{j}$ (see \ref{Eq:DefSubsets} for a definition of these sets). Then, we detect from $\tilde{Y}^{(b)}$ whether an expert is above or below $\overline{i}$, and for that purpose we focus on the following sets of questions. First, $A$ corresponds to the questions for which we cannot detect from our observation whether they are left or right of $\overline{j}$. Second, $\underline{R}$ contains questions that are provably right of $\overline{j}$, but $\underline{R}$ is too small for reliably detecting experts above $\overline{i}$ from the given observation $\tilde{Y}^{(b)}$. We can, however, detect those experts by relying on averages over the larger sets of questions $R$ and $\overline{R}$. As a consequence, experts that differ from $\overline{i}$ on these whole areas (in particular, experts whose values are all at least $p+h$) are assigned to $\overline{O}$ by \ref{Def:Conservative_Trisection}. Following this, experts that remained in $\overline{P}$ cannot "perform better" than $p+h$ on every question in $R\setminus \underline{R}$, and consequently there exists $j_R\in R$ with $M_{ij_R}<p+h$ for all $i\in \overline{P}$. By bi-isotonicity, only questions $j< j_R$ can contribute to the error we want to bound, which is why we extend our analysis to $\Delta_R=\overline{R}\setminus R$.
Figure 5: Illustration of some bi-isotonic matrix $M\in \mathbb{C}_{\mathrm{Biso}}(\mathrm{id}_{\lbrack n\rbrack},\mathrm{id}_{\lbrack d\rbrack})$. The part in the top left (light red background) corresponds to matrix values $\geq p+h$, the part on the bottom right (light blue background) to matrix values $\leq p-h$, and the area in between (purple background) to values in $(p-h,p+h)$. Assume we have $E_{\underline s}$, $E_s$, $E_{\overline{s}}\in \mathcal{E}$ such that each of the sets is larger than $4\rho^2/\lambda_1 h^2$ and $\underline i < i<\overline i$ for $\underline i\in E_{\underline s}$, $i\in E_s$ and $\overline i \in E_{\overline s}$. We are interested in estimating the questions $Q^*(E_s)$, which correspond to the gray area. To do so, we detect sets that contain all questions on which the success probabilities are at least $p+h$ for all experts in $E_{\underline s}$ (red dashed area) and at most $p-h$ for all experts in $E_{\overline s}$ (blue dashed area). Intersecting these sets yields $Q_s$.
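
The two-valued matrices of Figure 1 are easy to generate, which may help when reading the captions above. The following is a minimal sketch, assuming the orientation of Figure 5 (large values in the top left); the helper `two_valued_bi_isotonic`, its `boundary` argument, and the chosen numbers are hypothetical and introduced only for illustration. The last lines hide the ordering behind permutations $\pi$ and $\eta$, as in the model where $M$ is bi-isotonic only up to a permutation of its rows and columns.

```python
import numpy as np

def two_valued_bi_isotonic(n, d, boundary, p, h):
    """n x d matrix equal to p + h on the first boundary[i] columns of row i, else p - h.

    A non-increasing `boundary` makes the matrix non-increasing along rows and columns.
    """
    boundary = np.asarray(boundary)
    if boundary.shape != (n,) or np.any(np.diff(boundary) > 0):
        raise ValueError("boundary must have length n and be non-increasing")
    cols = np.arange(d)
    # Broadcasting compares every column index with the boundary of every row.
    return np.where(cols[None, :] < boundary[:, None], p + h, p - h)

rng = np.random.default_rng(0)
M = two_valued_bi_isotonic(4, 5, boundary=[5, 3, 3, 1], p=0.5, h=0.1)

# Hide the ordering: row a of M_shuffled is row pi[a] of M, similarly for columns.
pi = rng.permutation(4)
eta = rng.permutation(5)
M_shuffled = M[np.ix_(pi, eta)]

# Knowing pi and eta, the sorted matrix is recovered by inverting both permutations.
restored = M_shuffled[np.ix_(np.argsort(pi), np.argsort(eta))]
print(np.array_equal(restored, M))
```

A statistician only sees noisy, partial observations of `M_shuffled`; the sketch stops at the noiseless matrix because the observation model is not spelled out in this summary.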

Theorems & Definitions (46)

Theorem 3.1
Theorem 3.2
Corollary 3.3
Remark 3.4
Theorem 3.5
Proposition 5.1
Lemma A.1
Theorem A.2
Proof of Theorem \ref{Thm:ErrorBound}
Lemma B.1
...and 36 more
