Soft Tournament Equilibrium

Saad Alqithami

Abstract

Evaluating general-purpose artificial agents, particularly those based on large language models, is difficult because their interactions are non-transitive. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs novel, differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a calibrated membership score, providing a nuanced and robust assessment of agent capabilities. We develop the theoretical foundation for STE, proving its consistency with classical solutions in the zero-temperature limit (which establishes its Condorcet-inclusion properties) and analyzing its stability and sample complexity. We specify an experimental protocol for validating STE on both synthetic and real-world benchmarks. This work aims to provide a complete, standalone treatise that re-centers general-agent evaluation on a more appropriate and robust theoretical foundation, moving from unstable rankings to stable, set-valued equilibria.

Paper Structure

This paper contains 163 sections, 21 theorems, 81 equations, 6 figures, 12 tables, 2 algorithms.

Key Result

Theorem 5.1

Let $T$ be a tournament satisfying the strict margin assumption with margin $\delta > 0$. Let $t_0(a)$ be the indicator that agent $a$ belongs to the classical Top Cycle, and let $t_\tau(a)$ be its soft counterpart computed with temperature $\tau$ and path length $K \ge n - \ldots$

Figures (6)

  • Figure 1: Typical set relations for tournament solutions: the Uncovered Set is contained in the Top Cycle, which is a subset of the agent set $\mathcal{A}$. (The inclusions can be strict depending on the tournament structure.)
  • Figure 2: Overview of the STE pipeline. From pairwise comparisons, STE learns a calibrated probabilistic tournament, constructs a temperature-controlled soft tournament $D_\tau$, and computes differentiable approximations of the Top Cycle and Uncovered Set to produce membership scores.
  • Figure 3: Core recovery vs. cyclicity ($\rho$). Core recovery F1 as a function of the cycle-strength parameter $\rho$ in the synthetic generator. Each plotted value aggregates the repeated runs (seeds) configured in the pipeline; numeric summaries appear in Appendix \ref{app:results}.
  • Figure 4: Robustness to sparsity ($\mu$). Jaccard stability of recovered cores as the observation graph becomes sparser (smaller $\mu$). Higher indicates more stable recovery across repeated runs.
  • Figure 5: Reliability diagrams (calibration). Empirical calibration of STE membership scores when interpreted as probabilities of belonging to the indicated core. The diagonal corresponds to perfect calibration.
  • ...and 1 more figure
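
The set relation described for Figure 1 can be checked on a small example. The sketch below is illustrative only: it uses the standard textbook definitions (an agent is in the Top Cycle if it reaches every other agent along a path of wins; $a$ covers $b$ if $a$ beats $b$ and also beats everything $b$ beats), since the paper's own Definitions 3.3 to 3.6 are not reproduced in this overview. The function names and the four-agent matrix `beats` are hypothetical.

```python
from itertools import product

def top_cycle(beats):
    """Agents that reach every other agent along a directed path of wins."""
    n = len(beats)
    # reach[i][j]: i reaches j; start from direct wins plus self-reachability.
    reach = [[i == j or beats[i][j] for j in range(n)] for i in range(n)]
    # Transitive closure (Floyd-Warshall order: intermediate k is the outer loop).
    for k, i, j in product(range(n), repeat=3):
        if reach[i][k] and reach[k][j]:
            reach[i][j] = True
    return {i for i in range(n) if all(reach[i])}

def uncovered_set(beats):
    """Agents that no other agent covers."""
    n = len(beats)

    def covers(a, b):
        # a covers b: a beats b, and a beats every agent that b beats.
        return beats[a][b] and all(beats[a][c] for c in range(n) if beats[b][c])

    return {b for b in range(n)
            if not any(covers(a, b) for a in range(n) if a != b)}

# Hypothetical tournament: 0 > 1 > 2 > 0 is a cycle; 0 and 1 beat 3; 3 beats 2.
T, F = True, False
beats = [
    [F, T, F, T],
    [F, F, T, T],
    [T, F, F, F],
    [F, F, T, F],
]

tc = top_cycle(beats)
uc = uncovered_set(beats)
print("Top Cycle:", sorted(tc))
print("Uncovered Set:", sorted(uc))
```

In this example agent 3 reaches every other agent through agent 2, so it belongs to the Top Cycle, but agent 1 covers it, so the inclusion of the Uncovered Set in the Top Cycle is strict.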

Theorems & Definitions (50)

  • Definition 3.1: Probabilistic Tournament
  • Definition 3.2: Majority-Rule Tournament
  • Definition 3.3: Reachability in a Tournament
  • Definition 3.4: Top Cycle
  • Definition 3.5: Covering Relation
  • Definition 3.6: Uncovered Set
  • Definition 4.1: Soft Majority Edge
  • Definition 4.2: Soft Cover Score
  • Definition 4.3: Scalar Soft-Maximum
  • Theorem 5.1: Finite-Temperature Error Bound for Top Cycle
  • ...and 40 more
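
To make the idea of a temperature-controlled relaxation concrete, the following is a minimal sketch of one possible soft Top Cycle score. It is not the paper's construction: the sigmoid edge, the softmax-weighted `smooth_max`, the product composition along paths, and the default `K = n - 1` (enough steps for classical reachability among $n$ agents) are all assumptions made for illustration, and Definitions 4.1 to 4.3 may differ. All function names and the matrix `P` are hypothetical.

```python
import math

def soft_edge(p, tau):
    # Assumed soft majority edge: sigmoid of the margin p - 1/2 at temperature tau.
    # The tanh form equals the sigmoid and avoids overflow for small tau.
    return 0.5 * (1.0 + math.tanh((p - 0.5) / (2.0 * tau)))

def smooth_max(xs, tau):
    # Softmax-weighted mean: lies between mean(xs) and max(xs),
    # and approaches max(xs) as tau -> 0.
    m = max(xs)
    w = [math.exp((x - m) / tau) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

def smooth_min(xs, tau):
    return -smooth_max([-x for x in xs], tau)

def soft_top_cycle(P, tau, K=None):
    """Soft membership scores in [0, 1] for each agent (assumes n >= 2)."""
    n = len(P)
    K = n - 1 if K is None else K
    # Soft adjacency with a unit diagonal so that paths may "stay in place".
    D = [[1.0 if i == j else soft_edge(P[i][j], tau) for j in range(n)]
         for i in range(n)]
    R = [row[:] for row in D]  # soft reachability within 1 step
    for _ in range(K - 1):     # extend paths one step at a time, up to K steps
        R = [[smooth_max([R[i][m] * D[m][j] for m in range(n)], tau)
              for j in range(n)] for i in range(n)]
    # An agent scores high only if it softly reaches every other agent.
    return [smooth_min([R[a][j] for j in range(n) if j != a], tau)
            for a in range(n)]

# Hypothetical win probabilities: agents 0, 1, 2 form a cycle and all beat agent 3.
P = [
    [0.5, 0.8, 0.2, 0.9],
    [0.2, 0.5, 0.8, 0.9],
    [0.8, 0.2, 0.5, 0.9],
    [0.1, 0.1, 0.1, 0.5],
]

scores = soft_top_cycle(P, tau=0.02)
print([round(s, 3) for s in scores])
```

At a low temperature the three cycle members receive scores close to 1 and the dominated agent a score close to 0, which mirrors the classical Top Cycle of this tournament. Raising `tau` blurs the scores, which is the trade-off a finite-temperature error bound such as Theorem 5.1 would quantify.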