Evaluating Perceptual Distance Models by Fitting Binomial Distributions to Two-Alternative Forced Choice Data

Alexander Hepburn, Raul Santos-Rodriguez, Javier Portilla

TL;DR

The paper addresses the challenge of evaluating perceptual distance models using 2AFC data in large, randomly composed datasets like BAPPS. It introduces a pure probabilistic framework that treats observer decisions as a binomial process with a distance-dependent probability $P(d_0,d_1)$, estimated via kernel-smoothed density estimation and marginal uniformisation, and cross-validated against neural-network baselines. The approach provides simple, interpretable metrics (AJ and NLL) and remains robust to varying numbers of judgements per triplet, yielding performance on par with neural networks but with far fewer parameters and training requirements. Applied to multiple perceptual distances, the method reproduces known rankings and offers richer diagnostics, supporting scalable, principled evaluation of perceptual distance models. The work also demonstrates applicability to datasets with variable $M_t$ (e.g., CLIC) and highlights the practical impact of transparent, likelihood-based evaluation for advancing perceptual similarity metrics.
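
The binomial view of observer decisions can be illustrated with a short sketch. It assumes that, for a triplet judged by $M_t$ observers, $j$ counts how many chose one fixed option and that each choice is made independently with probability $P(d_0,d_1)$; which option $j$ counts, and the helper names `binomial_nll` and `mean_nll`, are assumptions made for illustration rather than the paper's own definitions.

```python
import math


def binomial_nll(j, m, p, eps=1e-12):
    """Negative log-likelihood of j out of m observers choosing the counted
    option, when each observer chooses it independently with probability p."""
    # Clip p away from 0 and 1 so the logarithms stay finite (illustrative choice).
    p = min(max(p, eps), 1.0 - eps)
    # log of the binomial coefficient C(m, j), computed via log-gamma.
    log_coeff = math.lgamma(m + 1) - math.lgamma(j + 1) - math.lgamma(m - j + 1)
    return -(log_coeff + j * math.log(p) + (m - j) * math.log(1.0 - p))


def mean_nll(judgements, probs):
    """Average NLL over triplets.

    judgements: list of (j, m) pairs, one per triplet; m may differ per triplet.
    probs: list of predicted probabilities P(d0, d1), one per triplet.
    """
    total = sum(binomial_nll(j, m, p) for (j, m), p in zip(judgements, probs))
    return total / len(judgements)
```

Because `m` is passed per triplet, the same function covers datasets with a fixed number of judgements and those where it varies between triplets.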

Abstract

The Two-Alternative Forced Choice (2AFC) paradigm offers advantages over the Mean Opinion Score (MOS) paradigm in psychophysics (PF), such as simplicity and robustness. However, when evaluating perceptual distance models, MOS enables direct correlation between model predictions and PF data. In contrast, 2AFC only allows pairwise comparisons to be converted into a quality ranking similar to MOS when comparisons include shared images. In large datasets, like BAPPS, where image patches and distortions are combined randomly, deriving rankings from 2AFC PF data becomes infeasible, as the distorted images included in each comparison are independent. To address this, instead of relying on MOS correlation, researchers have trained ad-hoc neural networks to reproduce 2AFC PF data based on pairs of model distances, a black-box approach with conceptual and operational limitations. This paper introduces a more robust distance-model evaluation method using a pure probabilistic approach, applying maximum likelihood estimation to a binomial decision model. Our method demonstrates superior simplicity, interpretability, flexibility, and computational efficiency, as shown through evaluations of various visual distance models on two 2AFC PF datasets.

Paper Structure (27 sections, 9 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: An example using 5 data points, $M=2$ for each $j=\{0, 1, 2\}$. (a) Samples with a Gaussian kernel applied (the circle represents the standard deviation) and a $10\times10$ grid for which we estimate Eq. \ref{eq:conditional}. (b), (c) and (d) are the estimated conditional distributions for each value of $j$ with $M=2$, and (e) is the distribution after maximum likelihood estimation according to Eq. \ref{eq:maximised_likelihood}.
  • Figure 2: Binomial parameter $P$ estimated from the BAPPS training set for different perceptual distance models using (a) Density estimation and (b) Neural network.
  • Figure 3: Example of evaluating the negative log-likelihood (NLL) for $j \in [0, 5]$ according to DISTS for a triplet from the BAPPS test set where one image $\mathbf{x}_0$ is close to the reference $\mathbf{x}_{\text{ref}}$. $j=0$ denotes all 5 participants selecting $\mathbf{x}_0$ as closer to $\mathbf{x}_{\text{ref}}$. For NLL, white is more likely and blue is less likely.
  • Figure 4: Variability of the density estimation method with relation to (a) the width of the Gaussian kernel $\sigma$ and (b) the number of partitions in the grid used to estimate $\hat{P}(d_0, d_1)$. In each subplot, left is NLL (Eq. \ref{eq:maximised_likelihood}), right is AJ (Eq. \ref{eq:agreement}).
  • Figure 5: Scatter plot of candidate distances in their original space (top row) and uniformised (bottom row). Shown are the training samples from the BAPPS dataset; the colour indicates the judgement assigned to each triplet according to 2 observers.
  • ...and 5 more figures
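
The grid-based estimate described in the captions of Figures 1 and 4 can be sketched as follows. This is a minimal illustration, not the paper's exact equations: it assumes the distances have already been mapped to $[0,1]$ (for example by marginal uniformisation), weights each training triplet with an isotropic Gaussian kernel of width $\sigma$, and takes the kernel-weighted binomial maximum-likelihood estimate $\hat{P} = \sum_t w_t j_t / \sum_t w_t M_t$ at each grid centre. The function name `estimate_p_on_grid`, its parameters, and the fallback value of 0.5 are hypothetical.

```python
import math


def estimate_p_on_grid(d0, d1, j, m, sigma=0.1, n_grid=10):
    """Kernel-weighted binomial MLE of P(d0, d1) on an n_grid x n_grid grid.

    d0, d1: distances per triplet, assumed to lie in [0, 1].
    j, m:   per triplet, the count for the tracked option and the number of judgements.
    Returns grid[a][b], the estimate at (d0 = centre a, d1 = centre b).
    """
    centres = [(k + 0.5) / n_grid for k in range(n_grid)]
    grid = []
    for g0 in centres:
        row = []
        for g1 in centres:
            num = den = 0.0
            for x0, x1, jt, mt in zip(d0, d1, j, m):
                # Gaussian kernel weight of this triplet at the grid centre.
                sq_dist = (g0 - x0) ** 2 + (g1 - x1) ** 2
                w = math.exp(-sq_dist / (2.0 * sigma ** 2))
                num += w * jt
                den += w * mt
            # Fall back to 0.5 (no preference) if no triplet carries any weight.
            row.append(num / den if den > 0 else 0.5)
        grid.append(row)
    return grid
```

A larger `sigma` smooths the estimate across more neighbouring triplets, and a larger `n_grid` gives a finer lookup table; these are the two settings whose effect Figure 4 reports.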