Dodgersort: Uncertainty-Aware VLM-Guided Human-in-the-Loop Pairwise Ranking

Yujin Park, Haejun Chung, Ikbeom Jang

Abstract

Pairwise comparison labeling is gaining traction because it yields higher inter-rater reliability than conventional classification labeling, but exhaustive comparison has quadratic cost. We propose Dodgersort, which combines CLIP-based hierarchical pre-ordering, a neural ranking head, a probabilistic ensemble (Elo, BTL, GP), epistemic--aleatoric uncertainty decomposition, and information-theoretic pair selection. It reduces the number of human comparisons while improving the reliability of the resulting rankings. On visual ranking tasks in medical imaging, historical dating, and aesthetics, Dodgersort achieves an 11--16\% annotation reduction while improving inter-rater reliability. Cross-domain ablations across four datasets show that neural adaptation and ensemble uncertainty are key to this gain. On FG-NET, which has ground-truth ages, the framework extracts 5--20$\times$ more ranking information per comparison than baselines, yielding Pareto-optimal accuracy--efficiency trade-offs.

Paper Structure

This paper contains 16 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Pipeline of Dodgersort. Phase 1: VLM hierarchical pre-ordering produces a coarse ranking. Phase 2: The neural ranking head and the probabilistic ensemble (GP, Elo, BTL) are initialized. Phase 3: An uncertainty-guided MergeSort loop resolves confident pairs automatically and queries a human for the rest, updating the ensemble with the feedback to produce the final ranking $\mathcal{R}$.
  • Figure 2: Hierarchical VLM pre-ordering. Left: Prompt structure with $B$ bins. Center: CLIP assigns soft bin probabilities $\{p_{ib}\}$ via image-text cosine similarities. Right: Elo rating trajectories show that proper bin initialization effectively accelerates convergence.
  • Figure 3: Ablation study across four domains. The full system (green) achieves Pareto-superior trade-offs between accuracy and human comparison cost. Each key component (neural ranking head, ensemble, smart selection) contributes to this performance.
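
The soft bin assignment described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the probabilities $\{p_{ib}\}$ come from a temperature-scaled softmax over image-text cosine similarities, and that the expected bin index serves as a coarse pre-ordering score. The function names, the temperature value, and the random stand-in embeddings are all hypothetical; in practice the embeddings would come from a CLIP image encoder and from text prompts, one per bin.

```python
import numpy as np


def soft_bin_probabilities(image_emb, text_emb, temperature=0.07):
    """Softmax over image-text cosine similarities (assumed form of p_ib).

    image_emb: (N, D) array, one embedding per image.
    text_emb:  (B, D) array, one prompt embedding per bin.
    Returns an (N, B) array whose rows sum to 1.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature  # temperature is an assumed value
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def expected_bin(probs):
    """Expected bin index per image, used here as a coarse ranking score."""
    return probs @ np.arange(probs.shape[1])


# Random vectors stand in for CLIP embeddings (assumption for illustration).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(6, 32))  # N = 6 images
text_emb = rng.normal(size=(4, 32))   # B = 4 bins

p = soft_bin_probabilities(image_emb, text_emb)
scores = expected_bin(p)
coarse_order = np.argsort(scores)  # image indices from lowest to highest bin
```

Under these assumptions, `coarse_order` plays the role of the Phase 1 coarse ranking, and the spread of each row of `p` gives a rough indication of how ambiguous an image's bin assignment is.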