Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang

TL;DR

This work argues that common evaluation metrics underestimate the compositional reasoning capabilities of multimodal models. It introduces GroupMatch, an alternative metric that scores the best overall matching within each group, and shows that Simple Match, which overfits to the induced group matchings at test time, transfers those gains to the standard GroupScore, revealing substantial hidden capability. Building on this, the authors propose Test-Time Matching (TTM), an iterative, self-supervised procedure that uses group-induced pseudo-labels to improve performance without external data, achieving new state-of-the-art results on several benchmarks and even surpassing GPT-4.1 on MMVP-VLM and ColorSwap in particular settings. The approach yields consistent improvements across 16 dataset variants, including non-grouped and global matching scenarios, highlighting both the impact of evaluation design on perceived capability and the practical potential of test-time, matching-based self-training for advancing compositional reasoning.

Abstract

Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To address this, we introduce a group matching score that better exploits group structure and reveals substantial hidden capability in both contrastive vision-language models (VLMs) and multimodal large language models (MLLMs). Moreover, simply overfitting to the induced group matchings at test time transfers this hidden capability into higher scores under standard evaluation metrics, closing much of the reported gap. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.
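
To make the gap between the two metrics concrete, the sketch below scores a single $2 \times 2$ group both ways. It is illustrative only and rests on assumptions not spelled out on this page: the standard group score is taken to be the Winoground-style criterion (each ground-truth pair must strictly beat every other entry in its row and column), and the group matching score is taken to count a group as correct when the ground-truth matching has the strictly highest total similarity among all one-to-one matchings. The function names and the similarity values are hypothetical.

```python
from itertools import permutations


def group_score(s):
    """Assumed Winoground-style criterion: each diagonal (ground-truth) entry
    must strictly exceed every other entry in its row and in its column."""
    k = len(s)
    return all(
        s[i][i] > s[i][j] and s[i][i] > s[j][i]
        for i in range(k)
        for j in range(k)
        if j != i
    )


def group_match(s):
    """Assumed matching criterion: the ground-truth (identity) matching must
    have a strictly higher total similarity than any other one-to-one matching."""
    k = len(s)
    identity = tuple(range(k))

    def total(p):
        return sum(s[i][p[i]] for i in range(k))

    return all(
        total(identity) > total(p)
        for p in permutations(range(k))
        if p != identity
    )


# Hypothetical similarities: rows are images, columns are captions.
s = [
    [0.30, 0.35],  # image 0 slightly prefers the wrong caption
    [0.10, 0.25],  # image 1 clearly prefers the right caption
]

print(group_score(s))  # False: 0.30 < 0.35 breaks the row-wise check
print(group_match(s))  # True: 0.30 + 0.25 > 0.35 + 0.10
```

In this example a single pairwise mistake fails the group under the standard criterion, while the matching view still recovers the correct assignment because the joint evidence favors it.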

Paper Structure

This paper contains 46 sections, 6 theorems, 19 equations, 4 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

For random similarity scores $s \in \mathbb{R}^{k \times k}$, $\mathbb{P}({\mathsf{GroupScore}}(s) = 1) = \frac{(k-1)!}{(2k-1)!}$.
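
As a sanity check on the formula, the sketch below compares the closed form against brute-force enumeration for $k = 2$. It assumes that "random" means i.i.d. continuous scores, so that every ordering of the $k^2$ entries is equally likely and ties have probability zero, and that ${\mathsf{GroupScore}}(s) = 1$ requires each diagonal entry to be the strict maximum of its row and column. Both assumptions, and the function names, are ours rather than taken from this page.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def closed_form(k):
    """Probability stated in Proposition 1: (k-1)! / (2k-1)!."""
    return Fraction(factorial(k - 1), factorial(2 * k - 1))


def enumerated(k):
    """Exact probability by enumerating all rank orderings of a k x k matrix.

    Assumes i.i.d. continuous scores, so only the relative order matters.
    """
    hits = 0
    total = 0
    for ranks in permutations(range(k * k)):
        total += 1
        # Reshape the flat ranking into k rows of k entries.
        s = [ranks[i * k:(i + 1) * k] for i in range(k)]
        if all(
            s[i][i] > s[i][j] and s[i][i] > s[j][i]
            for i in range(k)
            for j in range(k)
            if j != i
        ):
            hits += 1
    return Fraction(hits, total)


print(closed_form(2), enumerated(2))  # 1/6 1/6
print(closed_form(3))                 # 1/60
```

For $k = 2$, the two diagonal entries must occupy the top two ranks, which happens in $2! \cdot 2! = 4$ of the $4! = 24$ orderings, giving $1/6$. The enumeration grows as $(k^2)!$, so it is only practical for very small $k$.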

Figures (4)

  • Figure 1: $\mathsf{Simple Match}$ and $\mathsf{TTM}$ substantially improve VLM and MLLM performance on compositional reasoning benchmarks Winoground, MMVP-VLM, and ColorSwap, achieving new performance records. We highlight: (1) $\mathsf{Simple Match}$ enables GPT-4.1 to surpass human performance on Winoground (left), and (2) $\mathsf{TTM}$ enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art (middle).
  • Figure 2: Left and middle: Matching results across different thresholds on Winoground and SugarCrepe (the Replace Relation subset) with SigLIP-B16. Right: Performance of $\mathsf{TTM}$ under different threshold schedules on Winoground with SigLIP-B16. Baseline denotes model performance without $\mathsf{TTM}$ (under ${\mathsf{GroupMatch}}$). Constant applies $\mathsf{TTM}$ with a fixed threshold $\tau_t = 2.0$. Ascend applies $\mathsf{TTM}$ with a linearly increasing schedule from $\tau_1 = 0$ to $\tau_T = 2.0$, but yields no gains as the model quickly overfits to all pseudo-labels in the first iteration. Decay applies $\mathsf{TTM}$ with a linearly decreasing schedule from $\tau_1 = 2.0$ to $\tau_T = 0$, yielding the best performance.
  • Figure 3: $\mathsf{TTM}$ results on benchmarks without metric-induced boosts: for $1 \times k$ groups, ${\mathsf{GroupMatch}}$ (and thus $\mathsf{Simple Match}$) coincides with ${\mathsf{GroupScore}}$. Left: results on four SugarCrepe subsets consisting of $1 \times 2$ groups. Middle: results on both WhatsUp subsets consisting of $1 \times 4$ groups.
  • Figure 4: Left: Raw performance of CLIP-B16 and SigLIP-B16 on Winoground under different evaluation metrics. Middle: Skyline performance of $\mathsf{TTM}$ with oracle matching on Winoground with SigLIP-B16, illustrating the upper bound achievable by $\mathsf{TTM}$. Right: Effect of the initial threshold $\tau_1$ on $\mathsf{TTM}$ performance, evaluated on Winoground with SigLIP-B16.
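
The three threshold schedules compared in Figure 2 can be sketched as follows. This is a minimal illustration that assumes plain linear interpolation between the endpoints $\tau_1$ and $\tau_T$; the helper name and the choice of $T = 5$ iterations are hypothetical, and how the threshold is applied to select pseudo-labels is not shown here.

```python
def linear_schedule(tau_start, tau_end, num_iters):
    """Return thresholds tau_1..tau_T, linearly interpolated between endpoints."""
    if num_iters == 1:
        return [tau_start]
    step = (tau_end - tau_start) / (num_iters - 1)
    return [tau_start + step * t for t in range(num_iters)]


T = 5  # hypothetical number of TTM iterations

constant = [2.0] * T                   # fixed threshold tau_t = 2.0
ascend = linear_schedule(0.0, 2.0, T)  # tau_1 = 0 up to tau_T = 2.0
decay = linear_schedule(2.0, 0.0, T)   # tau_1 = 2.0 down to tau_T = 0

print(decay)  # [2.0, 1.5, 1.0, 0.5, 0.0]
```

Under the decaying schedule, early iterations use the strictest threshold and later ones relax it, which is consistent with the caption's observation that starting at $\tau_1 = 0$ lets the model overfit to all pseudo-labels in the first iteration.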

Theorems & Definitions (11)

  • Proposition 1
  • Proposition 2
  • Remark 1
  • Proposition 2
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • ...and 1 more