CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin

Abstract

Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the capacity to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation of state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieves only 40.4% accuracy, far below the human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) failing to decompose visual changes into symbolic rules, and (2) failing to maintain robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.
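The compositional step described above, extracting an atomic rule from each image pair and applying their union to a query, can be sketched symbolically. The sketch below is illustrative only: the property names and values (`color`, `number`, etc.) echo the paper's wood-to-red / one-to-two example but are not drawn from the CARV dataset, and real CARV inputs are images, not dictionaries.

```python
# Illustrative sketch of compositional analogy over symbolic rules.
# An atomic rule is a (property, source_value, target_value) triple;
# each image pair induces the set of properties that changed.

def extract_rules(before: dict, after: dict) -> set:
    """Atomic rules are the properties whose values differ across the pair."""
    return {(k, before[k], after[k]) for k in before if before[k] != after[k]}

def apply_rules(query: dict, rules: set) -> dict:
    """Apply every atomic rule whose source value matches the query."""
    result = dict(query)
    for prop, src, tgt in rules:
        if result.get(prop) == src:
            result[prop] = tgt
    return result

# Pair 1 changes only color (wood -> red); pair 2 changes only number (one -> two).
rules = extract_rules({"color": "wood", "number": "one"},
                      {"color": "red", "number": "one"})
rules |= extract_rules({"color": "wood", "number": "one"},
                       {"color": "wood", "number": "two"})

answer = apply_rules({"color": "wood", "number": "one"}, rules)
print(answer)  # -> {'color': 'red', 'number': 'two'}
```

Here the composition is a plain set union of atomic rules, mirroring the union operation the benchmark asks models to perform across pairs.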

Paper Structure

This paper contains 28 sections, 1 equation, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Examples of single-step and compositional analogy. The single-step analogy (left) applies one transformation (color) from input to target, while the compositional analogy (right) applies the union of the color (wood to red) and number (one to two) transformations.
  • Figure 2: Overview of Diagnosis Pipeline. Given two image pairs, the model is instructed to describe the transformations, decompose the transformations into atomic transformations, perform the set operation, and apply the resulting transformation on the query image. Then we apply an evaluator model to check the correctness of each step.
  • Figure 3: Detailed analysis of failure distributions. (a) Across different models, the major bottleneck for closed-source models is decomposition, while for open-source models, it is perception. (b) As we scale the number of atomic transformations ($N$) in image pairs, the portion of decomposition failure significantly increases.
  • Figure 4: Failure Contribution by Property Combinations. For most models, combinations among subject, number, and position contribute most to the failure.
  • Figure 5: Error Distribution by Property Combination for Gemini-2.5 Flash. Among the most challenging combinations, failures concentrate mainly in the decomposition step.
  • ...and 2 more figures