MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin

Abstract

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To construct such workflow-style data at scale, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.

Paper Structure (46 sections, 4 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: MM-CondChain targets visually grounded deep conditional reasoning beyond prior benchmarks. Top: existing benchmarks typically evaluate either shallow single-layer visual compositions or independent instruction constraints. Bottom left: MM-CondChain introduces nested, inter-layer conditional chains with rich intra-layer compositional predicates, where a minimally perturbed condition can create a hard negative that changes the execution path and causes early termination. Bottom right: experiments show that even advanced MLLMs achieve limited performance on this benchmark, highlighting visually grounded deep compositional reasoning as a fundamental challenge.
  • Figure 2: Overview of the MM-CondChain agentic synthesis pipeline. Given a multimodal input, the Planner iteratively extends a conditional chain: at each layer, structured facts are extracted, a VPIR predicate pair is generated and verified via code execution, and the logic is rendered into natural language. The Composer then compiles the verified chain into paired True-path and False-path instances for evaluation.
  • Figure 3: Top attribute frequencies in extracted facts and VPIR variables across domains. (a,c,e) show the top 20 attributes in extracted facts for the Natural, Chart, and GUI domains, respectively; (b,d,f) show the top 20 variables used in VPIR predicates for the corresponding domains.
  • Figure 4: Logic pattern composition of VPIR expressions. Left: overall distribution of high-level VPIR logic families. Middle: top-20 dominant concrete VPIR templates. Right: an example showing how a VPIR template is instantiated into executable predicates and natural-language conditions.
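
The captions for Figures 2 and 4 describe a predicate pair that is verified by code execution and then chained into True-path and False-path instances. The following is a minimal sketch of that idea, not the authors' implementation: the actual VPIR format is not shown in this excerpt, so the sketch assumes that extracted facts are a flat dictionary and that each predicate is a Python expression string. The names `Layer`, `evaluate`, `verify_layer`, and `follow_chain`, and all fact values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

Facts = Dict[str, Any]  # assumed shape: attribute name -> extracted value


@dataclass
class Layer:
    """One chain layer (hypothetical structure, not the paper's schema)."""
    facts: Facts      # structured facts extracted from the visual input
    true_pred: str    # expression expected to hold on the facts
    false_pred: str   # minimally perturbed expression expected to fail


def evaluate(expr: str, facts: Facts) -> bool:
    """Execute a predicate expression with the facts as its only variables."""
    # Builtins are emptied so the expression can only read the facts.
    return bool(eval(expr, {"__builtins__": {}}, dict(facts)))


def verify_layer(layer: Layer) -> bool:
    """Accept a layer only if the pair behaves as intended when executed."""
    return evaluate(layer.true_pred, layer.facts) and not evaluate(
        layer.false_pred, layer.facts
    )


def follow_chain(layers: List[Layer], predicates: List[str]) -> Optional[int]:
    """Return the index of the first failing layer, or None if all hold."""
    for index, (layer, pred) in enumerate(zip(layers, predicates)):
        if not evaluate(pred, layer.facts):
            return index  # early termination at this layer
    return None


# Illustrative facts and predicates (invented for this sketch).
layers = [
    Layer(
        facts={"dialog_visible": True, "color": "green"},
        true_pred='dialog_visible and color == "green"',
        false_pred='dialog_visible and color == "red"',
    ),
    Layer(
        facts={"button_count": 3},
        true_pred="button_count >= 2",
        false_pred="button_count >= 4",
    ),
]

# True path: every layer uses its verified true predicate.
true_path = follow_chain(layers, [layer.true_pred for layer in layers])

# False path: the second layer is swapped for its perturbed counterpart.
false_path = follow_chain(layers, [layers[0].true_pred, layers[1].false_pred])
```

In this sketch the True path runs to completion, while perturbing a single layer makes the chain stop at that layer, which mirrors how a hard negative changes the execution path. Evaluating expression strings with `eval` is a shortcut chosen for brevity here; a real pipeline would more likely run predicates in a sandbox.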