
CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT

Chengyi Du, Yazhe Niu, Dazhong Shen, Luxin Xu

TL;DR

This work tackles the gap in human-like visual reasoning for vision-language models (VLMs) by introducing CoTZero, an annotation-free framework that enforces hierarchical, verifiable reasoning. It combines a dual-stage data synthesis pipeline, which builds multi-granularity CoT data, with a cognition-aligned training regime that uses GRPO and Cognitively Coherent Verifiable Rewards (CCVR). In data synthesis, the bottom-up stage extracts atomic primitives as entity-relation-entity triples and generates atomic questions with lexically perturbed negatives; the top-down stage then decomposes complex questions to yield multi-level supervision. Experiments on a semantic inconsistency benchmark with lexical perturbations show gains in both in-domain and out-of-domain settings, indicating that structured CoT data and process-oriented rewards can substantially improve robust, human-like reasoning in VLMs.

Abstract

Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet they still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent, structured representations. As a result, they often miss higher-level semantic structure and form a non-causal understanding of relations, which hinders compositional and verifiable reasoning. To address these limitations by introducing models of human cognition into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. In the first component, we draw inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. In the cognition-aligned training component, built on the synthesized CoT data, we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT) to further strengthen VLMs' hierarchical reasoning and generalization, providing stepwise feedback on reasoning coherence and factual correctness. Experiments show that CoTZero achieves an F1 score of 83.33% on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
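
The TL;DR names GRPO as the optimizer for the reinforcement fine-tuning stage, and the abstract says the reward gives stepwise feedback on coherence and factual correctness. The sketch below illustrates only the general idea under stated assumptions: `toy_reward` and its weight `w_steps` are hypothetical stand-ins (the exact CCVR definition is not given in this section), and `group_relative_advantages` shows the usual group-normalized advantage used in GRPO-style training, where each sampled response is scored relative to the other responses for the same prompt.

```python
import statistics
from typing import List


def toy_reward(step_ok: List[bool], answer_ok: bool, w_steps: float = 0.5) -> float:
    """Hypothetical process-oriented reward, NOT the paper's CCVR.

    Mixes the fraction of reasoning steps judged coherent with a
    binary correctness term for the final answer.
    """
    step_score = sum(step_ok) / len(step_ok) if step_ok else 0.0
    return w_steps * step_score + (1.0 - w_steps) * float(answer_ok)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are compared against their siblings rather than against
    a learned value function.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses for one question (made-up judgments).
    group = [
        toy_reward([True, True], True),
        toy_reward([True, False], True),
        toy_reward([True, False], False),
        toy_reward([False, False], False),
    ]
    print(group_relative_advantages(group))
```

Because the normalization is done per group, a response only receives a positive advantage if it beats the average of its own group; when all responses in a group earn the same reward, every advantage is zero and that group contributes no gradient signal.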

Paper Structure (23 sections, 10 equations, 6 figures, 4 tables, 1 algorithm)


Figures (6)

  • Figure 1: Our annotation-free data generation pipeline. Starting from external image inputs, a VLM produces rich captions, which are structured into (entity, relation, entity) triples by an LLM. These triples are then transformed into yes/no QA pairs through controlled prompting. This pipeline enables scalable, fine-grained, and semantically consistent supervision without human annotation.
  • Figure 2: Comparison between our CoT process and previous CoT process.
  • Figure 3: Atomic-level questions derived from image captions are merged bottom-up into higher-level reasoning chains based on semantic similarity, forming a question hierarchy. In turn, complex questions are decomposed top-down to generate training data with multi-granularity supervision.
  • Figure 4: Data generation details.
  • Figure 5: The composition of our test set.
  • ...and 1 more figure
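
The Figure 1 caption describes turning (entity, relation, entity) triples into yes/no QA pairs, and the TL;DR mentions lexically perturbed negatives. The following is a minimal sketch of that step under stated assumptions: in the paper the conversion is done by prompting an LLM, whereas here a fixed question template and a small hand-written swap table (`RELATION_SWAPS`, hypothetical) stand in for it. The names `Triple`, `QAPair`, and `build_qa` are illustrative only.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


class QAPair(NamedTuple):
    question: str
    answer: str


# Hypothetical perturbation table: each relation maps to a lexically
# close relation that contradicts it. Not taken from the paper.
RELATION_SWAPS = {
    "on": "under",
    "left of": "right of",
    "in front of": "behind",
}


def to_question(t: Triple) -> str:
    """Render a triple with a fixed yes/no template."""
    return f"Is the {t.subject} {t.relation} the {t.obj}?"


def perturbed_negative(t: Triple) -> Optional[QAPair]:
    """Swap the relation for a contradicting one; skip unknown relations."""
    swapped = RELATION_SWAPS.get(t.relation)
    if swapped is None:
        return None
    return QAPair(to_question(t._replace(relation=swapped)), "no")


def build_qa(triples: List[Triple]) -> List[QAPair]:
    """Emit one positive question per triple, plus a negative when possible."""
    pairs: List[QAPair] = []
    for t in triples:
        pairs.append(QAPair(to_question(t), "yes"))
        negative = perturbed_negative(t)
        if negative is not None:
            pairs.append(negative)
    return pairs


if __name__ == "__main__":
    for pair in build_qa([Triple("cat", "on", "sofa")]):
        print(pair.question, "->", pair.answer)
```

Keeping the negative lexically close to the positive (one changed word) is what makes the pair a test of relational understanding rather than of surface keyword matching; relations without a known contradicting counterpart are skipped rather than guessed.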