CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
Chengyi Du, Yazhe Niu, Dazhong Shen, Luxin Xu
TL;DR
This work tackles the gap in human-like visual reasoning for vision-language models (VLMs) by introducing CoTZero, an annotation-free framework that enforces hierarchical, verifiable reasoning. It combines a dual-stage data synthesis pipeline, which builds multi-granularity chain-of-thought (CoT) data, with a cognition-aligned training regime that uses GRPO and Cognitively Coherent Verifiable Rewards (CCVR). In data synthesis, the bottom-up stage extracts atomic primitives as entity-relation-entity triples and generates atomic questions with lexically perturbed negatives; the top-down stage then decomposes complex questions to yield multi-level supervision. On a multi-level semantic inconsistency benchmark with lexical-perturbation negatives, CoTZero reaches an F1 score of 83.33% across in-domain and out-of-domain settings, indicating that structured CoT data and process-oriented rewards can substantially improve robust, human-like reasoning in VLMs.
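The bottom-up stage described above can be illustrated with a minimal sketch. It assumes a triple is a plain (subject, relation, object) record and that a negative is formed by swapping the relation word. The `Triple` class, the `RELATION_SWAPS` table, and the yes/no question template are hypothetical choices made for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Triple:
    """An entity-relation-entity triple (hypothetical representation)."""
    subject: str
    relation: str
    obj: str


# Hypothetical lexical perturbations: each relation maps to a contradicting one.
RELATION_SWAPS = {"on": "under", "left of": "right of", "in front of": "behind"}


def atomic_question(t: Triple) -> str:
    """Render a triple as a yes/no atomic question (assumed template)."""
    return f"Is the {t.subject} {t.relation} the {t.obj}?"


def perturbed_negative(t: Triple) -> Optional[Triple]:
    """Return a lexically perturbed copy of the triple, or None if no swap is known."""
    swapped = RELATION_SWAPS.get(t.relation)
    if swapped is None:
        return None
    return Triple(t.subject, swapped, t.obj)


def build_atomic_pairs(triples: List[Triple]) -> List[Tuple[str, str]]:
    """Emit (question, answer) pairs: one positive per triple, plus a negative when possible."""
    pairs = []
    for t in triples:
        pairs.append((atomic_question(t), "yes"))
        neg = perturbed_negative(t)
        if neg is not None:
            pairs.append((atomic_question(neg), "no"))
    return pairs
```

Under these assumptions, a triple such as `Triple("cat", "on", "sofa")` yields one positive question and one negative question that differs by a single word, which is the kind of near-miss contrast a lexical-perturbation benchmark probes.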
Abstract
Recent advances in vision-language models (VLMs) have markedly improved image-text alignment, yet these models still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent, structured representations. As a result, they often miss higher-level semantic structure and fail to capture causal relations, which hinders compositional and verifiable reasoning. To address these limitations, we bring models of human cognition into the reasoning process and propose CoTZero, an annotation-free paradigm with two components: (i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method. The first component draws inspiration from neurocognitive accounts of compositional productivity and global-to-local analysis. In the bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question-reasoning forms. In the top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. The second component builds on the synthesized CoT data: we introduce Cognitively Coherent Verifiable Rewards (CCVR) in Reinforcement Fine-Tuning (RFT), which provide stepwise feedback on reasoning coherence and factual correctness to further strengthen VLMs' hierarchical reasoning and generalization. Experiments show that CoTZero achieves an F1 score of 83.33% on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
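To make the training signal more concrete, the sketch below shows how a reward covering both reasoning coherence and factual correctness could feed the group-relative advantage used by GRPO (Group Relative Policy Optimization), where each sampled response is scored against the mean and standard deviation of its own group. The abstract does not specify how CCVR combines its terms, so the weighted sum, the weights, and the function names here are assumptions for illustration only; the population standard deviation is likewise an assumed choice.

```python
from statistics import mean, pstdev
from typing import List


def composite_reward(coherence: float, correctness: float,
                     w_coherence: float = 0.5, w_correct: float = 0.5) -> float:
    """Hypothetical reward: weighted sum of a coherence score and a correctness score."""
    return w_coherence * coherence + w_correct * correctness


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and (population) std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored as (coherence, correctness) in [0, 1].
scores = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
rewards = [composite_reward(c, k) for c, k in scores]
advantages = group_relative_advantages(rewards)
```

In this toy group, the response that is both coherent and correct receives a positive advantage, the one that is neither receives a negative advantage, and the two partially right responses sit at the group mean with an advantage of zero.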
