Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Kesen Zhao, Beier Zhu, Qianru Sun, Hanwang Zhang

TL;DR

This work introduces UV-CoT, an unsupervised framework for image-level chain-of-thought reasoning in multimodal large language models that learns from preference-ordered, region-based data. It automatically generates seed regions, uses an evaluator model to rank the responses conditioned on each region, and trains with a Score-DPO loss that weights the degree of region-based preference, enabling iterative refinement without bounding-box annotations. Experiments on six datasets, plus zero-shot tests on four unseen datasets, show that UV-CoT achieves state-of-the-art or competitive performance without bounding-box labels and generalizes well to unseen domains. The approach reduces annotation cost while improving visual reasoning.
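To make the idea of a score-weighted preference loss concrete, here is a minimal sketch of one plausible form: the standard DPO objective with a margin proportional to the evaluator score gap. The function name `score_dpo_loss`, the margin form, and the parameters `beta` and `gamma` are assumptions for illustration; this summary does not state the exact Score-DPO formula.

```python
import math


def score_dpo_loss(
    logp_w: float,      # policy log-prob of the preferred response
    logp_l: float,      # policy log-prob of the dis-preferred response
    ref_logp_w: float,  # reference-model log-prob of the preferred response
    ref_logp_l: float,  # reference-model log-prob of the dis-preferred response
    score_w: float,     # evaluator score of the preferred response
    score_l: float,     # evaluator score of the dis-preferred response
    beta: float = 0.1,  # assumed DPO temperature
    gamma: float = 1.0, # assumed scale of the score margin
) -> float:
    """DPO loss with a score-gap margin (illustrative, not the paper's exact loss)."""
    # Implicit reward gap between the two responses, as in standard DPO.
    reward_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumption: a larger evaluator score gap demands a larger reward gap.
    margin = gamma * (score_w - score_l)
    z = reward_gap - margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With `gamma = 0` this reduces to the ordinary DPO loss; a positive `gamma` penalizes the policy more when it fails to separate pairs that the evaluator scored far apart.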

Abstract

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches focus on textual CoT, which limits their ability to leverage visual cues. Visual CoT remains underexplored, and the only existing work is based on supervised fine-tuning (SFT), which relies on extensive labeled bounding-box data and generalizes poorly to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one preferred and the other dis-preferred), eliminating the need for bounding-box annotations. We obtain such preference data through an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception (identifying key regions and reasoning based on them), UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT over state-of-the-art textual and visual CoT methods. Our zero-shot testing on four unseen datasets shows the strong generalization of UV-CoT. The code is available at https://github.com/kesenzhao/UV-CoT.
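The data generation pipeline described above can be sketched as follows. This is an illustrative outline only: `answer_fn` stands in for the target MLLM answering the question from a cropped region, `score_fn` stands in for the evaluator MLLM, and the names `PreferencePair`, `build_preference_pairs`, and `min_gap` are hypothetical, not taken from the authors' code.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple

# Assumed box convention: (x1, y1, x2, y2).
BBox = Tuple[float, float, float, float]


@dataclass
class PreferencePair:
    preferred: BBox      # region whose answer the evaluator scored higher
    dispreferred: BBox   # region whose answer the evaluator scored lower
    score_gap: float     # how strongly the evaluator preferred the winner


def build_preference_pairs(
    boxes: List[BBox],
    answer_fn: Callable[[BBox], str],  # placeholder for the target MLLM
    score_fn: Callable[[str], float],  # placeholder for the evaluator MLLM
    min_gap: float = 0.0,
) -> List[PreferencePair]:
    """Turn seed boxes into (preferred, dis-preferred) pairs via answer scores."""
    # Answer the question once per seed region, then score each answer.
    scored = [(box, score_fn(answer_fn(box))) for box in boxes]
    pairs: List[PreferencePair] = []
    for (box_a, s_a), (box_b, s_b) in combinations(scored, 2):
        gap = abs(s_a - s_b)
        if gap <= min_gap:
            continue  # ties (or near-ties) carry no preference signal
        win, lose = (box_a, box_b) if s_a > s_b else (box_b, box_a)
        pairs.append(PreferencePair(win, lose, gap))
    return pairs
```

No ground-truth box enters the computation: the only supervision is the evaluator's score of each region-conditioned answer, which acts as a proxy for region quality.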

Paper Structure

This paper contains 25 sections, 14 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Comparison of Visual-CoT (Shao et al., 2025) and our UV-CoT. Left: Visual-CoT relies on human-annotated bounding boxes to identify key regions. The model is trained via supervised fine-tuning to maximize the likelihood of the labeled data. Right: UV-CoT eliminates the need for human annotation. Given an image, the target model generates seed bounding boxes and answers the question based on each of these regions. An evaluator MLLM then scores the responses as a proxy for assessing region quality. Lastly, the target model is optimized via preference optimization by maximizing the likelihood of preferred regions over dis-preferred ones.
  • Figure 2: Illustration of UV-CoT reasoning.
  • Figure 3: (a&b) Bounding box evaluation on (a) training datasets and (b) zero-shot datasets. Our UV-CoT performs better than Visual-CoT. (c) Model performance under varying visual token sizes. Our UV-CoT demonstrates better token efficiency.
  • Figure 4: Visualization of preference data generated by the data generation algorithm (alg:process). Preferred BBoxes are in red. Dis-preferred BBoxes are in blue.
  • Figure 5: Visualization of our UV-CoT inference. Model-generated bounding boxes are shown in red.