Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Zhiyu Pan, Yizheng Wu, Jiashen Hua, Junyi Feng, Shaotian Yan, Bing Deng, Zhiguo Cao, Jieping Ye

TL;DR

Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning.

Abstract

Reasoning has emerged as a key capability of large language models. In linguistic tasks, this capability can be enhanced by self-improving techniques that refine reasoning paths for subsequent finetuning. However, extending these language-based self-improving approaches to vision language models (VLMs) presents a unique challenge: visual hallucinations in reasoning paths cannot be effectively verified or rectified. Our solution starts with a key observation about visual contrast: when presented with a contrastive VQA pair, i.e., two visually similar images with synonymous questions, VLMs identify relevant visual cues more precisely. Motivated by this observation, we propose Visual Contrastive Self-Taught Reasoner (VC-STaR), a novel self-improving framework that leverages visual contrast to mitigate hallucinations in model-generated rationales. We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR. Consequently, we obtain a new visual reasoning dataset, VisCoR-55K, which is then used to boost the reasoning capability of various VLMs through supervised finetuning. Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets, demonstrating that the inherent contrastive ability of VLMs can bootstrap their own visual reasoning. Project at: https://github.com/zhiyupan42/VC-STaR.

Paper Structure (15 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Contrasting makes the VLM see better. (a) Contrastive VQA pairs compel a more accurate response. (b) Compared with STaR [NEURIPS2022_639a9a17], a previous self-improving method that enhances the quality of reasoning with hints (ground-truth answers), contrasting with hints can rectify more cases. The blocks along the x-axis mark initial VLM failures. The color of each block indicates the outcome of rectification: green for success and gray for failure. The tested VLM is Qwen2.5-VL-7B [bai2025qwen25vltechnicalreport].
  • Figure 2: VisCoR-55K. We introduce the Visual Contrastive Reasoning dataset (VisCoR-55K), a new collection of 55K high-quality visual reasoning samples. Spanning the domains of general VQA, reasoning, math, graph/chart, and OCR, each sample is created by leveraging a contrastive counterpart to generate a faithful rationale. Rationales are shown in Sec. \ref{sec:A3}.
  • Figure 3: Contrastive VQA pair curation pipeline. To facilitate effective contrastive analysis, we curate corresponding challenging counterparts for VQA samples from a pool of diverse datasets. Each curated pair consists of two samples that share a synonymous question but feature distinct yet semantically similar images. Collected pairs are filtered by a difficulty-based sampling procedure.
  • Figure 4: Faithful rationale generation pipeline. A contrastive analysis can be obtained based on the curated contrastive VQA pair. Leveraging the property of VLMs illustrated in Fig. 1, the contrastive analysis is then used to trigger a rethinking procedure, which refines the naive rationale into a more faithful one. This pipeline is designed to generate rationales for supervised finetuning.
  • Figure 5: Qualitative comparison with the base model. The second row shows the direct response from the base model; the third row shows the response when the base model is prompted to "think step by step"; the last row shows the model improved with our VC-STaR. We highlight the key visual evidence with red boxes for clarity of visualization. More results are in Sec. \ref{sec:A3}.
  • ...and 4 more figures
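
The pair curation described in the Figure 3 caption (synonymous questions, distinct yet semantically similar images) can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding vectors, the dictionary keys (`id`, `q_emb`, `img_emb`), the function name `curate_pairs`, and all thresholds are hypothetical, and the difficulty-based sampling step mentioned in the caption is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate_pairs(samples, q_min=0.9, img_min=0.7, img_max=0.98):
    """Pick one contrastive counterpart per sample (assumed thresholds).

    A counterpart must have a near-synonymous question (q_sim >= q_min)
    and an image that is similar but not a near-duplicate
    (img_min <= img_sim <= img_max).
    """
    pairs = []
    for i, a in enumerate(samples):
        best = None  # (score, counterpart id)
        for j, b in enumerate(samples):
            if i == j:
                continue
            q_sim = cosine(a["q_emb"], b["q_emb"])
            img_sim = cosine(a["img_emb"], b["img_emb"])
            if q_sim >= q_min and img_min <= img_sim <= img_max:
                score = q_sim + img_sim
                if best is None or score > best[0]:
                    best = (score, b["id"])
        if best is not None:
            pairs.append((a["id"], best[1]))
    return pairs


# Toy embeddings standing in for real question and image encoders.
demo = [
    {"id": "a", "q_emb": [1.0, 0.0], "img_emb": [1.0, 0.0]},
    {"id": "b", "q_emb": [1.0, 0.1], "img_emb": [0.9, 0.3]},
    {"id": "c", "q_emb": [0.0, 1.0], "img_emb": [1.0, 0.05]},  # different question
    {"id": "d", "q_emb": [1.0, 0.0], "img_emb": [1.0, 0.0]},   # duplicate image of "a"
]
print(curate_pairs(demo))
```

In this toy run, sample "c" gets no counterpart because its question differs, and "a" and "d" are never paired with each other because their images are identical rather than merely similar. The upper bound on image similarity is one possible way to encode "distinct"; the actual criterion used for VisCoR-55K may differ.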