See, Think, Learn: A Self-Taught Multimodal Reasoner

Sourabh Sharma, Sonam Gupta, Sadbhawna

TL;DR

The paper addresses a bottleneck in multimodal reasoning for vision-language models (VLMs): the reliance on costly human-annotated or proprietary chain-of-thought (CoT) data. It introduces See-Think-Learn (STL), a self-training framework that enforces a see-before-thinking rationale structure and uses both positive and negative rationales to jointly improve perception and reasoning. In cross-domain experiments on M3CoT, STL achieves strong gains over answer-only and several self-training baselines, approaching the performance of training on human-annotated rationales in some domains. The approach offers a scalable, data-efficient path to more faithful and discriminative multimodal reasoning in VLMs.

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e., explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained only on answers or on self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance the multimodal reasoning ability of VLMs.

Paper Structure

This paper contains 19 sections, 10 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of reasoning generated by our “See-Think-Learn” (STL) framework with STaR (Zelikman et al., 2022) and R3V (Cheng et al., 2025). STL produces more detailed and perceptually grounded rationales, whereas STaR and R3V tend to overlook contextual cues and provide shorter, less comprehensive explanations.
  • Figure 2: An overview of our detailed “See-Think-Learn” (STL) framework. In this framework, each image-question pair with multiple choices, together with a positive rationale prompt, is fed into the VLM to generate a caption, reasoning, and conclusion. If the model predicts the correct answer, the tuple [Question, (Caption, Reasoning, Answer)] is stored as a Positive Rationale in the Rationale Trainset. The remaining incorrect choices are used to generate negative rationalizations, producing a caption and an explanation of why the choice is incorrect, which are stored as Negative Rationales [Question, (Caption, Explanation)]. The VLM is then iteratively fine-tuned on this dynamically constructed Rationale Trainset.
  • Figure 3: Prompt templates used for positive and negative rationalization in the STL framework.
  • Figure 4: Comparison of our “See-Think-Learn” (STL) framework with CoT Prompting. The example is taken from the Commonsense split of the M3CoT dataset (Chen et al., 2024). Unlike CoT prompting (a), our STL framework ((b) and (c)) effectively generates a detailed description and accurate reasoning for the image by leveraging the proposed Positive and Negative Rationale Prompts. In (a), the answer is incorrect, and the image description is missing. In (b), although a detailed description is provided, it is inaccurate. For example, it mentions a “fork” and “knife” that are not present in the image. In contrast, (c) produces both the correct answer and an accurate description, capturing key elements such as “serve” and “buffet”. Q: Question; O: Options.
  • Figure 5: Qualitative Comparison on Natural Science Domain. Qualitative analysis shows that STL (ours) produces more coherent and logically consistent explanations than STaR, indicating deeper understanding and more faithful reasoning.
  • ...and 5 more figures
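
The data-collection step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` class, the `generate` callable, the record field names, and the `toy_generate` stub are all hypothetical stand-ins for a real VLM call with the positive and negative rationale prompts. It also assumes that negative rationales are generated for every incorrect option regardless of whether the positive attempt succeeded, which the caption does not state explicitly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Sample:
    """One multiple-choice item (the image is omitted in this toy sketch)."""
    question: str
    options: List[str]
    answer: str  # gold option


# Hypothetical signature: (sample, mode, option) -> generated fields.
Generate = Callable[[Sample, str, Optional[str]], Dict[str, str]]


def build_rationale_trainset(samples: List[Sample], generate: Generate) -> List[dict]:
    """Collect positive and negative rationales for one self-training round."""
    trainset: List[dict] = []
    for s in samples:
        # Positive rationale: keep it only if the predicted answer is correct.
        pos = generate(s, "positive", None)
        if pos["answer"] == s.answer:
            trainset.append({
                "kind": "positive",
                "question": s.question,
                "caption": pos["caption"],
                "reasoning": pos["reasoning"],
                "answer": pos["answer"],
            })
        # Negative rationales: explain why each incorrect option is wrong.
        # (Assumption: done for every sample, not only correctly answered ones.)
        for opt in s.options:
            if opt == s.answer:
                continue
            neg = generate(s, "negative", opt)
            trainset.append({
                "kind": "negative",
                "question": s.question,
                "option": opt,
                "caption": neg["caption"],
                "explanation": neg["explanation"],
            })
    return trainset


def toy_generate(sample: Sample, mode: str, option: Optional[str]) -> Dict[str, str]:
    """Stub 'model' that always predicts the first option (illustration only)."""
    if mode == "positive":
        return {
            "caption": "people standing at a buffet table",
            "reasoning": "the scene suggests the first option",
            "answer": sample.options[0],
        }
    return {
        "caption": "people standing at a buffet table",
        "explanation": f"'{option}' does not match what is visible",
    }


demo_samples = [
    Sample("What are they doing?", ["serving", "sleeping", "running"], "serving"),
    Sample("Where is this?", ["kitchen", "buffet", "office"], "buffet"),
]
demo_trainset = build_rationale_trainset(demo_samples, toy_generate)
```

In the full framework the VLM would then be fine-tuned on the collected records and the collection repeated, so the train set changes from round to round; that fine-tuning step is left out here because it depends on the training stack.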