Sherlock: Self-Correcting Reasoning in Vision-Language Models

Yi Ding, Ruqi Zhang

TL;DR

Sherlock tackles the fragility and data demands of reasoning in vision-language models by introducing trajectory-level self-correction and self-improvement. Built on Llama3.2-Vision-11B-Instruct, it uses a 3-stage training pipeline (SFT cold-start, offline trajectory-level preference training with visual perturbations and a dynamic $\beta$, and online self-generated data) and achieves state-of-the-art results across eight multimodal benchmarks with only 20k annotated samples. It demonstrates that self-correction can be leveraged to both improve direct reasoning and enable continual self-improvement without external supervision, and that inference-time scaling with verifiers further boosts efficiency. The work suggests a generalizable path toward data-efficient, domain-robust reasoning in multimodal models by tightly coupling correction signals with preference-based learning.

Abstract

Reasoning Vision-Language Models (VLMs) have shown promising performance on complex multimodal tasks. However, they still face significant challenges: they are highly sensitive to reasoning errors, require large volumes of annotated data or accurate verifiers, and struggle to generalize beyond specific domains. To address these limitations, we explore self-correction as a strategy to enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning VLMs' self-correction abilities and identify key gaps. Based on our findings, we introduce Sherlock, a self-correction and self-improvement training framework. Sherlock introduces a trajectory-level self-correction objective, a preference data construction method based on visual perturbation, and a dynamic $\beta$ for preference tuning. Once the model acquires self-correction capabilities using only 20k randomly sampled annotated data, it continues to self-improve without external supervision. Built on the Llama3.2-Vision-11B model, Sherlock achieves remarkable results across eight benchmarks, reaching an average accuracy of 64.1 with direct generation and 65.4 after self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and LlamaV-o1 (63.4) while using less than 20% of the annotated data.

Paper Structure

This paper contains 73 sections, 16 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Example of Sherlock's self-correction ability. The direct generation contains errors that cause the trajectory to deviate from the correct path and result in an incorrect answer. Sherlock successfully corrects the previous response and obtains the correct answer.
  • Figure 2: Left: Overview of experimental settings for self-correction analysis. The blue block illustrates the Modified One Step process using Qwen2.5-7B-Instruct [yang2024qwen2], while the green block represents two correction strategies applied to direct generations: external critique-based correction and a self-correction prompt. Right: Reasoning performance of LLaVA-CoT [xu2024llava] and VL-Rethinker [wang2025vl] under different settings, evaluated on MMStar [chen2024we] and MathVista [lu2023mathvista].
  • Figure 3: Training pipeline of Sherlock, including: (Left) SFT cold-start stage, (Middle) offline preference training, and (Right) online iterative self-improvement. In the SFT and offline stages, we randomly sample 10k examples for $\mathcal{D}_A$ and 10k for $\mathcal{D}_B$, with ground truth, from the 100k LLaVA-CoT [xu2024llava] dataset as supervision. During the online stage, each iteration samples only 5k unlabeled inputs, from which a self-constructed and self-labeled dataset is built using the selection rule illustrated in the right panel.
  • Figure 4: Average accuracy across 8 benchmarks for ablation settings. "w/ $i=1$" indicates that the self-correction objective (Eq. \ref{eq:sc_loss}) is applied to the entire response instead of at the trajectory level.
  • Figure 5: Values of dynamic $\beta$ under different truncation steps $i$ and visual perturbation levels $\epsilon$.
  • ...and 2 more figures
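
The data budget described in the Figure 3 caption can be illustrated with a minimal sketch. It uses only the sizes stated there: a 100k pool, 10k examples each for $\mathcal{D}_A$ and $\mathcal{D}_B$, and 5k unlabeled inputs per online iteration. Everything else is an assumption made for illustration and not taken from the paper: the function name `split_annotation_budget`, the choice to keep the subsets disjoint, drawing online inputs from the remainder of the same pool, the number of online iterations (`n_iters`), and the random seed.

```python
import random


def split_annotation_budget(pool_size=100_000, n_a=10_000, n_b=10_000,
                            n_online=5_000, n_iters=2, seed=0):
    """Partition example indices into D_A, D_B, and per-iteration online batches.

    Hypothetical helper: disjoint subsets, the online source pool, and
    n_iters are assumptions, not details confirmed by the source.
    """
    rng = random.Random(seed)
    indices = list(range(pool_size))
    rng.shuffle(indices)

    d_a = indices[:n_a]                # labeled subset D_A (10k in the caption)
    d_b = indices[n_a:n_a + n_b]       # labeled subset D_B (10k in the caption)
    remaining = indices[n_a + n_b:]    # everything not used as labeled data

    # Each online iteration takes a fresh batch of inputs whose labels are ignored.
    online = [remaining[k * n_online:(k + 1) * n_online] for k in range(n_iters)]
    return d_a, d_b, online


d_a, d_b, online = split_annotation_budget()
labeled = len(d_a) + len(d_b)
print(f"labeled examples: {labeled}")
print(f"unlabeled inputs per online iteration: {[len(batch) for batch in online]}")
```

Under these assumptions the labeled total is 10k + 10k = 20k, matching the annotation budget quoted in the abstract, while the online batches add inputs without adding ground-truth labels.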