ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

Xiyao Wang, Zhengyuan Yang, Chao Feng, Yongyuan Liang, Yuhang Zhou, Xiaoyu Liu, Ziyi Zang, Ming Li, Chung-Ching Lin, Kevin Lin, Linjie Li, Furong Huang, Lijuan Wang

TL;DR

ViCrit addresses the scarcity of vision-centric tasks that are both perceptually challenging and automatically verifiable by turning visual perception into a reinforcement learning proxy. It injects a subtle visual hallucination into a long caption and trains vision-language models to pinpoint the corrupted span, using a deterministic exact-match reward and a GRPO objective. The approach yields consistent improvements across a broad VL benchmark suite and transfers to abstract image reasoning and visual math, indicating learned perceptual strategies beyond memorization. To enable robust evaluation, ViCrit-Bench provides a fine-grained, category-balanced diagnostic set spanning four image domains and eight hallucination types, with strong correlations to general VL performance. Overall, ViCrit highlights the value of fine-grained, verifiable perceptual objectives for advancing visual perception in VLMs and offers a practical diagnostic and training framework for future multimodal systems.
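As a rough illustration of how a binary reward feeds a GRPO-style update, the sketch below computes group-relative advantages for several responses sampled for the same prompt. It assumes the commonly used formulation (reward minus group mean, divided by group standard deviation). The function name, the `eps` constant, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumed formulation: (reward - group mean) / (group std + eps).
    `eps` is a hypothetical stabilizer that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one (image, corrupted caption) pair;
# only the first one matched the injected span exactly.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

With a binary reward, a group in which every response is right (or every response is wrong) yields zero advantages, so only prompts with mixed outcomes contribute a learning signal under this formulation.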

Abstract

Reinforcement learning (RL) has shown great effectiveness for fine-tuning large language models (LLMs) using tasks that are challenging yet easily verifiable, such as math reasoning or code generation. However, extending this success to visual perception in vision-language models (VLMs) has been impeded by the scarcity of vision-centric tasks that are simultaneously challenging and unambiguously verifiable. To this end, we introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Starting from 200-word captions, we inject a single, subtle visual description error (altering a few words on objects, attributes, counts, or spatial relations) and task the model with pinpointing the corrupted span given the image and the modified caption. This formulation preserves the full perceptual difficulty while providing a binary, exact-match reward that is easy to compute and unambiguous. Models trained with the ViCrit task exhibit substantial gains across a variety of VL benchmarks. Crucially, the improvements transfer beyond natural-image training data to abstract image reasoning and visual math, showing promise of learning to perceive rather than merely memorizing seen objects. To facilitate evaluation, we further introduce ViCrit-Bench, a category-balanced diagnostic benchmark that systematically probes perception errors across diverse image domains and error types. Together, our results demonstrate that fine-grained hallucination criticism is an effective and generalizable objective for enhancing visual perception in VLMs.
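The task construction and reward described in the abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`inject_hallucination`, `normalize`, `exact_match_reward`) are hypothetical, and the light normalization (case, whitespace, surrounding punctuation) is an assumption about what "exact match" tolerates.

```python
import re


def inject_hallucination(caption: str, original_span: str, corrupted_span: str) -> str:
    """Swap a single span of the caption for a corrupted one.

    Requires the original span to occur exactly once, so the injected
    error has one unambiguous location.
    """
    if caption.count(original_span) != 1:
        raise ValueError("original_span must occur exactly once in the caption")
    return caption.replace(original_span, corrupted_span, 1)


def normalize(text: str) -> str:
    """Assumed normalization: collapse whitespace, trim punctuation, lowercase."""
    collapsed = re.sub(r"\s+", " ", text)
    return collapsed.strip(" .,;:!?\"'").lower()


def exact_match_reward(prediction: str, corrupted_span: str) -> float:
    """Binary reward: 1.0 if the predicted span equals the injected span."""
    return 1.0 if normalize(prediction) == normalize(corrupted_span) else 0.0


# Toy example (a short caption for readability; the paper uses long captions).
caption = "A brown dog sits on a red couch next to two cushions."
corrupted = inject_hallucination(caption, "red couch", "blue couch")
reward_hit = exact_match_reward("Blue couch.", "blue couch")
reward_miss = exact_match_reward("red couch", "blue couch")
```

Because the reward depends only on a string comparison against the known injected span, it needs no learned judge and cannot be gamed by fluent but wrong answers.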

Paper Structure

This paper contains 22 sections, 1 equation, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the ViCrit framework. Starting from high-quality image–caption pairs, we synthetically inject visual hallucinations by minimally altering noun phrases. The model is trained to localize incorrect spans in the caption given the image, receiving a verifiable reward through exact string matching. This fine-grained perceptual objective improves visual perception in vision-language models (VLMs) and generalizes to downstream reasoning tasks across diverse visual domains.
  • Figure 2: Instead of asking the model to write a paragraph-long caption that is hard to grade (e.g., the 200-word example above), ViCrit feeds the model an almost-correct caption containing a single, deliberately inserted visual hallucination and trains it to locate that error. The short, token-level response is just as demanding in terms of visual perception, yet it is far easier to verify automatically.
  • Figure 3: Data examples from ViCrit-Bench, which covers four image categories and eight visual hallucination types. We manually verify each image's long caption and carefully inject different kinds of appropriate visual hallucinations.
  • Figure 4: Hallucination task distribution of ViCrit-Bench.
  • Figure 5: Correlation between average VLM task performance and ViCrit-Bench performance (Task Avg. and Overall columns in Table \ref{tab:bench-exp}). Each point represents a different model, and the fitted linear regression line highlights a strong positive relationship, indicating that better ViCrit-Bench results are associated with stronger VLM capabilities.
  • ...and 1 more figures
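The kind of analysis summarized in the Figure 5 caption (a fitted regression line and a correlation between two per-model scores) can be reproduced with a short helper. The sketch below is illustrative only: the function name is hypothetical, and the numbers are toy values chosen to lie on a line, not results from the paper.

```python
from math import sqrt


def linear_fit_and_pearson(xs, ys):
    """Return (slope, intercept, pearson_r) for paired scores.

    Uses ordinary least squares for the line and the standard
    Pearson correlation coefficient.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for the fit to be defined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)


# Toy values only (NOT results from the paper): x plays the role of a
# diagnostic-benchmark score per model, y the role of an average task score.
slope, intercept, r = linear_fit_and_pearson([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

A value of `r` close to 1 corresponds to the strong positive relationship the caption describes. With only a handful of models, such a correlation should be read as suggestive rather than conclusive.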