Perception in Reflection

Yana Wei, Liang Zhao, Kangheng Lin, En Yu, Yuang Peng, Runpei Dong, Jianjian Sun, Haoran Wei, Zheng Ge, Xiangyu Zhang, Vishal M. Patel

TL;DR

Perception in Reflection tackles hallucinations and misperception in large vision-language models (LVLMs) by introducing Reflective Perception (RePer), a dual-model policy/critic mechanism, together with Reflective Perceptual Learning (RPL). It frames perception as an iterative perception-reflection loop, supported by a visual reflection dataset and reflective unlikelihood training, to progressively refine understanding. Across benchmarks, RePer improves image understanding and caption detail and reduces hallucinations, with attention patterns that align better with human focus. The work positions perception in reflection as a robust paradigm for future multimodal agents handling complex reasoning and multi-step manipulation.

Abstract

We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected to achieve perfect perception on the first attempt yet often fail to do so. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models to enable iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.
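The alternation between policy and critic described above can be pictured as a simple loop. The sketch below is illustrative only and is not the authors' implementation: the `Turn` record, the `policy` and `critic` callables, the `max_turns` and `threshold` parameters, and the toy stand-ins are all assumptions introduced here, and the stopping rule (a score threshold) is one plausible choice rather than something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """One round of the loop: the policy's answer and the critic's verdict."""
    answer: str
    score: float
    feedback: str


# Hypothetical interfaces (assumptions, not taken from the paper):
#   policy(image, question, history) -> answer
#   critic(image, question, answer)  -> (score in [0, 1], textual feedback)
Policy = Callable[[str, str, List[Turn]], str]
Critic = Callable[[str, str, str], Tuple[float, str]]


def reflective_perception(
    image: str,
    question: str,
    policy: Policy,
    critic: Critic,
    max_turns: int = 3,
    threshold: float = 0.9,
) -> List[Turn]:
    """Alternate policy and critic until the critic is satisfied or turns run out."""
    history: List[Turn] = []
    for _ in range(max_turns):
        # The policy sees all earlier answers and feedback when drafting again.
        answer = policy(image, question, history)
        score, feedback = critic(image, question, answer)
        history.append(Turn(answer, score, feedback))
        if score >= threshold:
            break
    return history


# Toy stand-ins so the loop can be run without any model.
def toy_policy(image: str, question: str, history: List[Turn]) -> str:
    drafts = [
        "people",
        "people near a castle",
        "a castle on a hill with people in front",
    ]
    return drafts[min(len(history), len(drafts) - 1)]


def toy_critic(image: str, question: str, answer: str) -> Tuple[float, str]:
    expected = ("castle", "hill", "people")
    missing = [word for word in expected if word not in answer]
    score = (len(expected) - len(missing)) / len(expected)
    feedback = "complete" if not missing else "missing: " + ", ".join(missing)
    return score, feedback


if __name__ == "__main__":
    for turn in reflective_perception("img", "Describe the image.", toy_policy, toy_critic):
        print(f"{turn.score:.2f} | {turn.answer} | {turn.feedback}")
```

In this toy run each draft mentions more of the expected objects, so the critic's score rises from turn to turn and the loop stops once the threshold is met. A real system would replace both stand-ins with model calls.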

Paper Structure

This paper contains 31 sections, 3 equations, 15 figures, 5 tables, 1 algorithm.

Figures (15)

  • Figure 1: Existing LVLMs are expected to deliver accurate perceptions initially, but humans often reflect and refine answers gradually. We introduce perception in reflection, employing policy and critic model interactions to fully harness perceptual capabilities.
  • Figure 2: Data construction pipeline of visual reflection dataset.
  • Figure 3: Inference pipeline of reflective perception.
  • Figure 4: Comparison of image attention maps between LLaVA-1.5 and RePer, highlighting RePer’s broader activation of image tokens and its ability to generate more detailed and accurate answers. While LLaVA-1.5 over-focuses on “people”, RePer correctly attends to the main subject, “castle,” progressively activating more relevant tokens for improved perception.
  • Figure 5: We use DALLE-3 as a text-to-image model to reconstruct images from generated captions. Compared to the original image, images reconstructed from LLaVA-1.5 captions lack key objects or include extraneous ones, indicating incomplete descriptions or hallucinations.
  • ...and 10 more figures