Perceptual Flow Network for Visually Grounded Reasoning

Yangfu Li, Yuning Gong, Hongjian Zhan, Teng Li, Yuanhuiyi Lyu, Tianyi Chen, Qi Liu, Ziyuan Huang, Zhihang Zhong, Dandan Zheng, Yue Lu

Abstract

Despite the success of Large Vision-Language Models (LVLMs), general optimization objectives (e.g., standard MLE) fail to constrain visual trajectories, leading to language bias and hallucination. To mitigate this, current methods introduce geometric priors from visual experts as additional supervision. However, we observe that such supervision is typically suboptimal: it is biased toward geometric precision and offers limited reasoning utility. To bridge this gap, we propose the Perceptual Flow Network (PFlowNet), which eschews rigid alignment with expert priors and achieves interpretable yet more effective visual reasoning. Specifically, PFlowNet decouples perception from reasoning to establish a self-conditioned generation process. Building on this, it integrates multi-dimensional rewards with vicinal geometric shaping via variational reinforcement learning, thereby encouraging reasoning-oriented perceptual behaviors while preserving visual reliability. PFlowNet delivers a provable performance guarantee and competitive empirical results, setting new SOTA records on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Paper Structure

This paper contains 37 sections, 4 theorems, 98 equations, 17 figures, and 6 tables.

Key Result

Theorem 3.1

Under Assumptions A.1 and A.2, we suppose the valid support $\mathcal{S}_{\rm V}$ satisfies $d_{\rm eff}$-regularity, where $d_{\rm eff}$ is its effective dimension; thus, $\exists\kappa \ge 1$ such that $q \coloneqq s_{\mathcal{B}}/s_{\rm V} \ge \kappa(\varepsilon/\sigma)^{d_{\rm eff}}$. Suppose the

Figures (17)

  • Figure 1: Impact of evidence geometric precision (IoU w.r.t. the expert annotations) on reasoning performance (accuracy). The evidence with minimum and maximum precision corresponds to the full image and the expert annotation (outlined in red), respectively.
  • Figure 2: Illustration of feasible regions ($\mathcal{S}_{\rm v}$) and optimization objectives for visually grounded reasoning. Existing methods constrain LVLMs to imitate expert trajectories by maximizing geometric consistency with them, whereas PFlowNet integrates a reasoning-oriented reward with vicinal geometric shaping to achieve broader yet controlled exploration, leading to reliable and high-efficacy reasoning.
  • Figure 3: Overview of PFlowNet that consists of two decoupled stages: flow generation and flow-guided reasoning. We leverage a frozen reward model with the multi-dimensional reward to guide PFlowNet toward reasoning-oriented yet visually reliable perceptual flows. During reasoning, PFlowNet integrates the textual flow with corresponding visual features to derive interpretable and accurate answers.
  • Figure 4: Data pipeline for perceptual flow synthesis.
  • Figure 5: Statistics of the Cold-Start Dataset. Notably, as the number of RoIs increases, the average character length of the Planning State remains largely stable, whereas that of the Perceptual States grows substantially.
  • ...and 12 more figures

Theorems & Definitions (14)

  • Definition 2.1: Visually Grounded Reasoning
  • Definition 2.2: Perceptual Flow
  • Theorem 3.1: Total Variation Distance Bound
  • Remark 3.2: Limit Analysis w.r.t. $\lambda$
  • Remark 3.3: Limit Analysis w.r.t. $\varepsilon$
  • Theorem 3.4: Guaranteed Improvement over Baselines
  • Remark 3.5
  • Assumption A.1: Uniform Prior
  • Assumption A.2: Faithful Captioning with Non-Informative Prior
  • Lemma A.3: Reward Consistency
  • ...and 4 more