Latent Chain-of-Thought for Visual Reasoning

Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao

TL;DR

This paper reframes visual reasoning in Large Vision-Language Models (LVLMs) as posterior inference over latent chains of thought (CoTs), targeting gaps in generalization and interpretability. The authors develop LaCoT, an amortized variational inference (AVI) framework built on GFlowNets, with three components: token-level reward approximation (ISubTB), reference-guided exploration (RGFN), and Bayesian inference over latent rationales (BiN) for inference-time scaling. Empirical results on seven reasoning benchmarks show that LaCoT improves reasoning accuracy, rationale diversity, and inference efficiency, outperforming SFT and RL-based baselines on several tasks. The approach offers a principled and scalable route to more robust, interpretable visual reasoning in LVLMs.

Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
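The abstract's idea of ranking answers by a marginal likelihood can be illustrated with a small sketch. This is not the authors' implementation: it assumes that, for each candidate answer, we already have the log-likelihoods $\log P(Y \mid X, z_i)$ under $N$ sampled rationales $z_i$, and it estimates the marginal with a plain Monte Carlo average. The function names `log_mean_exp` and `rank_answers` and all numbers are hypothetical.

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    m = max(log_values)
    total = sum(math.exp(v - m) for v in log_values)
    return m + math.log(total / len(log_values))


def rank_answers(log_liks):
    """Rank candidate answers by an estimated log marginal likelihood.

    log_liks maps each candidate answer to a list of log P(answer | x, z_i),
    one entry per sampled rationale z_i (an assumed input format).
    Returns (answer, score) pairs sorted best first.
    """
    scores = {ans: log_mean_exp(vals) for ans, vals in log_liks.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical likelihoods for two candidate answers under three rationales.
example = {
    "A": [math.log(0.6), math.log(0.5), math.log(0.7)],
    "B": [math.log(0.3), math.log(0.4), math.log(0.2)],
}
ranking = rank_answers(example)
best_answer, best_score = ranking[0]
```

Averaging over rationales, rather than scoring each sampled chain separately as in Best-of-N, means an answer is favored when many plausible rationales support it, not only when one chain assigns it a high score.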

Paper Structure

This paper contains 20 sections, 1 theorem, 16 equations, 11 figures, 7 tables.

Key Result

Proposition 1

Let $R(z_{1:t}\top)=\log P(Xz_{1:t}Y)$ be a joint-likelihood reward function. (a) If $R(z_{1:t})$ and $R(z_{1:t+\lambda})$ are true rewards and the intermediate rewards within a region of length $\lambda$ increase by a constant increment, then we can approximate the reward at step $t+i$ (where $0 \leq i \leq \dots$) [...] (b) If $\lambda$ is short enough, the interpolation reward error stays close to 0 and the flow betw[...]
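As a rough illustration of part (a), the sketch below fills in rewards between two true rewards at the endpoints of a region of length $\lambda$, assuming a constant increment per step. It uses the standard linear interpolation formula, which may differ in form from the paper's exact expression; the function name `interpolate_rewards` and the reward values are hypothetical.

```python
def interpolate_rewards(r_start, r_end, lam):
    """Linearly interpolate rewards across a region of length lam.

    r_start and r_end are the true rewards at the two endpoints of the
    region; the intermediate rewards are assumed to change by a constant
    increment. Returns lam + 1 values, one per step from the first
    endpoint to the last.
    """
    if lam <= 0:
        raise ValueError("lam must be a positive integer")
    increment = (r_end - r_start) / lam
    return [r_start + i * increment for i in range(lam + 1)]


# Hypothetical log-likelihood rewards at the two endpoints, with lam = 4.
rewards = interpolate_rewards(-12.0, -8.0, 4)
```

Only the two endpoint rewards require a real evaluation; the intermediate values are cheap estimates, which is why a shorter region keeps the approximation error small at the cost of more true reward evaluations.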

Figures (11)

  • Figure 1: Comparison of different training algorithms for visual reasoning. PPO implicitly approximates the rationale distribution but tends to under‑represent its full diversity due to limited exploration constrained by its reference policy (e.g., the SFT model), and it heavily relies on a critic (reward) model. In contrast, AVI explicitly estimates the true target posterior $P(Z|X,Y)$ through latent rationales, which promote diverse trajectories and inherently prevent reward hacking.
  • Figure 2: Within a complete rationale sequence, we compute the actual reward after each $\lambda$ steps and adopt a linear interpolation strategy to estimate the intermediate steps.
  • Figure 3: Allowing the policy model to explore the state space without constraint causes the catastrophic forgetting issue. The proposed reference-guided exploration effectively addresses this problem.
  • Figure 4: Inference pipeline of BiN.
  • Figure 5: Input sequence of training a reasoning LVLM. We use token to represent learnable parts. Specifically, the fine-tuned reasoning LVLM heavily relies on annotated data during optimization, and the object tokens followed by Assistant enforce reasoning for all instructions. We introduce a new role token Analyzer, so the model can selectively provide reasoning steps.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • proof