Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding

Beomsik Cho, Jaehyung Kim

TL;DR

This work proposes ReVisiT, a simple training-free decoding method that references vision tokens to guide text generation and leverages the semantic information embedded within vision tokens by projecting them into the text token distribution.

Abstract

Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. However, how vision information contributes to the model's decoding process remains under-explored, as reflected in frequent hallucinations. Through a series of analyses, we found that (i) vision tokens provide meaningful visual information even when hallucinations occur, and (ii) their semantics are encoded in the textual space and become explicit under appropriate vocabulary constraints. Building on these observations, we propose ReVisiT, a simple training-free decoding method that references vision tokens to guide text generation. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution. Specifically, ReVisiT dynamically selects the most relevant vision token at each decoding step via context-aware constrained divergence minimization, and uses its constrained projection to refine the output distribution so that it better incorporates visual semantics. Across five benchmarks on recent LVLMs, ReVisiT consistently enhances visual grounding with minimal computational overhead, and achieves competitive or superior results to state-of-the-art decoding baselines while reducing computational cost by up to $2\times$.
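The decoding step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `revisit_step_sketch`, the threshold rule controlled by `alpha`, the use of KL divergence as the selection criterion, and the multiplicative refinement are all assumptions chosen for illustration, since the abstract does not specify them. The sketch also assumes that vision token hidden states can be projected through the language model head `lm_head`.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def revisit_step_sketch(text_logits, vision_hidden, lm_head, alpha=0.1):
    """Illustrative single decoding step (hypothetical helper).

    text_logits:   (V,)   logits for the next text token
    vision_hidden: (N, d) hidden states of the N vision tokens
    lm_head:       (V, d) output projection shared with the text tokens
    alpha:         assumed threshold for the vocabulary constraint
    """
    p_full = softmax(text_logits)

    # (1) Constrain the vocabulary. Assumed rule: keep tokens whose
    # probability is at least alpha times the maximum probability.
    keep = np.flatnonzero(p_full >= alpha * p_full.max())
    p_base = p_full[keep] / p_full[keep].sum()

    # (2) Project every vision token onto the constrained vocabulary
    # and pick the one closest to the base distribution (assumed: KL).
    p_vis = softmax(vision_hidden @ lm_head[keep].T)  # (N, |keep|)
    eps = 1e-12
    kl = np.sum(p_vis * (np.log(p_vis + eps) - np.log(p_base + eps)), axis=-1)
    best = int(np.argmin(kl))

    # (3) Refine the output distribution with the selected projection.
    # Assumed combination: element-wise product, then renormalize.
    refined = p_base * p_vis[best]
    refined = refined / refined.sum()

    out = np.zeros_like(p_full)
    out[keep] = refined
    return out, best, keep
```

Because the vision token hidden states are already available from the forward pass, a step of this shape needs only one extra matrix product over the constrained vocabulary, which is consistent with the abstract's claim of minimal computational overhead.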

Paper Structure

This paper contains 49 sections, 21 equations, 10 figures, 15 tables, 1 algorithm.

Figures (10)

  • Figure 1: ReVisiT across various model sizes and architectures. We evaluate on the CHAIR benchmark and report F1. Consistent improvements hold across different size buckets (7–8B / 13–14B / 26–32B) and architectures, demonstrating scalability and model-agnostic effectiveness. Full results are provided in Table \ref{tab:appendix_chair_by_model}.
  • Figure 2: Motivation. We qualitatively analyzed various vision tokens with LLaVA-1.5-7B (Liu et al., 2024). Dotted arrows represent the projection of a vision token over the specified vocabulary set. Each box represents a text token distribution and is annotated with its top-5 most probable text tokens. The left part illustrates the effectiveness of the vocabulary constraint, whereas the right part shows the distribution shift during ReVisiT. See Appendix \ref{appendix:motivation-fig} for a detailed discussion of the underlying values and analysis.
  • Figure 3: Ground-truth objects frequently remain in top-probability predictions. Across 190 hallucinated steps, at least one ground-truth (GT) object is recalled within the top-500 predictions in 95.8% of cases; the top 500 tokens correspond to only 0.33% of the full vocabulary (151,665 tokens).
  • Figure 4: Vision token projection. Vision tokens reveal rich semantics when projected over semantically coherent subsets.
  • Figure 5: An overview of ReVisiT. At each decoding step, ReVisiT (1) constrains the vocabulary $\mathcal{V}$ to $\mathcal{V}_{\texttt{cons}}^t$, (2) projects vision token embeddings into $\mathcal{V}_{\texttt{cons}}^t$ and selects the most relevant token, and (3) refines the final output distribution. ReVisiT uses vision tokens as reference signals for decoding, enhancing visual grounding without additional forward passes or external supervision.
  • ...and 5 more figures
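
The statistic in the Figure 3 caption can be made concrete with a short sketch. The helper names `gt_in_top_k` and `top_k_vocab_fraction` are hypothetical and only illustrate how such a recall check and the vocabulary fraction might be computed; the paper's actual evaluation code may differ.

```python
import numpy as np

def gt_in_top_k(probs, gt_token_ids, k=500):
    """Return True if any ground-truth token id is among the k most
    probable tokens of `probs` (a 1-D next-token distribution)."""
    k = min(k, probs.shape[0])
    # argpartition places the indices of the k largest probabilities first
    # (in arbitrary order), which is enough for a membership test.
    top_k = np.argpartition(-probs, k - 1)[:k]
    return bool(np.isin(gt_token_ids, top_k).any())

def top_k_vocab_fraction(k, vocab_size):
    """Fraction of the vocabulary covered by the top-k tokens."""
    return k / vocab_size

# Values from the Figure 3 caption: top-500 out of 151,665 tokens.
print(f"{top_k_vocab_fraction(500, 151_665):.2%}")  # prints 0.33%
```

Averaging `gt_in_top_k` over all hallucinated decoding steps would give a recall rate of the kind reported in the caption.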