Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Lin Long, Changdae Oh, Seongheon Park, Sharon Li

TL;DR

This work presents the first systematic analysis of language prior through the lens of chain-of-embedding, examining the layer-wise representation dynamics within LVLMs. It also introduces the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the Visual Integration Point (VIP) to quantify how strongly the visual query influences response generation.

Abstract

Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP), that is, memorized textual patterns from pre-training, while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly the visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that the VIP consistently emerges and that TVI reliably predicts the strength of the language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
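As a rough illustration of the TVI idea described in the abstract, the sketch below compares two chains of embeddings (one with the image, one without) layer by layer and averages the distance from a given VIP onward. The cosine distance, the mean aggregation, the inclusion of the VIP layer itself, and the names `layer_distances` and `total_visual_integration` are all assumptions made for this example; the abstract does not specify them.

```python
import numpy as np


def layer_distances(z_vis, z_blind):
    """Per-layer cosine distance between two chains of embeddings.

    z_vis, z_blind: arrays of shape (num_layers, hidden_dim) holding the
    hidden state at each layer with and without the visual input.
    Cosine distance is an assumed choice of representation distance.
    """
    z_vis = np.asarray(z_vis, dtype=float)
    z_blind = np.asarray(z_blind, dtype=float)
    dot = np.sum(z_vis * z_blind, axis=1)
    norms = np.linalg.norm(z_vis, axis=1) * np.linalg.norm(z_blind, axis=1)
    # Guard against division by zero for all-zero hidden states.
    return 1.0 - dot / np.maximum(norms, 1e-12)


def total_visual_integration(z_vis, z_blind, vip):
    """Mean distance over layers from index `vip` onward (assumed aggregation)."""
    distances = layer_distances(z_vis, z_blind)
    return float(distances[vip:].mean())


# Toy chains: the image changes nothing in layers 0-1 and everything in layers 2-3.
z_blind = np.array([[1.0, 0.0]] * 4)
z_vis = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
tvi = total_visual_integration(z_vis, z_blind, vip=2)
```

Under this reading, a low value means the image barely moved the late-layer representations, which would be consistent with a response driven mostly by the language prior.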

Paper Structure

This paper contains 38 sections, 7 theorems, 21 equations, 11 figures, 4 tables.

Key Result

Theorem 5.1

Let $X=(X_v,X_t)\in\mathcal{X}$ be a random variable from $\mathcal{P}_{\text{VT}}$ or $\mathcal{P}_{\text{T}}$, and let $f_l:\mathcal{X}\rightarrow\mathcal{Z}$ be a layer stack from an LVLM $F_{\theta}$. For $\mathcal{P}_{\text{T}}$, define a density estimator $\hat{p}_{\text{T}}(Z^l):=\mathcal{N}(f_{l}\ldots$ ... where $\bar{\mathbf{H}}$ is the constant $H(p_{\text{VT}}(Z^l))-H(p_{\text{T}}(Z^l))$, and $\text{KL}$ ...

Figures (11)

  • Figure 1: Framework Overview. For data from two distributions, $\mathcal{P}_{\text{VT}}$ (vision-dependent) and $\mathcal{P}_{\text{T}}$ (vision-independent), we extract the chain-of-embedding for two queries, with and without visual input, and use the expected representation distance to locate the visual integration point $l^{*}$. Estimating total visual integration based on $l^{*}$ then allows us to quantify the LP of an LVLM per sample.
  • Figure 2: Visual Integration Point. We consistently observe a specific layer $l^{*}$ from which the distance between $Z^{l}_{\text{vis}}$ and $Z^{l}_{\text{blind}}$ clearly differs between the two groups $\mathcal{D}_{\text{VT}}$ and $\mathcal{D}_{\text{T}}$.
  • Figure 3: VIPs of different models observed across different datasets. Our framework, built on contrasting chain-of-embedding, allows us to consistently observe the VIP across multiple models and datasets, and further enables us to estimate TVI as a measure of language prior.
  • Figure 4: TVI under language priors of different strengths. TVI effectively discerns differences in LP strength, and thereby serves as a reliable measure of LP.
  • Figure 5: Ablations on model scales. VIP and dimension-normalized TVI analysis results for three variants of the Gemma-3 model family.
  • ...and 6 more figures
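
The VIP search described in the captions for Figures 1 and 2 can be sketched as follows: average the per-layer distance within each group and look for the first layer from which the vision-dependent group stays clearly above the vision-independent one. The function name `estimate_vip`, the `margin` threshold, and the "stays above for all later layers" rule are hypothetical choices for illustration, not the paper's estimator (Definition B.1).

```python
import numpy as np


def estimate_vip(dist_vt, dist_t, margin=0.05):
    """Return the first layer from which the group gap stays above `margin`.

    dist_vt, dist_t: arrays of shape (num_samples, num_layers) holding the
    per-layer distance between the vision and blind representations for
    vision-dependent and vision-independent samples respectively.
    Returns None when no such layer exists.
    """
    gap = np.mean(dist_vt, axis=0) - np.mean(dist_t, axis=0)
    above = gap > margin
    # stays_above[l] is True when every layer from l onward exceeds the margin.
    stays_above = np.logical_and.accumulate(above[::-1])[::-1]
    candidates = np.flatnonzero(stays_above)
    return int(candidates[0]) if candidates.size else None


# Toy example: the two groups separate from layer 2 onward.
dist_vt = np.array([[0.1, 0.1, 0.5, 0.6], [0.1, 0.1, 0.5, 0.6]])
dist_t = np.array([[0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1]])
vip = estimate_vip(dist_vt, dist_t)
```

Requiring the gap to persist through the final layer is one way to ignore a transient early spike; a different stability rule would be equally defensible.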

Theorems & Definitions (15)

  • Definition 2.1
  • Definition 3.2: Total visual integration estimator
  • Theorem 5.1
  • Theorem 5.2
  • Definition B.1: Visual integration point estimator
  • Lemma F.1
  • proof
  • Lemma F.2
  • proof
  • Theorem F.3: Restatement of Theorem \ref{thm:info}
  • ...and 5 more