LangBridge: Interpreting Image as a Combination of Language Embeddings

Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng

TL;DR

LangBridge addresses how LVLMs bridge the vision-language gap by showing that MLPs progressively project visual features into subspaces spanned by corresponding text embeddings. It introduces Language Basis Vector Projection, representing visual embeddings as linear combinations of LLM vocabulary embeddings, enabling a pretraining-free adapter that can be reused across different LLMs. Empirical results demonstrate competitive performance with standard MLPs and robust cross-architecture transfer across Qwen and LLaMA families, with notable efficiency gains by avoiding repetitive pretraining. The work provides an interpretable grounding of visual information in vocabulary space and offers a practical pathway to scalable, multi-LLM LVLM deployments.

Abstract

Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms by which MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever the LLM backbone is switched. To address these limitations, we first investigate the working principles of MLP adapters and discover that they progressively learn to project visual embeddings into subspaces spanned by the corresponding text embeddings. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in the LLM's vocabulary embeddings, while its plug-and-play design ensures efficient reuse across multiple LLMs with almost no performance degradation. See our project page at https://curryx-001.github.io/LangBridge.github.io/
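
The abstract's observation that MLP adapters project visual embeddings into subspaces spanned by text embeddings can be made concrete with a small measurement. The sketch below is not the paper's analysis procedure; it is one assumed way to quantify "lies in the span", using an ordinary least-squares projection. The function name `subspace_residual` and the toy vectors are hypothetical.

```python
import numpy as np

def subspace_residual(visual_vec, text_basis):
    """Relative distance between a visual embedding and the span of text embeddings.

    visual_vec: shape (d,), one adapter output (hypothetical input).
    text_basis: shape (k, d), k text embeddings, one per row (hypothetical input).
    Returns a value in [0, 1]: 0 means the vector lies fully inside the span,
    1 means it is orthogonal to every text embedding.
    """
    # Least-squares coefficients c minimizing ||text_basis.T @ c - visual_vec||.
    coeffs, *_ = np.linalg.lstsq(text_basis.T, visual_vec, rcond=None)
    projection = text_basis.T @ coeffs
    return np.linalg.norm(visual_vec - projection) / np.linalg.norm(visual_vec)

# Toy check in R^3 with two text embeddings spanning the x-y plane.
basis = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
inside = subspace_residual(np.array([1.0, 2.0, 0.0]), basis)
outside = subspace_residual(np.array([0.0, 0.0, 1.0]), basis)
```

Under this assumed metric, a residual that shrinks over training checkpoints would be consistent with the progressive alignment the abstract describes.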

Paper Structure

This paper contains 28 sections, 7 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Comparison of different connector types in LVLMs: (a) The MLP directly maps visual features into the LLM’s text embedding space. (b) The Ovis method uses a visual embedding table to produce structural visual embeddings and align the modalities. (c) LangBridge decomposes visual features into weighted combinations of the LLM’s vocabulary vectors to form the visual embeddings.
  • Figure 2: Progressive semantic alignment in MLP adapters across training stages. Circular graphs show the evolution of visual-text token associations through four training phases (Pretrain-100, Pretrain-1000, Pretrain-2000, and Final-SFT). (a) For an image of a bunch of apples, the MLP progressively refines associations from meaningless text ("numerous", "Arts") to meaningful text ("five", "Green Apple"). (b) For a sunset silhouette scene and poem, semantic mappings evolve from meaningless text ("We Plat", "Kiss") to contextually relevant tokens ("Shakespeare", "Wed Kiss"), illustrating the MLP's increasing capability to project visual features into the LLM's text embedding space.
  • Figure 3: Overview of LangBridge architecture and workflow. (a) The architecture of LangBridge: LangBridge first extracts visual features through a Vision Encoder, then transforms them into visual embeddings by decomposing them into linear combinations of LLM's vocabulary embeddings. These visual embeddings are then concatenated with text embeddings for LLM processing. (b) Linear Combination: LangBridge generates probability distributions over the vocabulary and multiplies them with text embeddings to form visual embeddings. (c) Vocabulary Embedding Selection: A subset of shared vocabulary embeddings is selected to reduce parameter count and optimization complexity while enabling cross-LLM reuse. (d) Adapter Reuse: LangBridge shows cross-LLM adaptability, allowing an adapter pre-trained on one LLM to be reused on a different LLM during the SFT stage.
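
The linear-combination step in Figure 3(b) and the adapter reuse in Figure 3(d) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption only says that probability distributions over the vocabulary are multiplied with text embeddings, so the single linear layer `w_vocab` followed by a softmax is an assumption, as are the function name `langbridge_project`, the shapes, and the random stand-in data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def langbridge_project(visual_feats, w_vocab, vocab_embeds):
    """Map visual features to weighted combinations of vocabulary embeddings.

    visual_feats: (num_patches, d_vision) vision-encoder outputs.
    w_vocab:      (d_vision, vocab_subset) assumed learned projection.
    vocab_embeds: (vocab_subset, d_llm) embeddings of the selected shared
                  vocabulary subset, taken from the target LLM.
    Returns (visual_embeds, probs) with shapes
    (num_patches, d_llm) and (num_patches, vocab_subset).
    """
    probs = softmax(visual_feats @ w_vocab)   # distribution over the vocabulary
    visual_embeds = probs @ vocab_embeds      # weighted sum of vocabulary embeddings
    return visual_embeds, probs

# Random stand-ins; real inputs would come from a vision encoder and LLMs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))      # 4 patches, d_vision = 8
w = rng.normal(size=(8, 16))         # 16 shared vocabulary tokens
emb_a = rng.normal(size=(16, 32))    # hypothetical LLM A, d_llm = 32
emb_b = rng.normal(size=(16, 64))    # hypothetical LLM B, d_llm = 64

out_a, probs_a = langbridge_project(feats, w, emb_a)
out_b, probs_b = langbridge_project(feats, w, emb_b)
```

In this sketch the learned weights produce only the vocabulary distribution, which does not depend on the LLM's hidden size. Swapping `emb_a` for `emb_b` therefore changes the output dimension without touching `w`, which is one way to read why an adapter trained with one LLM could be reused with another over a shared vocabulary subset.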