Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation

Jitesh Jain, Zhengyuan Yang, Humphrey Shi, Jianfeng Gao, Jianwei Yang

TL;DR

The work addresses the challenge that multimodal LLMs often underutilize rich visual perception signals when trained with natural language supervision alone. VisPer-LM introduces a vision-centric pretraining approach that distills knowledge from expert vision encoders into the LLM’s hidden representations via predictive embedding optimization, while using a single encoder at inference. Through extensive probing and experiments, VisPer-LM demonstrates improved visual perception abilities, with gains on CV-Bench tasks (up to 8.7% on the Depth task), and outperforms both single- and multi-encoder baselines across diverse benchmarks. This approach enables stronger spatial and depth reasoning in MLLMs without proportional increases in training data or inference latency, offering a practical path toward more capable embodied AI systems. The work also highlights the value of probing internal representations to guide architecture- and objective-level design choices for vision-language models.

Abstract

In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data, which are critical for tasks involving spatial reasoning in the domain of embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the hidden representations of the LLM within an MLLM. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the objective during the pretraining stage in MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to embedding optimization, underscoring the effectiveness of our probing setup. We demonstrate that our VisPer-LM outperforms the single and multi-encoder baselines, indicating our approach's advantage over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
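
The coupled objective described in the abstract can be illustrated with a minimal sketch in plain Python. It only shows the general shape of adding embedding losses to the next-token prediction (NTP) loss. The mean-squared-error form of the embedding loss, the single shared `weight`, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math


def next_token_loss(logits, target_id):
    """Cross-entropy for one next-token prediction, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def embedding_loss(predicted, target):
    """Mean squared error between two embeddings (assumed loss form)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def coupled_loss(logits, target_id, predictions, targets, weight=0.5):
    """NTP loss plus a weighted sum of per-task embedding losses.

    `predictions` and `targets` map a task name (for example "depth") to an
    embedding; `weight` is a hypothetical balancing coefficient.
    """
    ntp = next_token_loss(logits, target_id)
    emb = sum(embedding_loss(predictions[task], targets[task]) for task in targets)
    return ntp + weight * emb


# Toy example: uniform logits over four tokens and one "depth" target.
loss = coupled_loss(
    logits=[0.0, 0.0, 0.0, 0.0],
    target_id=1,
    predictions={"depth": [1.0, 0.0]},
    targets={"depth": [0.0, 0.0]},
)
print(round(loss, 4))
```

In this sketch the embedding term acts only as an auxiliary training signal: dropping it recovers plain NTP training, which matches the abstract's statement that the coupling is applied during the pretraining stage.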

Paper Structure

This paper contains 20 sections, 4 equations, 16 figures, 19 tables.

Figures (16)

  • Figure 1: Different Paradigms for Incorporating Visual Information into LLMs. (a, b) Existing approaches [liu2023improvedllava, tong2024cambrian1] feed features from the visual encoder(s) into the LLM and train the model solely with natural language supervision, i.e., next token prediction (NTP) to align the embedding space of the vision encoder(s) and the LLM. (c) We propose distilling target visual information into the intermediate representations of the LLM from a set of auxiliary vision encoders ($\mathbf{E}^\text{target}$). We adopt a predictive embedding [jepa] optimization approach at selected LLM layers during training to minimize the embedding losses and the NTP loss function, resulting in a vision-centric approach to training the Multimodal Large Language Model. We only use a single base vision encoder during inference.
  • Figure 2: Probing reveals a positive correlation between depth representation quality and performance on CV-Bench. (a) Increasing training data and using only the next-token prediction objective improves visual representation quality in the LLM, as well as downstream performance. (b) Our method, VisPer-LM, outperforms LLaVA-1.5 [liu2023improvedllava] in both probing and downstream tasks under the same settings.
  • Figure 3: Probing Visual Representation Quality across LLM layers in MLLMs. (1) As shown in the first row, the multi-encoder baseline has the best probing performance owing to the additional feature inputs. The performance of probes trained on our VisPer-LM falls between the two baselines, demonstrating the effectiveness of our embedding distillation approach in learning an improved projector while only using a single encoder during inference. (2) We observe that the probing performance for single-encoder models trained solely with natural language supervision improves as the training data of the base MLLM increases, indicating that the LLM improves its visual representations of the world with more training data. In the last row, we observe that our VisPer-LM (base setting) outperforms LLaVA-1.5 trained with more data during the PT stage, demonstrating the effectiveness of our approach with limited (data/compute) resources.
  • Figure 4: Architecture for VisPer-LM. During Pre-Training (PT), we optimize an embedding loss at specific layers for each target encoder: layers $d \in \mathbb{D}$, $s \in \mathbb{S}$, and $g \in \mathbb{G}$ for the depth, segmentation, and generation tasks, respectively. We use a resampler-based embedding predictor [perceiver], denoted as $\mathbf{P}^{l}_{\{\text{task}\}}$ at each layer $l$, to output predictions. Each predictor takes in two inputs: a set of learnable queries ($\mathbf{Q}^{\text{task}}$) and the token sequence from layer $l$, with special tokens for other tasks omitted. The task tokens are derived from the corresponding embedding predictor's learnable queries. During instruction fine-tuning (IFT), we train with only the next-token prediction objective while keeping the special tokens frozen so that their nature is unaffected, as we found this to perform better empirically (see the experiments section).
  • Figure I: Visualizing Embedding Predictor Outputs after the PT stage. The quality of the decoded representations indicates the effectiveness of our embedding optimization.
  • ...and 11 more figures
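
The embedding predictor described in the Figure 4 caption takes learnable queries and a layer's token sequence, with other tasks' special tokens omitted. A rough sketch of that data flow in plain Python follows. It reduces the resampler to a single cross-attention step without learned projections, which is a deliberate simplification, and the token tags (`"plain"`, `"depth"`, `"seg"`) and function names are hypothetical, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def drop_other_task_tokens(sequence, task):
    """Keep ordinary tokens and this task's special tokens only.

    `sequence` is a list of (kind, vector) pairs; the kinds are hypothetical tags.
    """
    return [vec for kind, vec in sequence if kind in ("plain", task)]


def cross_attend(queries, tokens):
    """Each query returns an attention-weighted average of the token vectors."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / scale for t in tokens]
        weights = softmax(scores)
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Toy layer output: one ordinary token, one depth token, one segmentation token.
sequence = [("plain", [1.0, 0.0]), ("depth", [0.0, 1.0]), ("seg", [5.0, 5.0])]
depth_queries = [[0.0, 0.0], [10.0, 0.0]]  # stand-ins for learnable queries

tokens = drop_other_task_tokens(sequence, "depth")
predicted = cross_attend(depth_queries, tokens)
print(predicted)
```

In a trained model the queries would be learned parameters and the predictor output would be compared against the target encoder's embedding; here the vectors are fixed toy values chosen so the filtering and attention steps are easy to follow.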