HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Chenxin Tao, Shiqian Su, Xizhou Zhu, Chenyu Zhang, Zhe Chen, Jiawen Liu, Wenhai Wang, Lewei Lu, Gao Huang, Yu Qiao, Jifeng Dai

TL;DR

HoVLE tackles the performance gap between monolithic and compositional vision-language models by introducing a Holistic Vision-Language Embedding that maps images and text into a unified space for an LLM to process. Its multi-stage training pipeline keeps the pre-trained LLM frozen until the final stage: distillation on unpaired images and text teaches the embedding module to reproduce the features of a pre-trained vision encoder and the LLM's text embeddings, alignment then trains the module against a frozen LLM, and an instruction-tuning phase follows. Empirical results across 17 multi-modal benchmarks show HoVLE is competitive with state-of-the-art compositional approaches and substantially outperforms prior monolithic methods, with the HD variant delivering stronger VQA performance. The approach demonstrates that high-performance monolithic VLMs can be built using unpaired data and structured distillation.

Abstract

The rapid advance of Large Language Models (LLMs) has catalyzed the development of Vision-Language Models (VLMs). Monolithic VLMs, which avoid modality-specific encoders, offer a promising alternative to the compositional ones but face the challenge of inferior performance. Most existing monolithic VLMs require tuning pre-trained LLMs to acquire vision abilities, which may degrade their language capabilities. To address this dilemma, this paper presents a novel high-performance monolithic VLM named HoVLE. We note that LLMs have been shown capable of interpreting images, when image embeddings are aligned with text embeddings. The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs. Therefore, HoVLE introduces a holistic embedding module that converts visual and textual inputs into a shared space, allowing LLMs to process images in the same way as texts. Furthermore, a multi-stage training strategy is carefully designed to empower the holistic embedding module. It is first trained to distill visual features from a pre-trained vision encoder and text embeddings from the LLM, enabling large-scale training with unpaired random images and text tokens. The whole model further undergoes next-token prediction on multi-modal data to align the embeddings. Finally, an instruction-tuning stage is incorporated. Our experiments show that HoVLE achieves performance close to leading compositional models on various benchmarks, outperforming previous monolithic models by a large margin. Model available at https://huggingface.co/OpenGVLab/HoVLE.
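The three training stages described above can be summarized as a small schedule of which parts are updated at each step. The following is only an illustrative sketch based on this summary: the names `Stage`, `STAGES`, and `trainable_modules` are hypothetical and do not come from the HoVLE code base, and the frozen/trainable flags follow the TL;DR and the Figure 3 caption rather than any released configuration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure, for illustration only)."""
    name: str
    train_embedding: bool  # is the holistic embedding module updated?
    train_llm: bool        # is the LLM updated?
    objective: str


# Assumption: flags follow the summary text (the LLM stays frozen until instruction tuning).
STAGES: Tuple[Stage, ...] = (
    Stage("distillation", True, False,
          "match vision-encoder features and LLM text embeddings on unpaired data"),
    Stage("alignment", True, False,
          "next-token prediction on multi-modal data with a frozen LLM"),
    Stage("instruction_tuning", True, True,
          "instruction tuning of the whole model"),
)


def trainable_modules(stage: Stage) -> List[str]:
    """Return the names of the modules that receive gradient updates in a stage."""
    modules = []
    if stage.train_embedding:
        modules.append("holistic_embedding")
    if stage.train_llm:
        modules.append("llm")
    return modules


if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: trains {trainable_modules(stage)} | {stage.objective}")
```

Laid out this way, the schedule shows that only the last stage touches the LLM weights, which is the property the summary associates with preserving language ability.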

Paper Structure

This paper contains 15 sections, 5 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Performance comparison on different benchmarks between compositional VLMs (dashed lines) and monolithic VLMs (solid lines). Previous monolithic VLMs exhibit a significant performance gap compared to compositional VLMs, while our HoVLE demonstrates capabilities competitive with state-of-the-art compositional VLMs.
  • Figure 2: Comparison of VLM architectures. (a) Compositional VLMs integrate pre-trained vision encoders with LLMs, using an extra connector to align image and text embeddings. (b) Existing Monolithic VLMs directly feed image and text inputs into LLMs, which require continual pre-training to gain visual abilities. (c) HoVLE uses a holistic embedding module to project image and text input to a unified embedding space, enabling LLMs to interpret images in a text-like manner. Blocks with the same color have the same Transformer layer architecture.
  • Figure 3: (a) The architecture of HoVLE. HoVLE first segments the input images into patches dynamically and tokenizes the input texts. The holistic embedding module then projects them into a unified space. Finally, the LLM processes these unified embeddings to produce the final outputs. (b) The training strategies of HoVLE. The distillation stage trains the holistic embedding module to distill a pre-trained vision encoder and the text embeddings of the LLM using unpaired random images and texts. The alignment stage combines the holistic embedding module with a frozen LLM, conducting auto-regressive training to align the vision-language embeddings. Instruction tuning further enhances HoVLE's overall ability by tuning the whole model.
  • Figure 4: Distillation data scaling performance.
  • Figure 5: Attention Maps for EVE, Emu3, InternVL2 and our HoVLE at the first and last layers of LLM backbones. Y-axis represents query tokens, and X-axis represents key tokens, with text modality tokens in gray and image modality tokens in yellow. All four models share the same input, but the sequence lengths of input tokens are different due to different image pre-processing. We highlight text-to-image attention below each full attention map. Our HoVLE, like the compositional InternVL2, has sparse attention across all network layers, while other monolithic models Emu3 and EVE have denser attention in shallow layers.
  • ...and 2 more figures
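The distillation stage in the Figure 3(b) caption uses two teachers on unpaired data: image outputs of the holistic embedding are matched to a pre-trained vision encoder, and text outputs are matched to the LLM's text embeddings. A minimal sketch of such a two-term objective is shown below. It makes several assumptions that are not stated in this summary: the distance is a plain mean squared error chosen only as a placeholder, the two terms are weighted equally, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors (placeholder distance)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_distance(student: List[Vector], teacher: List[Vector]) -> float:
    """Average distance between student outputs and teacher targets, token by token."""
    if len(student) != len(teacher) or len(student) == 0:
        raise ValueError("student and teacher need the same non-zero number of tokens")
    return sum(mse(s, t) for s, t in zip(student, teacher)) / len(student)


def distillation_loss(
    student_image: List[Vector],
    vision_teacher: List[Vector],
    student_text: List[Vector],
    text_teacher: List[Vector],
) -> float:
    """Sum of an image term and a text term.

    The image batch and the text batch are independent of each other,
    which is what allows training on unpaired images and text tokens.
    Equal weighting of the two terms is an assumption of this sketch.
    """
    image_term = mean_distance(student_image, vision_teacher)
    text_term = mean_distance(student_text, text_teacher)
    return image_term + text_term
```

Because each term only compares a student output with the teacher output for the same input, no image-caption pairing is needed at this stage; any pool of images and any pool of text tokens can be used.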