Libra: Building Decoupled Vision System on Large Language Models

Yifan Xu, Xiaoshan Yang, Yaguang Song, Changsheng Xu

TL;DR

Libra tackles the challenge of integrating vision with large language models by proposing a decoupled vision system that preserves visual specificity while enabling cross-modal comprehension. It introduces a routed visual expert and a cross-modal bridge built on top of a frozen LLM, coupled with discrete auto-regressive vision modeling and a CLIP-based LFQ image tokenizer. Trained on a relatively small dataset (50M image-text pairs), Libra achieves competitive image-to-text performance and strong zero-shot capabilities across VQA and captioning benchmarks, while displaying diverse attention patterns and reduced learning redundancy. The approach suggests that maintaining separate visual representations and carefully designed cross-modal interaction is a promising path for scalable multimodal foundation models.
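
To make the LFQ (lookup-free quantization) tokenizer idea concrete, here is a minimal sketch of the generic sign-based scheme: each latent channel is snapped to -1 or +1, and the token id is read off the resulting bit pattern, so no nearest-neighbour codebook search is needed. The function names `lfq_quantize` and `lfq_decode` are hypothetical, and Libra's actual tokenizer (CLIP features, training losses, codebook size) may differ from this illustration.

```python
def lfq_quantize(latent):
    """Quantize one latent vector with sign-based lookup-free quantization."""
    # Each channel is snapped to -1 or +1; no codebook lookup is involved.
    codes = [1.0 if x > 0 else -1.0 for x in latent]
    # The token id is the integer whose i-th bit is set when channel i is positive.
    token_id = sum((1 << i) for i, c in enumerate(codes) if c > 0)
    return codes, token_id


def lfq_decode(token_id, num_channels):
    """Recover the +/-1 code vector from a token id (inverse of the bit packing)."""
    return [1.0 if (token_id >> i) & 1 else -1.0 for i in range(num_channels)]


# Illustrative 4-channel latent (made-up values, not from the paper).
codes, token_id = lfq_quantize([0.7, -1.2, 0.1, -0.4])
```

With `num_channels` binary channels the implicit vocabulary has `2 ** num_channels` entries, which is why the scheme scales to large vocabularies without storing or searching a codebook.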

Abstract

In this work, we introduce Libra, a prototype model with a decoupled vision system on a large language model (LLM). The decoupled vision system decouples inner-modal modeling and cross-modal interaction, yielding unique visual information modeling and effective cross-modal comprehension. Libra is trained through discrete auto-regressive modeling on both vision and language inputs. Specifically, we incorporate a routed visual expert with a cross-modal bridge module into a pretrained LLM to route the vision and language flows during attention computing to enable different attention patterns in inner-modal modeling and cross-modal interaction scenarios. Experimental results demonstrate that the dedicated design of Libra achieves a strong MLLM baseline that rivals existing works in the image-to-text scenario with merely 50 million training samples, providing a new perspective for future multimodal foundation models. Code is available at https://github.com/YifanXu74/Libra.
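
The routing idea in the abstract can be illustrated with a small sketch: each token carries a modality flag, and the flag decides which set of projection weights is applied to it. This is an assumption-laden toy, not the authors' implementation. The names `linear` and `routed_projection` are hypothetical, only a single projection is shown, and the cross-modal bridge module is omitted entirely.

```python
def linear(weights, x):
    """Apply a weight matrix (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def routed_projection(tokens, is_vision, llm_weights, expert_weights):
    """Project each token with the parameters that match its modality."""
    out = []
    for x, vis in zip(tokens, is_vision):
        # ASSUMPTION: vision tokens use the visual expert's weights,
        # while text tokens use the pretrained LLM's weights.
        weights = expert_weights if vis else llm_weights
        out.append(linear(weights, x))
    return out


# Toy weights: identity for the LLM path, a 2x scaling for the visual expert.
llm_weights = [[1.0, 0.0], [0.0, 1.0]]
expert_weights = [[2.0, 0.0], [0.0, 2.0]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
is_vision = [True, False]
projected = routed_projection(tokens, is_vision, llm_weights, expert_weights)
```

Because the text path only ever touches `llm_weights`, the language model's behaviour on text tokens is unchanged by the added expert, which is the point of decoupling the two flows.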

Paper Structure (23 sections, 10 equations, 7 figures, 6 tables)

This paper contains 23 sections, 10 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Libra investigates a decoupled vision system on the pretrained LLM. The vision system is built with a routed visual expert design. We train Libra through discrete auto-regressive modeling. The vision inputs consist of a hybrid of contiguous signals from the vision encoder and discrete "word" embeddings constructed based on the tokenized ids. $\texttt{<EOS>}$ is the end-of-sequence token. In practice, the discrete ids are used to construct discrete vision embeddings from a codebook learned by auto-regressive image modeling of Libra.
  • Figure 2: Image reconstruction results. Directly replacing the image encoder of VQGAN with CLIP distorts the visual information. Libra largely alleviates this problem via lookup-free quantization.
  • Figure 3: Results of visual sequential modeling.
  • Figure 4: Attention patterns across layers. (a) Attention activation of single-word answers on images. (b) Cross-layer attention difference: the difference between each layer's attention score (averaged across all heads) and the mean value of all layers, averaged along the spatial dimension. (c) Inner-layer attention difference: the difference between each head's attention score and the mean value of all heads in each layer, averaged along the spatial dimension. The implementation details can be found in Sec. \ref{sec:attn_diff}.
  • Figure 5: Image reconstruction results of the image tokenizers in Libra and DALL-E 2 \cite{dalle2}.
  • ...and 2 more figures
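
The two statistics described in the Figure 4 caption can be sketched as follows, for attention scores stored as `attn[layer][head][position]`. This is a minimal illustration under stated assumptions: the caption does not say whether the difference is signed or absolute, so the sketch assumes an absolute difference, and the function names are hypothetical. The paper's own implementation details are in the section the caption refers to.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def cross_layer_difference(attn):
    """Return one score per layer from attn[layer][head][position]."""
    num_pos = len(attn[0][0])
    # Average over heads to get one spatial attention map per layer.
    layer_maps = [[mean(head[p] for head in layer) for p in range(num_pos)]
                  for layer in attn]
    # Mean spatial map over all layers.
    mean_map = [mean(m[p] for m in layer_maps) for p in range(num_pos)]
    # ASSUMPTION: absolute difference, then averaged over spatial positions.
    return [mean(abs(m[p] - mean_map[p]) for p in range(num_pos))
            for m in layer_maps]


def inner_layer_difference(attn):
    """Return one score per head, per layer, from attn[layer][head][position]."""
    num_pos = len(attn[0][0])
    result = []
    for layer in attn:
        # Mean spatial map over the heads of this layer only.
        mean_map = [mean(head[p] for head in layer) for p in range(num_pos)]
        # ASSUMPTION: absolute difference, then averaged over spatial positions.
        result.append([mean(abs(head[p] - mean_map[p]) for p in range(num_pos))
                       for head in layer])
    return result


# Toy input: 2 layers, 2 heads, 2 spatial positions (made-up values).
attn = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0: the two heads attend to different positions
    [[1.0, 0.0], [1.0, 0.0]],  # layer 1: both heads attend to the same position
]
cross = cross_layer_difference(attn)
inner = inner_layer_difference(attn)
```

In the toy input, layer 1 has identical heads and therefore zero inner-layer difference, whereas layer 0's heads disagree and score higher; larger values indicate more diverse attention patterns.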