Semantically Grounded QFormer for Efficient Vision Language Understanding

Moulik Choraria, Xinbo Wu, Sourya Basu, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney

TL;DR

This work investigates efficiency bottlenecks in QFormer-based vision-language models and proposes grounding the QFormer by conditioning the LLM latent space with language-grounded prompts. By augmenting the QFormer with encoded prompts and focusing on accessing the LLM latent representations rather than the initial input embeddings, the approach improves learning efficiency and yields better single-task and multi-task performance, while maintaining competitive zero-shot results with smaller LLMs. The authors provide analyses of QFormer representations and cross-modal alignment, demonstrate empirical gains in captioning and VQA, and report pretraining-time advantages, with a clear path toward extending to decoder-only LLMs and scaling pretraining. The proposed framework offers a practical route to more efficient, scalable vision-language modeling and sets the stage for larger-scale experiments with web-scale data and bigger LLMs.

Abstract

General-purpose Vision Language Models (VLMs) have received tremendous interest in recent years, owing to their ability to learn rich vision-language correlations as well as their broad zero-shot competencies. One immensely popular line of work uses frozen unimodal models, bridging vision representations to language through a trainable module called the QFormer. However, this method relies heavily on large-scale multimodal pretraining with huge computational overheads. To that end, we propose a more efficient framework for QFormer-based vision-language alignment. Our key idea relies on the observation that QFormer latents correspond more strongly to the frozen LLM's intermediate latent space than to its input embeddings. Consequently, instead of using QFormer latents as inputs to the LLM, we alter the framework by using the latents to directly condition the LLM latent space for image-to-text generation. We demonstrate the effectiveness of our approach against existing baselines in improving the efficiency of vision-language pretraining.
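
To make the contrast in the abstract concrete, the following is a minimal toy sketch of the two data flows, not the authors' implementation. It assumes an encoder-decoder LLM in which the encoder output plays the role of the "intermediate latent space". Every name here (`llm_encoder`, `llm_decoder`, `toy_qformer`, `standard_forward`, `grounded_forward`) and every size is hypothetical, and the frozen LLM halves are replaced by random linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16            # shared hidden width (toy value, an assumption)
NUM_QUERIES = 4   # number of QFormer query latents (toy value, an assumption)

# Frozen toy stand-ins for the two halves of an encoder-decoder LLM (hypothetical).
W_ENC = rng.standard_normal((D, D)) / np.sqrt(D)
W_DEC = rng.standard_normal((D, D)) / np.sqrt(D)
# Query vectors of the toy QFormer (these would be trainable in a real model).
QUERIES = rng.standard_normal((NUM_QUERIES, D))

def llm_encoder(input_embeddings):
    """Map input embeddings to the LLM's intermediate latent space."""
    return np.tanh(input_embeddings @ W_ENC)

def llm_decoder(latents):
    """Consume intermediate latents; mean pooling stands in for cross-attention."""
    return np.tanh(latents.mean(axis=0) @ W_DEC)

def toy_qformer(image_feats, prompt_latents=None):
    """Queries attend over image features; optionally append encoded prompt latents."""
    scores = QUERIES @ image_feats.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    latents = weights @ image_feats                      # (NUM_QUERIES, D)
    if prompt_latents is not None:
        latents = np.concatenate([latents, prompt_latents], axis=0)
    return latents

def standard_forward(image_feats, prompt_embeds):
    # Standard QFormer: latents are fed to the LLM as if they were input embeddings.
    visual_tokens = toy_qformer(image_feats)
    llm_inputs = np.concatenate([visual_tokens, prompt_embeds], axis=0)
    return llm_decoder(llm_encoder(llm_inputs))

def grounded_forward(image_feats, prompt_embeds):
    # Grounded variant (as sketched here): the prompt is encoded by the frozen LLM,
    # and the QFormer output is placed directly in the latent space the decoder reads.
    prompt_latents = llm_encoder(prompt_embeds)
    latents = toy_qformer(image_feats, prompt_latents)
    return llm_decoder(latents)

image_feats = rng.standard_normal((10, D))    # 10 toy image patch features
prompt_embeds = rng.standard_normal((3, D))   # 3 toy prompt token embeddings
out_standard = standard_forward(image_feats, prompt_embeds)
out_grounded = grounded_forward(image_feats, prompt_embeds)
```

In this sketch the only structural difference is where the visual latents enter the frozen model: before the encoder in `standard_forward`, after it in `grounded_forward`. How the real model combines prompt and visual latents is not specified in this summary, so the concatenation above is purely illustrative.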

Paper Structure

This paper contains 15 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Experiment studying the ease of modeling input embeddings vs. encoder representations. The lower error indicates that the QFormer (BERT init) finds the latter easier to model.
  • Figure 2: Standard QFormer
  • Figure 3: Grounded QFormer (Ours)
  • Figure 5: Mutual KNN alignment scores across the layers of an LLM (flanT5-base) and a vision transformer (Eva-clip-g/14) are presented as a heat map. The x-axis represents the layer IDs of the vision transformer, while the y-axis represents the layer IDs of the LLM. The maximum score is highlighted with a red box.
  • Figure 6: Pretraining comparison, showing that our framework significantly improves pretraining efficiency.
  • ...and 3 more figures
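
The layer-by-layer heat map described in the Figure 5 caption can be illustrated with a small sketch. It uses one common definition of mutual k-nearest-neighbor alignment: for each sample, take its k nearest neighbors under cosine similarity in each representation, and average the fraction of neighbors the two representations share. Whether the paper uses exactly this variant is an assumption, as are the function names (`knn_indices`, `mutual_knn_score`, `alignment_heatmap`) and the default `k`.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors (cosine similarity) of each row, excluding itself."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)            # a sample is never its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn_score(feats_a, feats_b, k=5):
    """Average fraction of shared neighbors between two feature sets over the same samples."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

def alignment_heatmap(llm_layers, vit_layers, k=5):
    """Rows index LLM layers (y-axis), columns index vision transformer layers (x-axis)."""
    return np.array([[mutual_knn_score(l, v, k) for v in vit_layers]
                     for l in llm_layers])
```

Each element of `llm_layers` and `vit_layers` is assumed to be an array of shape `(num_samples, dim)` computed over the same paired image-text samples; the feature widths may differ between the two models. The highlighted maximum would then correspond to `np.unravel_index(heat.argmax(), heat.shape)` for a heat map `heat`.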