
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Junfei Xiao, Zheng Xu, Alan Yuille, Shen Yan, Boyu Wang

TL;DR

The paper tackles the efficiency challenge of vision-language models by freezing vision encoders and LLMs and introducing a progressively aligned PaLM2-VAdapter as the cross-modal bridge. A tiny PaLM2 model (about 108M parameters) is trained in two stages, first as the LM decoder and then as the adapter with an additional 1-layer perceiver resampler. Compared with a strong perceiver resampler baseline, this yields faster convergence, higher performance, and stronger scalability, while using 30–70% fewer parameters than state-of-the-art large vision-language models. Extensive experiments on image/video captioning and VQA demonstrate state-of-the-art or competitive results across COCO, MSRVTT, VQAv2, VizWiz, OKVQA, and related benchmarks, with gains that scale as encoders and LLMs grow. The work highlights the importance of adapter training strategies for multi-modal alignment and suggests future directions around visual-to-language token quantization and broader modality integration.

Abstract

This paper demonstrates that a progressively aligned language model can effectively bridge frozen vision encoders and large language models (LLMs). While the fundamental architecture and pre-training methods of vision encoders and LLMs have been extensively studied, the architecture and training strategy of vision-language adapters vary significantly across recent works. Our research undertakes a thorough exploration of the state-of-the-art perceiver resampler architecture and builds a strong baseline. However, we observe that the vision-language alignment with perceiver resampler exhibits slow convergence and limited scalability with a lack of direct supervision. To address this issue, we propose PaLM2-VAdapter, employing a progressively aligned language model as the vision-language adapter. Compared to the strong baseline with perceiver resampler, our method empirically shows faster convergence, higher performance, and stronger scalability. Extensive experiments across various Visual Question Answering (VQA) and captioning tasks on both images and videos demonstrate that our model exhibits state-of-the-art visual understanding and multi-modal reasoning capabilities. Notably, our method achieves these advancements with 30~70% fewer parameters than the state-of-the-art large vision-language models, marking a significant efficiency improvement.

Paper Structure (27 sections, 5 figures, 13 tables)


Figures (5)

  • Figure 1: Faster, higher, and stronger. Our progressively aligned language model demonstrates faster convergence, higher performance and stronger scalability as an adapter for vision-language alignment.
  • Figure 2: Method overview. (a): The classic model framework for visual-language alignment, consisting of three major parts: a vision encoder, an adapter, and an LLM decoder. (b): The progressive alignment strategy of our PaLM2-VAdapter. (i) A tiny PaLM2 language model ($\sim$108M) is trained as the LM decoder in the first stage and (ii) then trained as the vision-language adapter (with an additional 1-layer perceiver resampler) for aligning the same vision encoder and a large PaLM2 decoder.
  • Figure 3: Qualitative examples of Visual Captioning. Left: Image captioning on the COCO dataset. Right: Video captioning on the MSRVTT dataset. PaLM2-VAdapter demonstrates strong visual understanding ability.
  • Figure 4: Qualitative examples of Visual Question Answering. Left: Image question answering on the VQAv2 dataset. Right: Video question answering on the MSVD-QA dataset.
  • Figure 5: Additional qualitative examples. Top left: Image captioning on the COCO dataset. Top right: Video captioning on the MSRVTT dataset. Bottom left: Image question answering on the VQAv2 dataset. Bottom right: Video question answering on the MSVD-QA dataset.
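
The Figure 2 caption mentions a 1-layer perceiver resampler placed in front of the tiny PaLM2 adapter. As a rough illustration of what such a layer generally does, the sketch below uses a fixed set of learnable latent queries that cross-attend to a variable number of visual tokens, so the output length is always the number of latents. This is a minimal single-head version under assumptions of our own: the sizes (`d`, `n_visual`, `n_latents`), the random weights, and the omission of layer norm and feed-forward sublayers are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def resample(visual_tokens, latents, w_q, w_k, w_v):
    """One single-head cross-attention layer: latents query the visual tokens.

    visual_tokens: (N, d) features from a frozen vision encoder (assumed shape).
    latents:       (M, d) learnable queries; M fixes the output length.
    Returns the updated latents (M, d) and the attention map (M, N).
    """
    q = latents @ w_q                          # (M, d)
    k = visual_tokens @ w_k                    # (N, d)
    v = visual_tokens @ w_v                    # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (M, N) scaled dot products
    attn = softmax(scores, axis=-1)            # each latent's weights sum to 1
    return latents + attn @ v, attn            # residual connection on the latents


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
d, n_visual, n_latents = 16, 50, 8
visual_tokens = rng.normal(size=(n_visual, d))
latents = rng.normal(size=(n_latents, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, attn = resample(visual_tokens, latents, w_q, w_k, w_v)
print(out.shape)  # (8, 16)
```

In the framework of Figure 2(a), an output like `out` would be passed on toward the language side; here it only shows that the token count handed downstream does not depend on how many visual tokens the encoder produced.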