Towards Efficient Visual-Language Alignment of the Q-Former for Visual Reasoning Tasks

Sungkyung Kim, Adam Lee, Junyoung Park, Andrew Chung, Jusang Oh, Jay-Yoon Lee

TL;DR

The paper studies efficient visual-language alignment by applying parameter-efficient fine-tuning (PEFT) to the Q-Former within InstructBLIP and analyzing sublayer importance with AdaLoRA. It shows that LoRA-based fine-tuning of the Q-Former can match full fine-tuning performance while using under 2% of the trainable parameters, and that jointly applying PEFT to both the Q-Former and the LLM yields further gains with under 12% of the trainable parameters. The AdaLoRA analysis indicates that self-attention layers are the most important for perceptual visual-language reasoning, while the contribution of FFN layers grows with task complexity, which offers practical guidance on allocating a parameter budget across sublayers. The code is available in the authors' GitHub repository.

Abstract

Recent advancements in large language models have demonstrated enhanced capabilities in visual reasoning tasks by employing additional encoders for aligning different modalities. While the Q-Former has been widely used as a general encoder for aligning several modalities, including image, video, audio, and 3D, with large language models, previous work on its efficient training and on the analysis of its individual components has been limited. In this work, we investigate the effectiveness of parameter-efficient fine-tuning (PEFT) of the Q-Former using InstructBLIP with the visual reasoning benchmarks ScienceQA and IconQA. We observe that applying PEFT to the Q-Former achieves performance comparable to full fine-tuning using under 2% of the trainable parameters. Additionally, we employ AdaLoRA for dynamic parameter budget reallocation to examine the relative importance of the Q-Former's sublayers on 4 different benchmarks. Our findings reveal that the self-attention layers are noticeably more important in perceptual visual-language reasoning tasks, and that the relative importance of the FFN layers depends on the complexity of the visual-language patterns involved in the tasks. The code is available at https://github.com/AttentionX/InstructBLIP_PEFT.
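
To give a feel for why low-rank adapters need so few trainable parameters, the following minimal sketch counts the parameters LoRA adds to a single dense projection. The 768 x 768 shape and the rank values are assumptions chosen for illustration, not settings taken from the paper, and the per-matrix fraction printed here is not the paper's reported overall figure.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns delta_W = B @ A with B of shape (d_out, rank)
    # and A of shape (rank, d_in); the original weight stays frozen.
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """LoRA parameters relative to fully fine-tuning the same matrix."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical layer shape (assumption): a square 768 x 768 projection.
hidden = 768
for rank in (1, 2, 4, 8):
    added = lora_param_count(hidden, hidden, rank)
    frac = trainable_fraction(hidden, hidden, rank)
    print(f"rank={rank}: {added} LoRA parameters, {frac:.2%} of the dense matrix")
```

For a square matrix the fraction reduces to 2r/d, so it grows linearly with the rank r and shrinks as the hidden size d grows.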

Paper Structure

This paper contains 11 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: The detailed structure of the Q-Former with AdaLoRA weight matrices (B, E, A).
  • Figure 2: Comparing the performance and number of trainable parameters using Flan-T5-XL and Vicuna-7B as base models on ScienceQA and IconQA benchmarks. This compares the best performing configurations (rank value and LoRA-applied sublayers) of Q-Former full fine-tuning with LLM PEFT, Q-Former PEFT with frozen LLM, and Q-Former PEFT with LLM PEFT, against InstructBLIP (Q-Former full fine-tuning with frozen LLM). "QF" denotes Q-Former. "FFT" denotes full fine-tuning. The complete results and the training architectures are in Appendix A.
  • Figure 3: Heatmaps of the rank distributions of the sublayers in the Q-Former. Cross-attention layers are present in odd-numbered layers only. Each value is the average of the component layers. The detailed heatmaps including additional benchmarks (Flickr30k, VizWiz) are in Appendix E.
  • Figure 4: Example ScienceQA instruction template
  • Figure 5: Example IconQA instruction template
  • ...and 5 more figures
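
The rank distributions in Figure 3 come from AdaLoRA's budget reallocation. The sketch below illustrates the general idea under simplifying assumptions: each sublayer starts with several rank-1 components, each component has an importance score, and only the highest-scoring components across all sublayers are kept within a global budget. The function name, sublayer names, and scores are made up for illustration; the actual importance scoring used by AdaLoRA and the paper's results are not reproduced here.

```python
from typing import Dict, List


def allocate_ranks(importance: Dict[str, List[float]], budget: int) -> Dict[str, int]:
    """Keep the `budget` highest-scoring rank-1 components across all sublayers.

    Returns the number of components (the effective rank) left per sublayer.
    """
    # Flatten to (score, sublayer) pairs so components compete globally.
    scored = [(score, name) for name, scores in importance.items() for score in scores]
    scored.sort(reverse=True)

    kept = {name: 0 for name in importance}
    for _score, name in scored[:budget]:
        kept[name] += 1
    return kept


# Hypothetical importance scores (assumption), four components per sublayer.
toy_scores = {
    "self_attention": [0.9, 0.8, 0.7, 0.2],
    "cross_attention": [0.6, 0.3, 0.1, 0.05],
    "ffn": [0.5, 0.4, 0.15, 0.01],
}

ranks = allocate_ranks(toy_scores, budget=6)
print(ranks)
```

A sublayer that ends up with a higher remaining rank is one whose components were judged more important, which is the kind of quantity a rank-distribution heatmap visualizes.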