Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning
Vedanshu, MM Tripathi, Bhavnesh Jaint
TL;DR
This work tackles the efficiency barrier in vision-language large language models (LLMs) by proposing Bottleneck Adapter (BA) modules and Multimodal Model Tuning (MMT), which together enable end-to-end training with a small parameter budget. The BA uses a group-wise dense projection of the form $g(Z)=Z+\psi_{u}(\psi_{d}(Z))$, where $\psi_{d}$ and $\psi_{u}$ are the down- and up-projections, giving a lightweight yet effective multimodal integration; following the LaVIN design, it is placed before the Transformer blocks. The model combines LLaMA-2 as the language backbone with a DIHT-ViT image encoder, using a visual neck of dimension $128$ and an adapter dimension of $16$. Visual features $V$ are mapped into the language space by $V' = \rho(VW_v + b_v)W_t + b_t$, and the multimodal input $Z=[u_m, V', T]$ is used for next-token prediction, $p_t = \prod_{s=1}^{S+1} p(Q_s \mid Z, Q_{0:s-1}; \theta_m, \theta_r)$. Empirically, the approach achieves 90.12% average accuracy on ScienceQA, approaching or surpassing human performance in several categories, while reducing memory use and enabling training on a single GPU. Ablations identify the 16-dimensional, 2-group BA as a sweet spot and show that dynamic weight factors do not benefit multimodal fine-tuning. Overall, the method offers a practical, scalable path to efficient vision-language LLMs with competitive multimodal reasoning and applicability to both text-only and image-text tasks.
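To make the adapter formula $g(Z)=Z+\psi_{u}(\psi_{d}(Z))$ concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes that both $\psi_{d}$ and $\psi_{u}$ are group-wise affine maps (the summary does not say whether grouping applies to one or both projections) and that the up-projection is zero-initialized so the adapter starts as the identity. The names `grouped_linear`, `init_grouped`, `BottleneckAdapter`, and `adapter_param_count` are hypothetical helpers introduced only for illustration.

```python
import random


def grouped_linear(x, weights, biases):
    """Group-wise dense projection of one token vector.

    x is split into len(weights) equal chunks; chunk g is multiplied by its
    own matrix weights[g] (in_per_group x out_per_group), biases[g] is added,
    and the per-group outputs are concatenated.
    """
    groups = len(weights)
    in_per_group = len(x) // groups
    out = []
    for g in range(groups):
        chunk = x[g * in_per_group:(g + 1) * in_per_group]
        for j, bias in enumerate(biases[g]):
            out.append(bias + sum(c * row[j] for c, row in zip(chunk, weights[g])))
    return out


def init_grouped(in_dim, out_dim, groups, scale, rng):
    """Create per-group weights (Gaussian, or zeros if scale == 0) and zero biases."""
    assert in_dim % groups == 0 and out_dim % groups == 0
    ipg, opg = in_dim // groups, out_dim // groups
    weights = [[[rng.gauss(0.0, scale) if scale else 0.0 for _ in range(opg)]
                for _ in range(ipg)] for _ in range(groups)]
    biases = [[0.0] * opg for _ in range(groups)]
    return weights, biases


class BottleneckAdapter:
    """g(Z) = Z + psi_u(psi_d(Z)), with no nonlinearity, as in the formula."""

    def __init__(self, hidden_dim, adapter_dim=16, groups=2, seed=0):
        rng = random.Random(seed)
        # psi_d: hidden_dim -> adapter_dim (grouping here is an assumption)
        self.down = init_grouped(hidden_dim, adapter_dim, groups, 0.02, rng)
        # psi_u: adapter_dim -> hidden_dim, zero-initialized (an assumption)
        self.up = init_grouped(adapter_dim, hidden_dim, groups, 0.0, rng)

    def __call__(self, tokens):
        out = []
        for z in tokens:
            delta = grouped_linear(grouped_linear(z, *self.down), *self.up)
            out.append([zi + di for zi, di in zip(z, delta)])  # residual add
        return out


def adapter_param_count(hidden_dim, adapter_dim=16, groups=2):
    """Weights plus biases of both projections under the grouping assumed above."""
    down = hidden_dim * adapter_dim // groups + adapter_dim
    up = adapter_dim * hidden_dim // groups + hidden_dim
    return down + up
```

Under these assumptions, `adapter_param_count(4096, 16, 2)` gives 69,648 parameters per adapter, versus 135,184 for an ungrouped adapter (`groups=1`). The hidden size of 4096 is that of the 7B LLaMA-2 model and is used here only as an illustration.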
Abstract
The integration of large language models (LLMs) with vision-language (VL) tasks has been a transformative development in artificial intelligence, highlighting the potential of LLMs as versatile general-purpose chatbots. The current trend in this evolution focuses on integrating vision and language to create models that can operate in more diverse, real-world contexts. We present a novel approach, termed Bottleneck Adapter, crafted to enhance the multimodal capabilities of these models; it enables joint optimization of the entire multimodal LLM framework through a process known as Multimodal Model Tuning (MMT). Our approach uses lightweight adapters to connect the image encoder and the LLM, without the need for large, complex neural networks. Unlike conventional modular training schemes, it adopts an end-to-end optimization regime which, combined with the adapters, allows joint optimization with a significantly smaller parameter set. Our method reaches 90.12% accuracy on ScienceQA, outperforming both human-level performance (88.4%) and LaVIN-7B (89.41%).
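The lightweight connection between the image encoder and the LLM can be sketched from the formulas in the TL;DR, $V' = \rho(VW_v + b_v)W_t + b_t$ and $Z=[u_m, V', T]$. The code below is a toy illustration in pure Python, not the authors' code. The activation $\rho$ is not specified in this summary, so SiLU is used purely as a placeholder assumption, and the function names `linear`, `visual_to_language`, and `build_multimodal_input` are hypothetical.

```python
import math


def linear(x, w, b):
    """Row-wise affine map: x (n x d_in) times w (d_in x d_out), plus bias b."""
    return [[b[j] + sum(row[i] * w[i][j] for i in range(len(row)))
             for j in range(len(b))] for row in x]


def silu(v):
    """Placeholder activation for rho (an assumption, not from the source)."""
    return v / (1.0 + math.exp(-v))


def visual_to_language(V, W_v, b_v, W_t, b_t, rho=silu):
    """V' = rho(V W_v + b_v) W_t + b_t.

    W_v maps image features to the visual neck; W_t maps the neck to the
    language model's embedding dimension.
    """
    neck = [[rho(h) for h in row] for row in linear(V, W_v, b_v)]
    return linear(neck, W_t, b_t)


def build_multimodal_input(u_m, V_prime, T):
    """Z = [u_m, V', T]: one modality token, then visual tokens, then text tokens."""
    return [u_m] + V_prime + T
```

For example, `build_multimodal_input(u_m, visual_to_language(V, W_v, b_v, W_t, b_t), T)` yields a sequence of length `1 + len(V) + len(T)` that would be fed to the language model for next-token prediction.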
