HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Wenqiao Zhang, Tianwei Lin, Jiang Liu, Fangxun Shu, Haoyuan Li, Lei Zhang, He Wanggui, Hao Zhou, Zheqi Lv, Hao Jiang, Juncheng Li, Siliang Tang, Yueting Zhuang

TL;DR

HyperLLaVA adaptively tunes the projector and LLM parameters using a dynamic visual expert and a dynamic language expert, respectively. Both experts are derived from HyperNetworks and enable dynamic projector and LLM modeling in two-stage training.

Abstract

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, e.g., LLaVA, transforms visual features into text-like tokens using a static vision-language mapper, thereby enabling static LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the static tuning strategy (static tuning refers to a trained model with static parameters), which shares the same parameters, may constrain performance across different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts through visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench. Our project is available at https://github.com/DCDmllm/HyperLLaVA.
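
The idea of a HyperNetwork producing an adaptive parameter shift for an otherwise static projector can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the dimensions, the mean-pooled visual guidance, the two-layer hypernetwork, and the names `weight_shift` and `dynamic_project` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (assumption, not from the paper).
D_VIS, D_LLM, D_HID = 8, 6, 4

# Static projector weight: shared by every input, as in a static mapper.
W_static = rng.normal(scale=0.1, size=(D_VIS, D_LLM))

# Hypothetical hypernetwork: a small MLP that maps a guidance vector
# to a flattened weight shift for the projector.
H1 = rng.normal(scale=0.1, size=(D_VIS, D_HID))
H2 = rng.normal(scale=0.1, size=(D_HID, D_VIS * D_LLM))


def weight_shift(guidance):
    """Map a (D_VIS,) guidance vector to a (D_VIS, D_LLM) weight shift."""
    hidden = np.tanh(guidance @ H1)
    return (hidden @ H2).reshape(D_VIS, D_LLM)


def dynamic_project(image_features):
    """Project (num_patches, D_VIS) features with input-dependent weights."""
    # Assumption: visual guidance is the mean over patch features.
    guidance = image_features.mean(axis=0)
    W_dynamic = W_static + weight_shift(guidance)
    return image_features @ W_dynamic


img_a = rng.normal(size=(5, D_VIS))
img_b = rng.normal(size=(5, D_VIS))
tokens_a = dynamic_project(img_a)
tokens_b = dynamic_project(img_b)
print(tokens_a.shape, tokens_b.shape)
```

In this sketch the two images are projected with different effective weights, whereas a static projector would apply `W_static` to both.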

Paper Structure (14 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: (a) gives an overview of LLaVA. (b) shows a simplified version of our HyperLLaVA. (c) shows that, compared to LLaVA, our method achieves superior performance across different MLLM benchmarks.
  • Figure 2: Overview of the proposed HyperLLaVA. (a) shows how the proposed visual expert assists the static projector in dynamically converting image features into adaptive visual tokens, yielding an augmented visual expression for subsequent instruction tuning. (b) shows the proposed language-expert-integrated tuning, which uses the output of an intermediate layer as language guidance to generate dynamic instruction-specific features, increasing the flexibility for processing different multimodal tasks.
  • Figure 3: Selected blocks for language guidance.
  • Figure 4: Performance with respect to different input and downsampling dimensions in the expert.
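
The language-expert idea in the Figure 2(b) caption, where an intermediate layer's output serves as guidance for generating instruction-specific features, can be sketched in the same toy style. Everything below is an assumption for illustration rather than the paper's architecture: the adapter-style residual branch, the mean-pooled guidance, the `tanh` nonlinearity, the name `language_expert`, and the sizes `D_IN` and `D_DOWN`, which loosely stand in for the input and downsampling dimensions mentioned in the Figure 4 caption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumptions): model width, expert input dim, downsampling dim.
D_MODEL, D_IN, D_DOWN = 8, 4, 2

# Hypothetical hypernetwork parameters.
G = rng.normal(scale=0.5, size=(D_MODEL, D_IN))             # compress guidance
H_down = rng.normal(scale=0.5, size=(D_IN, D_MODEL * D_DOWN))
H_up = rng.normal(scale=0.5, size=(D_IN, D_DOWN * D_MODEL))


def language_expert(hidden, guidance_states):
    """Add an instruction-specific residual to (seq_len, D_MODEL) hidden states.

    guidance_states: (seq_len, D_MODEL) output of an earlier (intermediate)
    layer, used here as language guidance.
    """
    # Pool the guidance and compress it to the expert's input dimension.
    z = np.tanh(guidance_states.mean(axis=0) @ G)
    # Generate the adapter weights from the guidance.
    W_down = (z @ H_down).reshape(D_MODEL, D_DOWN)
    W_up = (z @ H_up).reshape(D_DOWN, D_MODEL)
    # Down-project, apply a nonlinearity, up-project, add residually.
    delta = np.tanh(hidden @ W_down) @ W_up
    return hidden + delta


hidden = rng.normal(size=(3, D_MODEL))
guide_1 = rng.normal(size=(3, D_MODEL))
guide_2 = rng.normal(size=(3, D_MODEL))
out_1 = language_expert(hidden, guide_1)
out_2 = language_expert(hidden, guide_2)
print(out_1.shape, out_2.shape)
```

Here the same hidden states receive different residual updates depending on the guidance, which is the sense in which the generated features are instruction-specific in this sketch.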