HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models

Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

TL;DR

HAWAII addresses the inefficiency of multi-expert vision-language models by distilling knowledge from multiple pretrained visual experts into a single vision encoder. It introduces a mixture-of-LoRA-adapters (MoLA) with teacher-specific and general adapters, guided by a hierarchical knowledge distillation (HKD) framework that operates at coarse- and fine-grained levels, including token importance scoring. The approach yields consistent performance gains over baselines across a broad set of vision-language benchmarks while maintaining low computational overhead. This work enables scalable, high-quality visual perception in VLMs and provides a practical pathway to integrate diverse visual knowledge with modest resource costs.

Abstract

Improving the visual understanding ability of vision-language models (VLMs) is crucial for enhancing their performance across various tasks. While using multiple pretrained visual experts has shown great promise, it often incurs significant computational costs during training and inference. To address this challenge, we propose HAWAII, a novel framework that distills knowledge from multiple visual experts into a single vision encoder, enabling it to inherit the complementary strengths of several experts with minimal computational overhead. To mitigate conflicts among different teachers and switch between different teacher-specific knowledge, instead of using a fixed set of adapters for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation (LoRA) adapters with a corresponding router. Each adapter is aligned with a specific teacher, avoiding noisy guidance during distillation. To enable efficient knowledge distillation, we propose fine-grained and coarse-grained distillation. At the fine-grained level, token importance scores are employed to emphasize the most informative tokens from each teacher adaptively. At the coarse-grained level, we summarize the knowledge from multiple teachers and transfer it to the student using a set of general-knowledge LoRA adapters with a router. Extensive experiments on various vision-language tasks demonstrate the superiority of HAWAII compared to popular open-source VLMs. The code is available at https://github.com/yimuwangcs/wise-hawaii.
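The adapter-plus-router idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `MixtureOfLoRA`, the soft (softmax) routing, the zero-initialised up-projection, and all dimensions are assumptions chosen for illustration. It shows a frozen weight matrix whose output is corrected by a router-weighted sum of low-rank updates, one per adapter.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    peak = max(z)
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRA:
    """Hypothetical mixture-of-LoRA layer: y = W x + sum_k g_k(x) * B_k (A_k x)."""

    def __init__(self, d_in, d_out, rank, n_adapters, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)  # frozen base weight
        # Down-projections A_k (rank x d_in), one per adapter.
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(n_adapters)]
        # Up-projections B_k (d_out x rank), zero-initialised as in standard LoRA,
        # so the layer starts out identical to the frozen base layer.
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_adapters)]
        # Assumption: a linear router followed by a softmax (soft routing).
        self.router = rand_matrix(n_adapters, d_in, rng)

    def gates(self, x):
        """Router weights over adapters for input x (non-negative, sum to 1)."""
        return softmax(matvec(self.router, x))

    def forward(self, x):
        y = matvec(self.W, x)
        for g, A, B in zip(self.gates(x), self.A, self.B):
            delta = matvec(B, matvec(A, x))  # low-rank update B_k A_k x
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y


layer = MixtureOfLoRA(d_in=4, d_out=3, rank=2, n_adapters=2)
x = [1.0, 0.5, -0.5, 2.0]
out = layer.forward(x)
```

In this sketch each adapter would be trained against one teacher, and the router decides how much of each adapter's update to apply for a given input. A hard (top-1) router would be an equally plausible reading of the abstract.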

Paper Structure

This paper contains 21 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The overall architecture of Hawaii. We use two teachers for simplicity. (a) MoLA consists of teacher-specific LoRA adapters (Teacher Adp.) and general-knowledge LoRA adapters (General Adp.), with two routers controlling the activation of the adapters. (b) Coarse-grained distillation first summarizes the knowledge from multiple teachers and then transfers it to the student encoder globally. "T1 Feat.", "T2 Feat.", and "Sum. T. Feat." represent the visual features $I^{T}_{*}$ generated by the different teachers and the summarized teacher features $I^{T}_{\textit{cg}}$. (c) In fine-grained distillation, teacher-specific LoRA adapters (T. Adp.) and token importance scoring (Figure 2) are employed to select and learn from the most informative tokens.
  • Figure 2: The calculation of the token importance score $s_{i}$. To focus on the most informative tokens, we consider the similarity between the teacher's features and the input instructions $T$.
  • Figure 3: Hawaii is able to perform vision-language understanding tasks, such as emotion understanding, OCR, spatial reasoning, attribute reasoning, and relation reasoning. The examples are from the following benchmarks: VQA$^{\text{Text}}$ [singh_towards_2019], MMBench [leonardis_mmbench_2025], and SeedBench [li_seed-bench_2024].
  • Figure 4: Comparison between Hawaii and MoVE-KD [movekd2025] on OCR and visual-semantic reasoning capabilities.
  • Figure 5: Visualization of the similarity score used in calculating the importance score, using Hawaii$^{\dag}$.
  • ...and 1 more figure
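
The token importance scoring mentioned in the Figure 2 caption can be sketched as follows. The captions do not give the exact formula, so everything here is an assumption for illustration: the instruction tokens are mean-pooled into one query vector, each teacher token is scored by its cosine similarity to that query, the scores are normalised with a softmax, and the distillation loss is a score-weighted squared error. The function names `token_importance` and `weighted_distill_loss` are hypothetical, and teacher features and instruction embeddings are assumed to share one dimension.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def token_importance(teacher_tokens, instruction_tokens):
    """Assumed scoring: softmax over cosine similarity to the pooled instruction."""
    dim = len(instruction_tokens[0])
    n = len(instruction_tokens)
    # Mean-pool the instruction embeddings into a single query vector.
    query = [sum(tok[j] for tok in instruction_tokens) / n for j in range(dim)]
    sims = [cosine(tok, query) for tok in teacher_tokens]
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]  # scores s_i, summing to 1


def weighted_distill_loss(student_tokens, teacher_tokens, scores):
    """Score-weighted squared error between student and teacher token features."""
    loss = 0.0
    for s_tok, t_tok, w in zip(student_tokens, teacher_tokens, scores):
        loss += w * sum((a - b) ** 2 for a, b in zip(s_tok, t_tok))
    return loss


teacher = [[1.0, 0.0], [0.0, 1.0]]   # two visual tokens from one teacher
instruction = [[1.0, 0.0]]           # one instruction token embedding
scores = token_importance(teacher, instruction)
student = [[0.0, 0.0], [0.0, 1.0]]   # student matches only the second token
loss = weighted_distill_loss(student, teacher, scores)
```

Here the first teacher token is aligned with the instruction, so it receives the larger score and the student's error on that token dominates the loss.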