Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

TL;DR

This work addresses the challenge of deploying capable vision-language models on mobile devices by recognizing that cross-modal alignment knowledge is underrepresented in prior knowledge-distillation methods. It introduces Align-KD, a lightweight KD approach that distills cross-modal alignment from the first-layer text-query-vision attention and enhances vision token representations based on the text's focus, guided by a strong 7B teacher. The method yields consistent improvements for the MobileVLM V2 1.7B student across six benchmarks under two data subsets, achieving around a 2.0-point average gain and notable gains on specific tasks such as SQA and GQA, while remaining feasible under resource-constrained training. The contribution offers a practical path to stronger edge VLMs without training massive models, enabling broader offline deployment and privacy-preserving AI applications.

Abstract

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, there is a growing need for capable artificial intelligence on mobile devices, for example in AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes increasingly difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most existing distillation techniques for large models only consider single-modal LLMs, or only use teachers to create new data environments for students. None of these methods takes into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieves an average score improvement of 2.0 across 6 benchmarks under each of two training subsets. Code is available at: https://github.com/fqhank/Align-KD.

Paper Structure

This paper contains 17 sections, 7 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Radar plot of the MobileVLM V2 1.7B model's performance with the Align-KD policy under different settings. Long and Short refer to different training sub-datasets, and MVLM2 refers to the MobileVLM V2 model.
  • Figure 2: Exploration of how features change across layers of the MobileVLM V2 model family. (a) Cosine similarity of features from every two adjacent layers. (b) Cosine similarity of features at the original vision and text embedding positions within the same layer. The data is presented in orders of magnitude to highlight the trend of change. (c) Normalized Euclidean distance of features at the original vision and text embedding positions within the same layer.
  • Figure 3: Left: The overall framework of Align-KD. Align-KD uses the text-query-vision attention of the teacher model's first layer to extract cross-modal alignment knowledge, then injects this knowledge into the cross-modal attention matrix of the student's first layer. In addition, the projected vision tokens are dynamically enhanced according to the text's focus, also based on the teacher's first-layer cross-modal attention. Right: Schematic diagram of the first-layer attention matrix of vision-language models (VLMs). $A_{v-v}$, $A_{t-v}$, and $A_{t-t}$ denote vision-query-vision, text-query-vision, and text-query-text attention, respectively.
  • Figure 4: Different text prompts cause different attention on the vision tokens of the picture. The vision tokens with high attention activation are also sparsely distributed.
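
The attention blocks named in the Figure 3 caption can be made concrete with a small sketch. The code below is illustrative only and is not taken from the Align-KD repository. It assumes a sequence layout with all vision tokens first and all text tokens after them, so that $A_{t-v}$ is the lower-left block of a causal attention matrix. The mean-squared-error loss, the averaging over heads (used here because teacher and student may have different head counts), and every function name are hypothetical choices made for the example; the actual loss used by the method is not stated in this section.

```python
import numpy as np


def causal_softmax_attention(scores):
    """Row-wise softmax under a causal mask. scores: (heads, seq, seq)."""
    seq = scores.shape[-1]
    mask = np.tril(np.ones((seq, seq), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability (the diagonal is never masked).
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def text_query_vision_block(attn, num_vision):
    """Slice A_{t-v}: text tokens as queries (rows), vision tokens as keys (columns).

    Assumes vision tokens occupy positions [0, num_vision) and text tokens follow.
    """
    return attn[:, num_vision:, :num_vision]


def alignment_distill_loss(teacher_attn, student_attn, num_vision):
    """Hypothetical loss: MSE between head-averaged A_{t-v} of teacher and student."""
    t = text_query_vision_block(teacher_attn, num_vision).mean(axis=0)
    s = text_query_vision_block(student_attn, num_vision).mean(axis=0)
    return float(np.mean((t - s) ** 2))


def vision_token_focus(attn, num_vision):
    """Per-vision-token score: how much attention the text queries give each vision token."""
    block = text_query_vision_block(attn, num_vision).mean(axis=0)  # (text, vision)
    return block.mean(axis=0)  # (vision,)


# Toy first-layer attention maps with random scores (stand-ins for real model outputs).
rng = np.random.default_rng(0)
num_vision, num_text = 6, 4
seq = num_vision + num_text
teacher_attn = causal_softmax_attention(rng.normal(size=(8, seq, seq)))  # 8 heads
student_attn = causal_softmax_attention(rng.normal(size=(4, seq, seq)))  # 4 heads

loss = alignment_distill_loss(teacher_attn, student_attn, num_vision)
focus = vision_token_focus(teacher_attn, num_vision)
```

Here `loss` is a scalar that would be added to the student's training objective, and `focus` gives one weight per vision token. Under the assumptions above, a score like `focus` is one way to express "the text's focus" from the Figure 3 and Figure 4 captions, since it is large only for the few vision tokens that the prompt attends to.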