Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT

Zhenxiang Xiao, Yuzhong Chen, Lu Zhang, Junjie Yao, Zihao Wu, Xiaowei Yu, Yi Pan, Lin Zhao, Chong Ma, Xinyu Liu, Wei Liu, Xiang Li, Yixuan Yuan, Dinggang Shen, Dajiang Zhu, Tianming Liu, Xi Jiang

TL;DR

Instruction-ViT integrates multi-modal prompts (text and image) into a ViT backbone to perform instruction-tuned visual classification. By jointly fine-tuning prompt tokens and image/text features, and by using a cosine-based alignment loss alongside standard cross-entropy, the approach achieves strong fine-tuning performance and improved adaptability across datasets. The work demonstrates the viability of transferring instruction-tuning concepts from NLP into vision via prompts, with mixed-modal prompts offering robustness across tasks. It also provides a practical training strategy that reduces computation during inference by selective prompt sampling. Overall, the method advances efficient, flexible visual instruction learning with ViTs and CLIP-based prompts.

Abstract

Prompts have been proven to play a crucial role in large language models, and in recent years, vision models have also been using prompts to improve scalability for multiple downstream tasks. In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification, which we call Instruction-ViT. The key idea is to implement multi-modal prompts (text or image prompts) related to category information to guide the fine-tuning of the model. In experiments on several image captioning tasks, performance and domain adaptability improved. Our work provides an innovative strategy for fusing multi-modal prompts, with better performance and faster adaptability for visual classification models.

Paper Structure (19 sections, 6 equations, 2 figures, 2 tables, 1 algorithm)

This paper contains 19 sections, 6 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: The overall framework of Instruction-ViT. For each input image, the corresponding latent text or visual features serve as prompts, and the Transformer's attention mechanism combines the features of the input image and the prompts. The CLS token is used for the downstream classification task, and similarity scores computed between the CLS token and the prompt tokens assist in fine-tuning the model. During training, the pink modules are fine-tuned and the navy blue modules are kept frozen.
  • Figure 2: How prompts are selected during validation. For an input image from the validation set, the zero-shot CLIP model extracts features for the image and for each candidate class, and their similarity scores are computed. The K prompt tokens with the highest similarity, together with the average of the remaining N-K prompt tokens, are passed to the next module.
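
The selection step described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `cosine_similarity` and `select_prompts` are hypothetical, features are plain lists of floats instead of CLIP outputs, and the sketch assumes the same vectors are used both for scoring and as the prompt tokens that get passed on.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def select_prompts(image_feature, prompt_features, k):
    """Keep the k prompts most similar to the image, plus the mean of the rest.

    Returns (selected, top_indices). `selected` holds the k chosen vectors,
    followed by one averaged vector when any of the N prompts are left over.
    """
    n = len(prompt_features)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= number of prompts")

    # Score every candidate prompt against the image feature.
    scores = [cosine_similarity(image_feature, p) for p in prompt_features]

    # Rank prompt indices from most to least similar.
    order = sorted(range(n), key=lambda i: -scores[i])
    top, rest = order[:k], order[k:]

    selected = [list(prompt_features[i]) for i in top]
    if rest:
        # Collapse the remaining N-K prompts into a single averaged token.
        dim = len(image_feature)
        mean = [sum(prompt_features[i][d] for i in rest) / len(rest)
                for d in range(dim)]
        selected.append(mean)
    return selected, top


# Toy usage with 2-dimensional features and N = 4 candidate prompts.
image = [1.0, 0.0]
prompts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
selected, top = select_prompts(image, prompts, k=2)
print(top)       # [0, 2]
print(selected)  # [[1.0, 0.0], [1.0, 1.0], [-0.5, 0.5]]
```

Under these assumptions the next module receives K+1 prompt tokens instead of N, which is where a saving from selective prompt sampling would come from.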