Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT
Zhenxiang Xiao, Yuzhong Chen, Lu Zhang, Junjie Yao, Zihao Wu, Xiaowei Yu, Yi Pan, Lin Zhao, Chong Ma, Xinyu Liu, Wei Liu, Xiang Li, Yixuan Yuan, Dinggang Shen, Dajiang Zhu, Tianming Liu, Xi Jiang
TL;DR
Instruction-ViT integrates multi-modal prompts (text and image) into a ViT backbone to perform instruction-tuned visual classification. By jointly fine-tuning prompt tokens and image/text features, and by using a cosine-based alignment loss alongside standard cross-entropy, the approach achieves strong fine-tuning performance and improved adaptability across datasets. The work demonstrates the viability of transferring instruction-tuning concepts from NLP into vision via prompts, with mixed-modal prompts offering robustness across tasks. It also provides a practical training strategy that reduces computation during inference by selective prompt sampling. Overall, the method advances efficient, flexible visual instruction learning with ViTs and CLIP-based prompts.
Abstract
Prompts have been shown to play a crucial role in large language models, and in recent years vision models have also used prompts to improve scalability across multiple downstream tasks. In this paper, we adapt prompt design based on instruction tuning to a vision transformer model for image classification, which we call Instruction-ViT. The key idea is to use multi-modal prompts (text or image prompts) related to category information to guide the fine-tuning of the model. In experiments on several image captioning tasks, both performance and domain adaptability improved. Our work provides an innovative strategy for fusing multi-modal prompts, yielding better performance and faster adaptation for visual classification models.
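The TL;DR mentions a cosine-based alignment loss used alongside standard cross-entropy. The following is a minimal sketch of how such a combined objective could look, not the authors' implementation. The function names, the `weight` parameter, and the specific form `1 - cosine similarity` for the alignment term are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` for the true class `label`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero

def combined_loss(logits, cls_feature, prompt_features, label, weight=1.0):
    """Hypothetical objective: classification loss plus prompt alignment.

    logits          -- classifier scores, one per class
    cls_feature     -- image feature from the backbone (assumed name)
    prompt_features -- one feature vector per class prompt (assumed layout)
    weight          -- assumed balancing factor, not taken from the paper
    """
    ce = softmax_cross_entropy(logits, label)
    # Alignment term: pull the image feature toward its class prompt feature.
    align = 1.0 - cosine_similarity(cls_feature, prompt_features[label])
    return ce + weight * align

# Toy usage: two classes with orthogonal prompt features.
prompts = [[1.0, 0.0], [0.0, 1.0]]
aligned = combined_loss([0.0, 0.0], [1.0, 0.0], prompts, label=0)
misaligned = combined_loss([0.0, 0.0], [1.0, 0.0], prompts, label=1)
```

In this toy case the image feature matches the prompt for class 0, so the alignment term vanishes for `label=0` and adds a penalty of about 1 for `label=1`, while the cross-entropy term is identical in both calls.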
