Adapting LLaMA Decoder to Vision Transformer

Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

TL;DR

This paper rethinks vision backbone design by adapting the decoder-only LLaMA architecture to Vision Transformers. It addresses the training instability caused by causal self-attention with two techniques: post-sequence class token placement (PS [cls]) and a soft mask schedule that transitions from bidirectional to causal attention. The resulting iLLaMA achieves competitive ImageNet performance with a lightweight footprint (75.1% top-1 accuracy with 5.7M parameters) and, when scaled to ~310M parameters with ImageNet-21K pre-training, reaches 86.0% top-1 accuracy, surpassing several encoder-based and LM-inspired baselines. The work also reports practical properties such as strong calibration, notable shape-texture bias, and robust 8-bit quantization compatibility, while keeping the throughput benefits of an isotropic architecture. Overall, the study offers a principled pathway toward unified, decoder-based architectures for vision and potentially multimodal tasks, and invites further exploration of LLM-inspired designs in vision.

Abstract

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to computer vision. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, so that network training fails. To overcome this challenge, we propose a post-sequence class token technique that repositions the class token behind the image tokens, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: shape-texture bias, calibration, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual architectures in the wave of LLMs and inspire the development of unified multimodal models. Pre-trained models and code are available at https://github.com/techmonsterwang/iLLaMA.
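One way to see why the position of the class token matters under a causal mask is to count how many image tokens it is allowed to attend to. The sketch below is an illustration only, not the authors' code: the helper names `causal_mask` and `visible_patches` and the toy sequence length are assumptions made for this example, and the "token i sees token j only if j <= i" convention is the usual decoder convention rather than something stated in the abstract.

```python
def causal_mask(n):
    """Lower-triangular mask: token i may attend to token j only if j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def visible_patches(mask, cls_index, patch_indices):
    """Count how many patch tokens the class token is allowed to attend to."""
    return sum(mask[cls_index][j] for j in patch_indices)


num_patches = 4          # toy value chosen for readability (hypothetical)
n = num_patches + 1      # patch tokens plus one class token
mask = causal_mask(n)

# ViT-style layout: [cls] at index 0, patches at indices 1..n-1.
pre = visible_patches(mask, 0, range(1, n))

# Post-sequence layout (PS [cls]): patches at 0..n-2, [cls] at index n-1.
post = visible_patches(mask, n - 1, range(0, n - 1))

print(pre, post)  # 0 4
```

With the class token first, the causal mask hides every patch from it; with the class token last, all patches are visible, which is consistent with the abstract's statement that the post-sequence placement lets causal self-attention capture the entire image's information.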

Paper Structure

This paper contains 35 sections, 2 equations, 11 figures, 18 tables, 1 algorithm.

Figures (11)

  • Figure 1: Left: iLLaMA architecture. Right: our design roadmap. Colored and gray bars represent the results of the tiny and base regimes, with the red line depicting the training loss of the tiny regime. iLLaMA strives to process visual tokens using standard LLaMA components, e.g., causal self-attention. The proposed PS [cls] and soft mask strategy help overcome training challenges. Block details of ViT (Dosovitskiy et al., 2020), VisionLLaMA (Chu et al., 2024), and our iLLaMA are compared in Figure 5 in Appendix \ref{sec:8.1}.
  • Figure 2: (a) Mask in causal self-attention. (b) Mask in causal self-attention with our post-sequence class token (PS [cls]) method. (c) Modified causal mask. Their ablation results are shown in Table \ref{tab:abl_post}.
  • Figure 3: (a) Soft mask gradually transitions from a bi-directional mask into a causal mask during training through a constant or linear schedule. (b) Ablation results of training loss and test accuracy.
  • Figure 4: Rank analysis of the attention map in head 1, layer 1 of the pretrained ViT-T and iLLaMA-T with $N=197$. The rank difference between them is about 48.
  • Figure 5: Comparison between ViT (Dosovitskiy et al., 2020), VisionLLaMA (Chu et al., 2024), and iLLaMA blocks.
  • ...and 6 more figures
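
The soft mask described in the Figure 3 caption can be sketched as a blend between a bidirectional (all-ones) mask and a causal mask, controlled by a coefficient that decays during early training. Everything specific here is an assumption for illustration: the names `soft_mask`, `alpha_at`, and `cutoff_epochs`, the linear-interpolation form of the blend, and the reading of the "constant" schedule as holding the bidirectional mask until a cutoff and then switching to causal. How the mask is combined with the attention scores is not stated in this summary and is left out.

```python
def soft_mask(n, alpha):
    """Blend alpha * all-ones + (1 - alpha) * causal.

    Entries at or below the diagonal stay 1.0; entries above it equal alpha,
    so alpha = 1.0 is fully bidirectional and alpha = 0.0 is fully causal.
    """
    return [[1.0 if j <= i else alpha for j in range(n)] for i in range(n)]


def alpha_at(epoch, cutoff_epochs, schedule="linear"):
    """Blend coefficient at a given epoch (hypothetical schedule definitions)."""
    if epoch >= cutoff_epochs:
        return 0.0                      # fully causal after the cutoff
    if schedule == "linear":
        return 1.0 - epoch / cutoff_epochs
    if schedule == "constant":
        return 1.0                      # stay bidirectional until the cutoff
    raise ValueError(f"unknown schedule: {schedule}")


# Toy usage: a 3-token sequence halfway through a 10-epoch linear schedule.
for row in soft_mask(3, alpha_at(5, 10)):
    print(row)
```

Under these assumptions, the upper-triangular entries shrink from 1 to 0 as training proceeds, so the model starts with full bidirectional attention and ends with a strictly causal mask.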