VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks

Xiangxiang Chu; Jianlin Su; Bo Zhang; Chunhua Shen

TL;DR

VisionLLaMA is a LLaMA-style vision transformer offered as both a plain and a pyramid backbone, with auto-scaled 2D RoPE (AS2DRoPE) to support arbitrary input resolutions. It performs strongly across image generation, classification, segmentation, and detection, often outperforming strong baselines and converging faster. The work shows that extending RoPE to 2D and interpolating positions across resolutions lets one model generalize beyond its training resolution and transfer across tasks, including self-supervised pre-training with MAE. The code is publicly released alongside extensive experiments, positioning VisionLLaMA as a versatile backbone for vision generation and understanding.
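To make the idea of "extending RoPE to 2D and interpolating across resolutions" concrete, here is a minimal pure-Python sketch. It rests on several assumptions that are not stated in this summary: half of the feature vector encodes the row and the other half the column, consecutive pairs of features are rotated together, the frequency base is 10000, and the anchor grid size is 14. The function names (`rope_angles`, `rotate`, `rope_2d`) are hypothetical and the released implementation may differ.

```python
import math

def rope_angles(pos, dim, base=10000.0):
    """Standard 1D RoPE angles: pos * base^(-2k/dim) for k = 0..dim/2-1."""
    return [pos * base ** (-2.0 * k / dim) for k in range(dim // 2)]

def rotate(vec, angles):
    """Rotate each consecutive pair (vec[2k], vec[2k+1]) by angles[k]."""
    out = []
    for k, theta in enumerate(angles):
        a, b = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_2d(vec, row, col, anchor=14, size=14):
    """Assumed 2D RoPE layout: first half of vec encodes the row, second half the column.

    Coordinates are multiplied by anchor/size, so a grid larger than the
    anchor grid is mapped back into the anchor coordinate range.
    len(vec) must be divisible by 4.
    """
    half = len(vec) // 2
    scale = anchor / size
    return (rotate(vec[:half], rope_angles(row * scale, half))
            + rotate(vec[half:], rope_angles(col * scale, half)))

# Usage: token at (row=2, col=4) on a 28x28 grid, rescaled to a 14x14 anchor.
q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
q_rot = rope_2d(q, 2, 4, anchor=14, size=28)
```

Under these assumptions, the rotation preserves vector norms and the query-key dot product depends only on the row and column offsets between two tokens, which is the property that makes rescaling coordinates a plausible way to handle unseen resolutions.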

Abstract

Large language models are built on top of a transformer-based architecture to process textual inputs. For example, LLaMA stands out among many open-source implementations. Can the same transformer be used to process 2D images? In this paper, we answer this question by unveiling a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored for this purpose. VisionLLaMA is a unified and generic modelling framework for solving most vision tasks. We extensively evaluate its effectiveness using typical pre-training paradigms in a good portion of downstream tasks of image perception and especially image generation. In many cases, VisionLLaMA has exhibited substantial gains over the previous state-of-the-art vision transformers. We believe that VisionLLaMA can serve as a strong new baseline model for vision generation and understanding. Our code is released at https://github.com/Meituan-AutoML/VisionLLaMA.

Paper Structure (42 sections, 7 equations, 5 figures, 18 tables)

This paper contains 42 sections, 7 equations, 5 figures, 18 tables.

Introduction
Related Work
Method
Plain Transformer
Pyramid Transformer
Training or Inference Beyond the Sequence Length
Experiments
Image Generation
Classification on ImageNet
Supervised Training
Self-Supervised Training
Semantic Segmentation on ADE20K
Supervised Training
Self-Supervised Training
Object Detection on COCO
...and 27 more sections

Figures (5)

Figure 1: Generated images by DiT-LLaMA-XL of resolution (256, 256) with CFG.
Figure 2: VisionLLaMA block (a) in plain Transformer and (b) in pyramid Transformer.
Figure 3: Faster convergence of VisionLLaMA using the setting of DeiT3.
Figure 4: Loss curve of MAE pre-training on VisionLLaMA compared with ViT-B.
Figure 5: Position calibration for GSA's keys using a simple case of $4\times4$ resolution and a kernel size of $2\times2$. The positions of the four points (abstraction keys) are (0.5, 0.5), (0.5, 2.5), (2.5, 0.5), (2.5, 2.5).
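The position calibration in the Figure 5 caption can be sketched in a few lines of Python. This sketch assumes that each abstraction key comes from a non-overlapping pooling window (stride equal to kernel size) and is assigned the mean 0-indexed coordinate of that window. The function name `pooled_key_positions` is hypothetical and is not taken from the released code.

```python
def pooled_key_positions(height, width, kernel):
    """Return (row, col) centers of non-overlapping kernel x kernel windows.

    Assumes height and width are divisible by kernel and that
    coordinates are 0-indexed, so a window covering rows r..r+kernel-1
    has its center at r + (kernel - 1) / 2.
    """
    offset = (kernel - 1) / 2  # distance from window corner to window center
    positions = []
    for top in range(0, height, kernel):
        for left in range(0, width, kernel):
            positions.append((top + offset, left + offset))
    return positions

# Usage: the 4x4 grid with a 2x2 kernel from the caption yields four keys.
centers = pooled_key_positions(4, 4, 2)
```

Giving each pooled key a fractional position, rather than an integer index on the coarser grid, keeps the keys in the same coordinate system as the full-resolution queries, which is what a relative position encoding such as RoPE needs.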
