Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Hao Li; Jinguo Zhu; Xiaohu Jiang; Xizhou Zhu; Hongsheng Li; Chun Yuan; Xiaohua Wang; Yu Qiao; Xiaogang Wang; Wenhai Wang; Jifeng Dai

TL;DR

Uni-Perceiver v2 is a generalist architecture that jointly handles major vision and vision-language tasks without task-specific fine-tuning. Images are encoded as general region proposals and texts with a pre-trained language model; both are then processed by a shared decoder under a unified maximum likelihood objective. To stabilize multi-task learning, the model uses an unmixed sampling strategy and MT-AdamW with Conditional MoEs, and it achieves competitive or state-of-the-art results among generalist models across classification, detection, segmentation, captioning, and retrieval. The work narrows the gap between generalist and task-specific models and demonstrates strong cross-task generalization with publicly trained components. One limitation is the lack of verified image generation results, attributed to computational constraints.
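The summary above does not spell out the form of the unified maximum likelihood objective. As a minimal sketch, assuming (this is not confirmed by this page) that the likelihood of a target is a softmax over temperature-scaled cosine similarities between an input representation and a set of candidate target representations, the objective could look as follows. The function names, the toy vectors, and the temperature value are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def candidate_log_likelihoods(input_repr, candidate_reprs, temperature=0.1):
    """Log-softmax over similarity scores (assumed form of the likelihood)."""
    scores = [cosine(input_repr, c) / temperature for c in candidate_reprs]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def negative_log_likelihood(input_repr, candidate_reprs, target_index,
                            temperature=0.1):
    """Training loss for one example: -log P(target | input)."""
    lls = candidate_log_likelihoods(input_repr, candidate_reprs, temperature)
    return -lls[target_index]

# Toy example: one image representation scored against two class-name
# representations (values are made up for illustration).
image_repr = [1.0, 0.0]
class_reprs = [[0.9, 0.1], [0.1, 0.9]]
loss = negative_log_likelihood(image_repr, class_reprs, target_index=0)
```

Under this assumption, classification, retrieval, and region-level tasks differ only in what the inputs and candidates are, while the loss stays the same.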

Abstract

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an improved optimizer to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
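To illustrate why unmixed sampling helps tasks that need large batches, the sketch below contrasts it with a mixed scheme. It assumes that "unmixed" means each training iteration draws a single task and devotes the whole batch to it, and that tasks are drawn in proportion to fixed weights; the task names, weights, and batch size are hypothetical and are not taken from the paper.

```python
import random

def unmixed_schedule(task_weights, num_iters, seed=0):
    """Pick exactly one task per iteration, with probability proportional
    to its weight (assumed sampling rule)."""
    rng = random.Random(seed)
    names = list(task_weights)
    weights = [task_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=num_iters)

def mixed_batch_sizes(task_weights, batch_size):
    """For contrast: split one iteration's batch across all tasks."""
    total = sum(task_weights.values())
    return {name: int(batch_size * w / total)
            for name, w in task_weights.items()}

# Hypothetical task mix and global batch size.
task_weights = {"classification": 0.5, "detection": 0.25, "retrieval": 0.25}
global_batch = 1024

schedule = unmixed_schedule(task_weights, num_iters=1000)
mixed = mixed_batch_sizes(task_weights, global_batch)

# Unmixed: whichever task is drawn sees all `global_batch` samples.
# Mixed: each task sees only its share of the batch in every iteration.
```

In this toy setup a retrieval-style task would train on 1024 samples per step whenever it is drawn under unmixed sampling, but on only its proportional share per step under mixed sampling. The trade-off is that consecutive steps optimize different tasks, which is where a multi-task-aware optimizer comes in.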

Paper Structure (19 sections, 10 equations, 4 figures, 8 tables)

Introduction
Related Work
Revisiting Uni-Perceivers
Method
Method
Encoding Images as General Region Proposals
Encoding Text with Language Models
General Task Adaptation
Sampling Strategy and Improved Optimization
Experiments
Datasets
Implementation Details
Ablation Studies
Main Results
Conclusion
...and 4 more sections

Figures (4)

Figure 1: Comparison of foundation models and Uni-Perceiver v2. $E^I$ and $E^T$ denote the image encoder and text encoder, respectively. In existing foundation models, task-specific decoders $D_\text{cls}$, $D_\text{det}, \dots$ are employed to tune $E^I$ and $E^T$ separately for each task during task-specific fine-tuning. The total number of parameters $\#P_\text{total}$ in adaptation grows with the number of visual and linguistic tasks, denoted as $N^I_\text{task}$ and $N^T_\text{task}$, respectively. By contrast, Uni-Perceiver v2 shares all parameters across downstream tasks through a general decoder $D_\text{general}$, with no task-specific fine-tuning. Unlike previous generalist models, our method can also effectively handle pillar tasks such as image classification, object detection, instance segmentation, and image-text retrieval.
Figure 2: Comparison with generalist models and commonly recognized strong task-specific models on pillar vision and vision-language tasks. For generalist models, including Uni-Perceiver v2, we report only results without any task-specific fine-tuning. Uni-Perceiver v2 (Uni-P v2) is compared with competitive specialized models, i.e., Swin-large [liu2021swin], DINO [zhang2022dino], Mask DINO [li2022mask], OSCAR-L [li2020oscar], and ALIGN [align], and with previous SoTA generalists, i.e., Uni-P-MoE-L [zhu2022uni], Pix2seq v2 [pix2seqv2], and Flamingo-3B [alayrac2022flamingo].
Figure 3: Architecture overview of our Uni-Perceiver v2.
Figure 4: Detection results on novel categories. We show detection results on images from the ImageNet-1k validation set. Note that Uni-Perceiver v2 uses only the COCO dataset to train the detection task, so most classes in ImageNet-1k are not seen during detection training.
