GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang

TL;DR

GiT presents a vanilla ViT-based generalist vision transformer that unifies image understanding, detection, and pixel-level tasks through a universal language interface and parallel decoding. By treating all targets as token sequences and training on 27 datasets without task-specific fine-tuning, GiT achieves strong zero-/few-shot generalization and competitive in-distribution performance while dramatically reducing architectural complexity. The approach leverages a window-based Transformer with a small set of global attention layers, and employs a multi-task training regime that enables cross-task mutual benefits. This work demonstrates a principled pathway to bridge vision and language by extending LLM-like sequence modeling to diverse 2D vision tasks, with broad implications for scalable, unified foundation models in computer vision.

Abstract

This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks using only a vanilla ViT. Motivated by the universality of the multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that enables auto-regressive decoding to unify various visual tasks, from image-level understanding (e.g., captioning), over sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, GiT sets a new benchmark in generalist performance and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar impact observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results over various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at https://github.com/Haiyang-W/GiT.

Paper Structure (30 sections, 3 equations, 11 figures, 23 tables)

This paper contains 30 sections, 3 equations, 11 figures, 23 tables.

Figures (11)

  • Figure 1: Generalist Vision Transformer. a) Examples of tasks supported by GiT. b) Architectural comparison between previous LVMs (e.g., LLaVA [liu2023visual]) and ours. GiT seamlessly handles various vision-centric tasks, particularly fine-grained visual perception, via a universal language interface using a plain transformer (e.g., ViT and GPT).
  • Figure 2: Task-level customization spans from image- to object- and pixel-level, with $N$ set to 1, 625 (25$\times$25), and 1764 (42$\times$42), respectively, in the actual implementation. Each red point marks a localized visual token, generated by bilinear interpolation of the image at its grid point. The task prompt is text, converted into a token via the text and out-of-vocabulary representation.
  • Figure 3: Our multi-task formulation is broadly illustrated as processing four types of user inputs: image patches, instructive language tokens, and $N$ parallel point-based subprocesses, each with its interpolated local image feature and task identifier for efficient parallel visual prediction. As for the language interface, we use a basic vocabulary, a specific vocabulary list required by the current task, and the task-agnostic out-of-vocabulary module (§\ref{sec:IO}) to dynamically create vocabulary sets for each task.
  • Figure 4: An illustration of pixel-level multiple parallel decoding. Consider a 64$\times$64 image divided into 16 patches, where each patch is 16$\times$16. With $N$=16 and a decoding step of 16 per subprocess, each grid point covers one patch to predict a 4$\times$4 semantic map, which is then upsampled 4$\times$ to the original size for the final result.
  • Figure 5: Visualizations of cross-attention between the task token and the image, with yellower colors indicating higher responses.
  • ...and 6 more figures
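
The arithmetic in the Figure 4 caption can be checked with a small sketch. The code below is only an illustration of the caption's numbers, not the authors' implementation: the function names `decoding_layout` and `stitch_and_upsample` are hypothetical, the row-major ordering of grid points and decoded tokens is an assumption, and nearest-neighbour upsampling is used because the caption does not say which interpolation is applied.

```python
def decoding_layout(image_size, patch_size, num_points, steps_per_point):
    """Geometry implied by the Figure 4 caption (square image assumed)."""
    patches_per_side = image_size // patch_size
    grid_side = round(num_points ** 0.5)        # N grid points form a square grid
    map_side = round(steps_per_point ** 0.5)    # each subprocess decodes a square map
    assert grid_side * grid_side == num_points
    assert map_side * map_side == steps_per_point
    region_side = image_size // grid_side       # pixels covered per grid point, per side
    return {
        "num_patches": patches_per_side ** 2,
        "grid_side": grid_side,
        "map_side": map_side,
        "upsample_factor": region_side // map_side,
    }


def stitch_and_upsample(token_maps, grid_side, map_side, factor):
    """Assemble per-point token maps into one low-resolution map, then upsample.

    token_maps[p] holds the map_side * map_side class ids decoded by grid
    point p. Row-major order for points and tokens is an assumption.
    """
    low = grid_side * map_side
    low_res = [[0] * low for _ in range(low)]
    for p, tokens in enumerate(token_maps):
        gy, gx = divmod(p, grid_side)
        for t, cls in enumerate(tokens):
            ty, tx = divmod(t, map_side)
            low_res[gy * map_side + ty][gx * map_side + tx] = cls
    # Nearest-neighbour upsampling (assumed; the caption does not specify).
    size = low * factor
    return [[low_res[y // factor][x // factor] for x in range(size)]
            for y in range(size)]


# Caption setting: 64x64 image, 16x16 patches, N = 16, 16 decoding steps each.
layout = decoding_layout(64, 16, 16, 16)
# Dummy predictions: grid point p outputs class id p at every step.
token_maps = [[p] * 16 for p in range(16)]
full_map = stitch_and_upsample(
    token_maps, layout["grid_side"], layout["map_side"], layout["upsample_factor"]
)
```

With the caption's values, the sketch yields 16 patches, a 4$\times$4 map per grid point, and a 4$\times$ upsampling factor, so the 16 subprocesses together cover the full 64$\times$64 image.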