Visual Instruction Pretraining for Domain-Specific Foundation Models

Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang

TL;DR

ViTP addresses the underexplored top-down influence of high-level understanding on low-level perception by embedding a Vision Transformer within a Vision-Language Model and training end-to-end with domain-specific visual instruction data. It introduces Visual Robustness Learning to regularize learning by token dropping, improving robustness and efficiency. Across 16 remote sensing and medical imaging benchmarks, ViTP achieves state-of-the-art performance on detection, segmentation, and change detection tasks, while also showing strong data efficiency and faster pretraining. This top-down pretraining approach offers a scalable path to domain-specific foundation models that leverage reasoning to enhance perceptual representations.

Abstract

Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
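
To make the idea of learning from "a sparse set of visual tokens" concrete, the following is a minimal sketch of random token dropping. It is not the authors' implementation: the function name `drop_visual_tokens`, the `keep_ratio` parameter, the uniform sampling strategy, and the point in the pipeline where dropping happens are all assumptions made for illustration.

```python
import random


def drop_visual_tokens(tokens, keep_ratio, rng=None):
    """Keep a random subset of visual tokens, preserving their original order.

    Hypothetical helper: the source only states that VRL regularizes
    learning by token dropping; uniform sampling and keep_ratio are assumed.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []
    rng = rng or random.Random()
    # Always keep at least one token so downstream layers have some input.
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Sample distinct positions, then sort so spatial order is retained.
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx


# Example: 196 patch tokens (a 14 x 14 grid), keeping an assumed 25%.
patch_tokens = list(range(196))
kept, kept_idx = drop_visual_tokens(patch_tokens, keep_ratio=0.25,
                                    rng=random.Random(0))
```

In this sketch the sparse subset would be what the language model sees during pretraining, so the backbone has to pack more information into each surviving token, and shorter sequences also reduce compute. This is one plausible reading of the robustness and efficiency benefits described above.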

Paper Structure

This paper contains 47 sections, 4 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: (a) The synergistic relationship between perception and understanding in modern CV. Our proposed ViTP forges a previously underexplored link from high-level understanding to low-level perception. (b) Self-attention activation maps for the query patch (marked with a red cross). ViTP identifies fine-grained object parts that are high-level semantically related.
  • Figure 2: Comparison of pretraining paradigms for Vision Transformer (ViT) foundation models. ViTP employs an instruction-following objective to directly instil domain-specific perception capabilities into the vision backbone.
  • Figure 3: ViTP sets new SOTA performance across a diverse range of downstream tasks in medical imaging and remote sensing.
  • Figure 4: A conceptual illustration of the ViTP framework. A Vision Transformer (ViT) backbone is embedded within a large VLM and then pretrained with a domain-specific instruction-following objective and Visual Robustness Learning (VRL). This process instils high-level semantic understanding into the ViT. The resulting weights are then used to initialize models for various downstream perception tasks.
  • Figure 5: Effect of Pretraining Duration. RSAR mAP improves with more pretraining steps before saturating at $\sim$8k steps.
  • ...and 4 more figures