
Bootstrapping SparseFormers from Vision Foundation Models

Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

TL;DR

The paper presents a simple yet effective method to bootstrap SparseFormers from large vision foundation models: it inherits pretrained weights for the cortex transformer and trains mainly a lightweight focusing transformer (plus a few early pretrained blocks) to align the final representations, using unlabeled images only. This enables rapid construction of high-performance, token-efficient vision backbones for both unimodal and multimodal tasks, with strong ImageNet-1K accuracy using as few as 49 tokens and favorable throughput. When bootstrapped from CLIP models, the resulting SparseFormers also show zero-shot and retrieval capabilities, and they integrate into multimodal LLM pipelines such as LLaVA, reducing the visual token budget while preserving vision-language capabilities. The work highlights a practical pathway to scalable, efficient visual backbones, while noting limitations: it relies on transformer-based foundation models and on access to their pretrained weights.

Abstract

The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and without labels or captions within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer
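To put the "only 49 tokens" figure from the abstract in perspective, the following minimal sketch compares it with the number of patch tokens a plain ViT would process at the same input size. The function names are hypothetical and not taken from the released code. The quadratic ratio covers only pairwise self-attention interactions; it is not a full FLOPs estimate, and it ignores the cost of the focusing transformer.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a plain ViT produces, excluding [CLS]."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def attention_pair_ratio(sparse_tokens: int, dense_tokens: int) -> float:
    """Ratio of token-pair counts in self-attention (quadratic in token count)."""
    return (sparse_tokens / dense_tokens) ** 2


# ViT-L/16 at 384x384, the foundation model named in the abstract.
dense = vit_token_count(384, 16)   # 24 * 24 = 576 patch tokens
sparse = 49                        # token count reported in the abstract

print(f"dense tokens:  {dense}")
print(f"token ratio:   {sparse / dense:.3f}")
print(f"pair ratio:    {attention_pair_ratio(sparse, dense):.4f}")
```

Under these assumptions, the bootstrapped model attends over well under a tenth of the tokens of its dense counterpart, which is where the reduced computational cost comes from.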

Paper Structure (26 sections, 7 equations, 7 figures, 8 tables)

This paper contains 26 sections, 7 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: SparseFormer bootstrapping procedure and task evaluation. a) With only images as inputs, we bootstrap SparseFormers from vision foundation models by inheriting weights and aligning final representations with far fewer tokens (e.g., $0.25\times$). b) Bootstrapped SparseFormers can serve as efficient vision encoders, either off the shelf or after fine-tuning, for both unimodal and multimodal tasks.
  • Figure 2: The detailed bootstrapping procedure. We typically set the number of sparse latent tokens in SparseFormers to $1/4$ of that in vision transformers. For all bootstrapping settings, the tunable blocks start at index $i = N/3$ and the frozen blocks start at index $2N/3$. Note that [CLS] denotes the extra token that vision transformers carry besides the visual tokens; there is no classification supervision on [CLS] during bootstrapping.
  • Figure 3: Visualization of the original SparseFormer-Tiny and our bootstrapped ViT-B/16-224$_\textrm{AugReg}$. For each image, from left to right: the input image, the token RoIs in the {first, third, last} stage, and the sampling points in the last stage of the focusing transformer.
  • Figure 4: RoI adjustments in each iteration in SF-B$_\textrm{AugReg}$.
  • Figure 5: RoI adjustments (cont'd).
  • ...and 2 more figures
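The block schedule and token budget described in the Figure 2 caption can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that $N$ is the number of transformer blocks in the foundation model, that indices are 0-based, and that fractional boundaries are rounded down. The caption does not say what happens to blocks before index $N/3$, so the sketch leaves them unassigned. All function names are hypothetical.

```python
def partition_blocks(num_blocks: int) -> dict:
    """Split block indices into tunable and frozen groups.

    Assumed reading of the caption: tunable blocks start at N/3,
    frozen blocks start at 2N/3 (0-based, rounded down).
    """
    tunable_start = num_blocks // 3
    frozen_start = (2 * num_blocks) // 3
    return {
        "tunable": list(range(tunable_start, frozen_start)),
        "frozen": list(range(frozen_start, num_blocks)),
    }


def sparse_token_count(dense_tokens: int) -> int:
    """Typical SparseFormer token budget: 1/4 of the ViT token count."""
    return dense_tokens // 4


# Example: a 12-block ViT with 14 x 14 = 196 patch tokens (patch 16 at 224x224).
groups = partition_blocks(12)
print("tunable blocks:", groups["tunable"])
print("frozen blocks: ", groups["frozen"])
print("sparse tokens: ", sparse_token_count(196))
```

With a 196-token ViT, the $1/4$ rule yields 49 sparse tokens, which matches the token count quoted in the TL;DR.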