GLID: Pre-training a Generalist Encoder-Decoder Vision Model

Jihao Liu, Jinliang Zheng, Yu Liu, Hongsheng Li

TL;DR

GLID addresses the pretraining–finetuning gap in vision by pretraining a generalist encoder-decoder with a unified query-to-answer formulation and a Masked Image Modeling objective. Fine-tuning then requires only replacing the top linear head, preserving the majority of pretraining weights and enabling rapid adaptation across object detection, segmentation, depth, and pose tasks. Empirical results show GLID matching or surpassing specialist models across six tasks with improved data efficiency and convergence speed. The work demonstrates that a single, end-to-end pre-trained architecture can effectively handle a broad range of vision problems with minimal task-specific design.

Abstract

This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown success in transfer learning, they still require task-specific sub-architectures to be appended for different downstream tasks, and these appended sub-architectures cannot benefit from large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme, both the pre-training pretext task and the downstream tasks are modeled as "query-to-answer" problems. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning, GLID keeps the pre-trained encoder-decoder and queries, replacing only the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks, including object detection, image segmentation, pose estimation, and depth estimation, outperforming or matching specialist models such as Mask2Former, DETR, ViTPose, and BinsFormer.

Paper Structure (15 sections, 8 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: MAE backbone-only pre-training vs. GLID pre-training. The GLID pre-training allows the pre-trained encoder-decoder to be fine-tuned on various vision tasks without task-specific decoder designs and outperforms MAE backbone-only pre-training.
  • Figure 2: Overview of GLID. During pre-training, we pre-train a task-agnostic encoder-decoder transformer architecture through masked image modeling. For fine-tuning on downstream tasks, we replace the pre-training linear head with a task-specific linear head. In this way, the proposed GLID minimizes the pretrain-finetune gap and enables the pre-trained architecture to better adapt to downstream tasks.
  • Figure 3: GLID pre-training pipeline.
  • Figure 4: Query's cross-attention maps on different tasks. Each column shows three-scale attention maps on a specific task. Task-specific queries are utilized for visualization.
  • Figure 5: Performance curves of loading different parts of pre-trained weights.
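
The head-swap scheme described in the abstract and in Figure 2 can be illustrated with a toy sketch. This is not the authors' implementation: plain weight matrices stand in for the transformer encoder and decoder, and every name here (`HIDDEN_DIM`, `build_pretrained`, `swap_head`, `reused_fraction`) and every dimension is a hypothetical choice made only for illustration. It shows the one structural point the text makes: fine-tuning reuses the encoder, decoder, and queries unchanged and re-initializes only the top linear head.

```python
import random

HIDDEN_DIM = 8  # hypothetical feature width, chosen only to keep the toy small


def make_linear(in_dim, out_dim, rng):
    """Return a randomly initialized weight matrix with out_dim rows of in_dim values."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def count_params(weight):
    """Count the scalar entries of a nested-list weight matrix."""
    return sum(len(row) for row in weight)


def build_pretrained(rng, patch_dim=48):
    """Toy stand-in for a pre-trained model: encoder, decoder, queries, and a
    linear head that maps decoder features to masked-patch predictions."""
    return {
        "encoder": make_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
        "decoder": make_linear(HIDDEN_DIM, HIDDEN_DIM, rng),
        "queries": make_linear(HIDDEN_DIM, 4, rng),  # 4 queries of width HIDDEN_DIM
        "head": make_linear(HIDDEN_DIM, patch_dim, rng),
    }


def swap_head(pretrained, task_out_dim, rng):
    """Build a fine-tuning model that shares every pre-trained component
    except the top linear head, which is freshly initialized for the task."""
    finetune = dict(pretrained)  # shallow copy: shared components are the same objects
    finetune["head"] = make_linear(HIDDEN_DIM, task_out_dim, rng)
    return finetune


def reused_fraction(model):
    """Fraction of the model's parameters that lie outside the head."""
    total = sum(count_params(w) for w in model.values())
    return (total - count_params(model["head"])) / total


rng = random.Random(0)
pretrained = build_pretrained(rng)
# Hypothetical downstream task with a 4-dimensional output per query (e.g. box coordinates).
detector = swap_head(pretrained, task_out_dim=4, rng=rng)
```

In this toy setting the reused share depends entirely on the made-up dimensions. In a real encoder-decoder, the head is a single linear layer on top of a much larger network, which is why the text describes the scheme as preserving the majority of the pre-trained weights.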