SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction

Sumin Son, Hyesong Choi, Dongbo Min

TL;DR

SG-MIM tackles the limited transfer of Masked Image Modeling to dense prediction by introducing a structured knowledge guided framework that decouples structured information encoding from the image encoder. A lightweight relational guidance module and semantic selective masking enable the model to leverage spatially structured cues without additional annotations, aligning pre-training with downstream tasks. Empirical results on monocular depth estimation and semantic segmentation demonstrate improved performance over existing MIM baselines, with efficiency benefits from an MLP-based guidance path; a Fourier analysis of the feature maps further points to effective preservation of high-frequency features. The approach offers a practical, generalizable boost to dense vision tasks and suggests a path toward richer 3D-aware pre-training in future work.

Abstract

Masked Image Modeling (MIM) techniques have redefined the landscape of computer vision, enabling pre-trained models to achieve exceptional performance across a broad spectrum of tasks. Despite their success, the full potential of MIM-based methods in dense prediction tasks, particularly in depth estimation, remains untapped. Existing MIM approaches primarily rely on single-image inputs, which makes it challenging to capture the crucial structured information, leading to suboptimal performance in tasks requiring fine-grained feature representation. To address these limitations, we propose SG-MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images. SG-MIM employs a lightweight relational guidance framework, allowing it to guide structured knowledge individually at the feature level rather than naively combining at the pixel level within the same architecture, as is common in traditional multi-modal pre-training methods. This approach enables the model to efficiently capture essential information while minimizing discrepancies between pre-training and downstream tasks. Furthermore, SG-MIM employs a selective masking strategy to incorporate structured knowledge, maximizing the synergy between general representation learning and structured knowledge-specific learning. Our method requires no additional annotations, making it a versatile and efficient solution for a wide range of applications. Our evaluations on the KITTI, NYU-v2, and ADE20k datasets demonstrate SG-MIM's superiority in monocular depth estimation and semantic segmentation.

Paper Structure (18 sections, 5 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Comparison with existing multimodal pre-training: (a) multimodal pre-training and (b) the proposed method (SG-MIM). While common multimodal pre-training methods, e.g., bachmann2022multimae and weinzaepfel2022croco, feed both types of data directly into the Transformer encoder, SG-MIM uses a lighter relational guidance framework.
  • Figure 2: Overview of the proposed SG-MIM. The image and the structured knowledge map are masked according to $M_I$ and $M_S$, respectively. The masked image, combined with masked tokens, enters the encoder (ViT dosovitskiy2020image or Swin Transformer liu2021swin), which produces the image latent representation $I_F$. $I_F$ is passed to the image prediction head, which predicts the original image values for the missing patches. Simultaneously, within the relational guidance framework, $I_F$ is transformed into a structured knowledge-guided image latent representation $I_{SF}$, aided by $S_F$, which is extracted through shallow MLP layers. $I_{SF}$ is then passed to a parallel prediction head that predicts the structured information for the visible image patches. Note that only the pre-trained Transformer encoder is used in the downstream tasks.
  • Figure 3: Relative log amplitudes of Fourier-transformed feature maps in dense prediction. We compare the relative log amplitudes of Fourier-transformed feature maps between SG-MIM and SimMIM xie2022simmim on dense prediction tasks. Panel (a) shows the feature maps for depth estimation on the KITTI geiger2013vision validation dataset, and panel (b) shows those for segmentation on the ADE20K zhou2017scene validation dataset.
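
The Figure 2 caption states that the image head predicts pixel values for the missing patches, while the parallel head predicts structured information for the visible patches. The sketch below illustrates only that split of prediction targets. The function names, the 16-patch grid, and the 0.5 mask ratio are hypothetical choices for illustration, and how $M_S$ is derived is not specified in this section, so it is deliberately left out.

```python
import random


def sample_patch_mask(num_patches, mask_ratio, seed=0):
    """Return one 0/1 flag per patch; 1 means the patch is masked (hidden)."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked else 0 for i in range(num_patches)]


def split_targets(image_mask):
    """Split patch indices by which head is supervised on them.

    Masked patches   -> image prediction head (reconstruct pixel values).
    Visible patches  -> structured-knowledge prediction head.
    """
    image_targets = [i for i, m in enumerate(image_mask) if m == 1]
    structure_targets = [i for i, m in enumerate(image_mask) if m == 0]
    return image_targets, structure_targets


# Hypothetical setup: a 4x4 patch grid with half of the patches masked.
m_i = sample_patch_mask(num_patches=16, mask_ratio=0.5)
image_targets, structure_targets = split_targets(m_i)
print("image head targets:    ", image_targets)
print("structure head targets:", structure_targets)
```

Under this reading, every patch contributes to exactly one of the two objectives, so the two heads never compete for supervision on the same patch.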
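
The quantity plotted in Figure 3 can be approximated as follows. This is a minimal NumPy sketch of one common definition of relative log amplitude (log amplitude along the frequency diagonal, measured against the zero-frequency component). The exact procedure used for the figure is not given in this section, so the function name, the channel averaging, and the choice of the diagonal are assumptions.

```python
import numpy as np


def relative_log_amplitude(feature_map):
    """Relative log amplitude of a (C, H, W) feature map with H == W.

    Returns the channel-averaged log amplitude along the frequency diagonal,
    from zero frequency to the highest frequency, minus the zero-frequency
    value (so the first entry is always 0).
    """
    # 2D FFT over the spatial axes, with zero frequency shifted to the center.
    spectrum = np.fft.fftshift(
        np.fft.fft2(feature_map, axes=(-2, -1)), axes=(-2, -1)
    )
    # Small epsilon avoids log(0) for perfectly smooth inputs.
    log_amp = np.log(np.abs(spectrum) + 1e-12).mean(axis=0)
    center = log_amp.shape[0] // 2
    diagonal = np.diagonal(log_amp)[center:]
    return diagonal - diagonal[0]


# Toy inputs (not real model features): noise keeps high frequencies,
# a constant map has none.
rng = np.random.default_rng(0)
rel_noisy = relative_log_amplitude(rng.standard_normal((4, 8, 8)))
rel_smooth = relative_log_amplitude(np.ones((4, 8, 8)))
print(rel_noisy)
print(rel_smooth)
```

A curve that stays closer to zero at the high-frequency end indicates that the feature map retains more high-frequency content, which is the kind of comparison the figure draws between SG-MIM and SimMIM.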