A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders

Muhammad Abdullah Jamal; Omid Mohareri

TL;DR

A progressive pre-training method for image understanding tasks that leverages RGB-D datasets by combining multi-modal contrastive learning, masked autoencoding, and denoising. The approach is scalable, robust, and suitable for pre-training on RGB-D datasets.

Abstract

In this paper, we propose a new progressive pre-training method for image understanding tasks that leverages RGB-D datasets. The method combines Multi-Modal Contrastive Masked Autoencoder and Denoising techniques and consists of two stages. In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations. In the second stage, we further pre-train the model using masked autoencoding and the denoising (noise prediction) objective used in diffusion models. Masked autoencoding focuses on reconstructing the missing patches in the input modality using local spatial correlations, while denoising learns the high-frequency components of the input data. Moreover, the second stage incorporates global distillation to leverage the knowledge acquired in stage one. Our approach is scalable, robust, and suitable for pre-training on RGB-D datasets. Extensive experiments on multiple datasets, including ScanNet, NYUv2, and SUN RGB-D, show the efficacy and superior performance of our approach. Specifically, we show an improvement of +1.3% mIoU over Mask3D on ScanNet semantic segmentation. We further demonstrate the effectiveness of our approach in the low-data regime by evaluating it on the semantic segmentation task against state-of-the-art methods.
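
To make the stage-one objective concrete, the following is a minimal NumPy sketch of a contrastive loss that pulls paired RGB and depth embeddings together. The abstract does not state the exact loss, so the symmetric InfoNCE form, the function name `info_nce`, and the temperature value of 0.07 are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def info_nce(rgb_emb, depth_emb, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings of shape (N, D).

    Assumption: row i of rgb_emb and row i of depth_emb come from the
    same patch (or image); all other rows in the batch act as negatives.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    rgb = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    depth = depth_emb / np.linalg.norm(depth_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the matching pairs.
    logits = rgb @ depth.T / temperature

    def cross_entropy_diag(scores):
        # Row-wise log-softmax, with the max subtracted for stability.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the RGB-to-depth and depth-to-RGB directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Under these assumptions, the loss is small when each RGB embedding is most similar to its own depth counterpart and grows when the pairing is scrambled, which is the behavior cross-modal alignment relies on.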

Paper Structure (31 sections, 6 equations, 3 figures, 14 tables)

Introduction
Related Work
Contrastive learning based pre-training.
Masked Autoencoder based pre-training.
Multi-Modal Learning.
Denoising.
Knowledge Distillation.
Method
Overview
Contrastive learning
Masked Autoencoding
Denoising
Feature Distillation
Progressive Pre-training
Results
...and 16 more sections

Figures (3)

Figure 1: Overview of our progressive pre-training. In the first stage, we pre-train the encoders using contrastive learning to align the RGB and depth patches. In the second stage, we initialize the modality-specific encoders with the stage-1 weights and pre-train them using multi-modal masked autoencoding and denoising, which reconstruct the masked patches in the depth input and predict the noise in the unmasked patches, respectively. Moreover, we incorporate feature distillation to leverage the knowledge acquired in stage 1.
Figure 2: Denoising. We first add Gaussian noise to the unmasked patches. We then pass the noise level $\sigma$ through an MLP and add the result to the encoded tokens before passing them through the decoder for reconstruction.
Figure 3: To show the data efficiency of our approach, we compare with recent state-of-the-art pre-training approaches on ScanNet 2D semantic segmentation under limited labeled data scenarios.
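
The data flow described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch only: the mask ratio of 0.75, the token and hidden sizes, the two-layer ReLU MLP, the identity stand-in for the encoder, and every function name below are assumptions that do not come from the paper.

```python
import numpy as np


def random_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of kept (unmasked) and masked patches."""
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])


def add_noise(patches, sigma, rng):
    """Add Gaussian noise at level sigma; return noisy patches and the noise."""
    noise = rng.standard_normal(patches.shape)
    return patches + sigma * noise, noise


def sigma_embedding(sigma, w1, b1, w2, b2):
    """Hypothetical two-layer MLP mapping scalar sigma to a token-sized vector."""
    hidden = np.maximum(0.0, sigma * w1 + b1)  # ReLU, shape (H,)
    return hidden @ w2 + b2                    # shape (D,)


rng = np.random.default_rng(0)
num_patches, dim, hidden_dim = 16, 8, 32
patches = rng.standard_normal((num_patches, dim))

# Split patches into visible and masked sets (assumed ratio).
keep_idx, mask_idx = random_mask(num_patches, mask_ratio=0.75, rng=rng)

# Corrupt only the unmasked patches; the noise is the prediction target.
sigma = 0.5
noisy_visible, noise_target = add_noise(patches[keep_idx], sigma, rng)

# Stand-in for the encoder (identity); a real model would use a ViT here.
encoded = noisy_visible

# Condition the decoder on the noise level by adding its embedding
# to every encoded token (broadcast over the token axis).
w1, b1 = rng.standard_normal(hidden_dim), np.zeros(hidden_dim)
w2, b2 = 0.1 * rng.standard_normal((hidden_dim, dim)), np.zeros(dim)
decoder_input = encoded + sigma_embedding(sigma, w1, b1, w2, b2)
```

In this sketch, `noise_target` plays the role of the noise the decoder is asked to recover on the unmasked patches, while `mask_idx` marks the patches left to the masked-autoencoding reconstruction.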
