KDC-MAE: Knowledge Distilled Contrastive Mask Auto-Encoder

Maheswar Bora, Saurabh Atreya, Aritra Mukherjee, Abhijit Das

TL;DR

Experimental results indicate that the contrastive masking correspondence, together with the KD learning objective, improves learning across multiple modalities and multiple tasks.

Abstract

In this work, we extend the self-supervised learning (SSL) paradigm by combining contrastive learning, self-distillation (knowledge distillation), and masked data modelling, the three major SSL frameworks, to learn a joint and coordinated representation. The proposed technique learns through the collaborative power of these different SSL objectives. To learn them jointly, we propose a new SSL architecture, KDC-MAE, a complementary masking strategy to learn the modular correspondence, and a weighted scheme for combining the objectives in a coordinated way. Experimental results indicate that the contrastive masking correspondence, together with the KD learning objective, improves learning across multiple modalities and multiple tasks.

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Proposed improvement on existing SSL by complementary mask and self-distillation. The model uses shared weights with two masked versions of the same audio-video pair passed through the encoder, generating separate joint latent embeddings. KL divergence loss aligns these embeddings, followed by a joint decoder that splits audio and video, with contrastive loss applied to the latent embeddings.
  • Figure 2: (a) Complementary mask generation. Note: the two video masks are complementary to each other, as are the two audio masks, but there is no relation between any audio mask and any video mask. (b) The proposed architecture of KDC-MAE (symbol index on the right).
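
The two ideas named in the captions above, complementary masks and a KL term that aligns the embeddings of the two masked views, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the even 50/50 split, the use of a softmax to turn embeddings into distributions, and the one-directional KL are all assumptions made for illustration; the summary does not state the masking ratio or the exact form of the loss.

```python
import math
import random


def complementary_masks(num_patches, seed=None):
    """Return two boolean masks (True = masked) that are exact complements.

    Assumption: every patch is hidden in exactly one of the two views.
    An even split is used here; the paper's masking ratio is not given
    in this summary.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    hidden_in_a = set(order[: num_patches // 2])
    mask_a = [i in hidden_in_a for i in range(num_patches)]
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(values)
    log_total = math.log(sum(math.exp(v - peak) for v in values))
    return [v - peak - log_total for v in values]


def kl_divergence(p_logits, q_logits):
    """KL(softmax(p) || softmax(q)) for two embedding vectors.

    Assumption: embeddings are normalised with a softmax before being
    compared; the paper may use a different normalisation or direction.
    """
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Audio and video masks are drawn independently, so the two audio masks
# complement each other and the two video masks complement each other,
# with no relation imposed across modalities.
audio_a, audio_b = complementary_masks(8, seed=0)
video_a, video_b = complementary_masks(16, seed=1)
```

In a training loop, the same encoder (shared weights) would process both masked views, and a term such as `kl_divergence(embedding_a, embedding_b)` would be added to the reconstruction and contrastive losses with a weight; the weights themselves are not given in this summary.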