Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts

Zhili Liu, Kai Chen, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, James T. Kwok

TL;DR

The paper tackles negative transfer in self-supervised MAE pre-training by introducing MoCE, a mixture of cluster-conditional experts. By clustering the pre-training data semantically and routing samples to cluster-specific experts via cluster embeddings, MoCE enables task-customized pre-training without labels. Across 11 downstream tasks, MoCE outperforms vanilla MAE by 2.45% on average and obtains new state-of-the-art self-supervised results on detection and segmentation. The approach allows efficient deployment by selecting a task-matched expert and demonstrates the feasibility of self-supervised MoE models on ImageNet.

Abstract

Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when the various downstream tasks have data distributions different from the pre-training data, the semantically irrelevant pre-training information might result in negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE), which can be trained once but provides customized pre-training models for diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates. Thus, each downstream task can be allocated to its customized model pre-trained with data most similar to the downstream data. Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation.
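
The abstract states that each downstream task is allocated to the model pre-trained on the most similar data, but it does not spell out the selection rule. A minimal sketch of one plausible rule follows, under stated assumptions: downstream image features are assigned to their nearest cluster centroid (squared Euclidean distance), and the task takes the expert of the most frequent cluster. The function names, the distance metric, the majority vote, and the toy features are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def nearest_cluster(feature, centroids):
    """Return the index of the centroid closest to `feature` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(feature, centroids[k]))


def select_cluster_for_task(features, centroids):
    """Assumed rule: pick the cluster that most downstream samples fall into."""
    counts = Counter(nearest_cluster(f, centroids) for f in features)
    return counts.most_common(1)[0][0]


# Toy data (hypothetical): three pre-training cluster centroids in a 2-D feature space.
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
# Features of four downstream images; three of them lie near centroid 1.
task_features = [[4.8, 5.1], [5.3, 4.7], [0.2, 0.1], [5.0, 5.5]]

chosen = select_cluster_for_task(task_features, centroids)
print("cluster selected for this task:", chosen)
```

Under this assumed rule, only the expert tied to the selected cluster would need to be deployed for the task, which is the efficiency argument made in the TL;DR.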

Paper Structure (35 sections, 5 equations, 3 figures, 11 tables)

Figures (3)

  • Figure 1: (a) Transfer performance of MAEs pre-trained on Split-A (blue), Split-B (red) and full ImageNet data (white). Only two of the eleven downstream tasks benefit from using the full ImageNet data for pre-training (more details in Section \ref{sec:negative_transfer}). (b) TokenMoE uses pixel RGB values as reconstruction targets. Thus, tokens with similar pixel values tend to be routed to the same expert, leading to two types of mistakes: (i) same semantics but routed to different experts, (ii) different semantics but routed to the same expert.
  • Figure 2: Model design comparison between (a) TokenMoE \cite{riquelme2021scaling} and (b) MoCE. Both methods use a multi-expert architecture; the main difference is the input to the gating network. MoCE feeds the gate the cluster embedding corresponding to the current token, as in Eqn. \ref{equ:moce}, instead of the token embedding used in Eqn. \ref{equ:tokenmoe}. Therefore, each expert is trained on semantically similar images, which alleviates the negative transfer phenomenon.
  • Figure 3: (a), (c): Routing heatmaps for experts in TokenMoE and MoCE. The x-axis is the expert ID, and the y-axis is the ImageNet semantic label ID. Darker green means that a higher proportion of the tokens belonging to the corresponding class are allocated to the expert. The labels are sorted differently in each panel for readability. (b): Example samples from the pre-training dataset of 3 MoCE experts. (d): Relative PSNR improvement of TokenMoE and MoCE over MAE for each downstream task.
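
The gating difference described in the Figure 2 caption can be illustrated with a small sketch. It is not the paper's implementation: the linear gate, the top-1 routing, the dimensions, and the random weights are assumptions chosen for illustration, and every token is assumed to inherit the cluster of the image it comes from. The point it shows is that a token-conditioned gate may scatter the tokens of one image across experts, whereas a cluster-conditioned gate sends all of them to the same expert.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_probs(gate_weights, embedding):
    """Linear gate: one score per expert, normalized with softmax."""
    scores = [sum(w * x for w, x in zip(row, embedding)) for row in gate_weights]
    return softmax(scores)


def token_gate(gate_weights, token_embedding):
    """TokenMoE-style routing: the gate input is the token's own embedding."""
    probs = gate_probs(gate_weights, token_embedding)
    return max(range(len(probs)), key=probs.__getitem__)  # top-1 expert (assumed)


def cluster_gate(gate_weights, cluster_embeddings, cluster_id):
    """MoCE-style routing: the gate input is the embedding of the token's cluster."""
    probs = gate_probs(gate_weights, cluster_embeddings[cluster_id])
    return max(range(len(probs)), key=probs.__getitem__)  # top-1 expert (assumed)


rng = random.Random(0)
dim, num_experts, num_clusters, num_tokens = 4, 3, 2, 5

# Hypothetical parameters: random gate weights and cluster embeddings.
gate_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
cluster_embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_clusters)]

# One image: five token embeddings, all sharing the image's cluster ID.
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
image_cluster_id = 1

token_routes = [token_gate(gate_weights, t) for t in tokens]
cluster_routes = [cluster_gate(gate_weights, cluster_embeddings, image_cluster_id)
                  for _ in tokens]

print("token-conditioned routes:  ", token_routes)
print("cluster-conditioned routes:", cluster_routes)
```

Because the cluster-conditioned gate never sees the token embedding, its output is constant within an image, which is what lets each expert be trained on semantically similar images only.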