Towards Multimodal Open-Set Domain Generalization and Adaptation through Self-supervision

Hao Dong; Eleni Chatzi; Olga Fink

TL;DR

Multimodal open-set domain generalization and adaptation (MM-OSDG/MM-OSDA) address learning under distribution shift when the input spans several modalities and test data may contain classes unseen during training. The paper introduces MOOSA, a self-supervised framework that combines two pretext tasks, Masked Cross-modal Translation (generative) and Multimodal Jigsaw Puzzles (contrastive), with an entropy-based mechanism that balances the losses of different modalities. It then extends the framework to MM-OSDA, where unlabeled target-domain data are available. MM-OSDG is formulated with multiple source domains and an unseen target domain whose label set contains the classes shared with the sources plus unknown classes. Experiments on EPIC-Kitchens and HAC show improvements over state-of-the-art baselines in open-set and closed-set settings, supporting the claim that the proposed pretext tasks and entropy weighting improve cross-modal generalization and open-set detection in realistic multimodal scenarios. Code is publicly available.
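The TL;DR names Multimodal Jigsaw Puzzles as one of the two pretext tasks but does not spell out how a puzzle is built. As an illustration only, the sketch below assumes a common jigsaw setup: each modality's feature vector is cut into equal chunks, the chunks of all modalities are reordered by one permutation drawn from a fixed set, and a classifier would be trained to predict which permutation was used. The helper names `build_permutation_set` and `make_multimodal_jigsaw` are hypothetical, and framing the task as permutation classification is an assumption, not necessarily the paper's (contrastive) formulation.

```python
import itertools
import random


def build_permutation_set(num_chunks):
    """Enumerate every ordering of chunk indices; an ordering's position in the
    list serves as its class label for the (hypothetical) pretext classifier."""
    return list(itertools.permutations(range(num_chunks)))


def make_multimodal_jigsaw(features, chunks_per_modality, perms, rng):
    """Build one jigsaw sample from per-modality feature vectors.

    Hypothetical sketch: each modality's features are split into equal chunks,
    the chunks of all modalities are pooled, reordered by a randomly chosen
    permutation from `perms`, and concatenated. Returns (shuffled, label),
    where `label` indexes the permutation a classifier would have to predict.
    """
    pieces = []
    for feat in features:  # one feature list per modality
        size, rem = divmod(len(feat), chunks_per_modality)
        if rem:
            raise ValueError("feature length must be divisible by chunks_per_modality")
        pieces.extend(feat[i * size:(i + 1) * size] for i in range(chunks_per_modality))
    if len(perms[0]) != len(pieces):
        raise ValueError("permutations must cover every chunk of every modality")
    label = rng.randrange(len(perms))
    shuffled = [x for idx in perms[label] for x in pieces[idx]]
    return shuffled, label


# Usage: two modalities (e.g. video and audio features), two chunks each.
rng = random.Random(0)
video = [0.1 * i for i in range(8)]
audio = [1.0 + 0.1 * i for i in range(8)]
perms = build_permutation_set(4)  # 2 modalities x 2 chunks = 4 chunks, 24 labels
x, y = make_multimodal_jigsaw([video, audio], 2, perms, rng)
```

Because chunks from different modalities are mixed, solving the puzzle in this setup requires relating the modalities to each other rather than reasoning about each one in isolation.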

Abstract

The task of open-set domain generalization (OSDG) involves recognizing novel classes within unseen domains, which becomes more challenging when multiple modalities are used as input. Existing works have only addressed unimodal OSDG within the meta-learning framework, without considering multimodal scenarios. In this work, we introduce a novel approach to address Multimodal Open-Set Domain Generalization (MM-OSDG) for the first time, utilizing self-supervision. To this end, we introduce two innovative multimodal self-supervised pretext tasks: Masked Cross-modal Translation and Multimodal Jigsaw Puzzles. These tasks facilitate the learning of representative multimodal features, thereby enhancing generalization and open-class detection capabilities. Additionally, we propose a novel entropy weighting mechanism to balance the loss across different modalities. Furthermore, we extend our approach to also tackle the Multimodal Open-Set Domain Adaptation (MM-OSDA) problem, where unlabeled data from the target domain are available. Extensive experiments conducted under MM-OSDG, MM-OSDA, and Multimodal Closed-Set DG settings on the EPIC-Kitchens and HAC datasets demonstrate the efficacy and versatility of the proposed approach. Our source code is available at https://github.com/donghao51/MOOSA.
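The abstract mentions an entropy weighting mechanism that balances the loss across modalities but does not give its form. The sketch below is one plausible instance under stated assumptions: each modality's weight is a softmax over the mean Shannon entropy of that modality's predicted class probabilities, so a less confident modality receives a larger share of the total loss. The function names and the weighting direction are hypothetical choices for illustration, not the paper's definition.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_loss(modality_losses, modality_probs):
    """Combine per-modality losses using prediction-entropy weights.

    Assumption (not from the paper): weight_m = softmax over modalities of the
    mean entropy of modality m's predictions, so a more uncertain modality gets
    a larger share of the total loss. Returns (total_loss, weights).
    """
    mean_entropy = {
        m: sum(entropy(p) for p in probs) / len(probs)
        for m, probs in modality_probs.items()
    }
    normalizer = sum(math.exp(h) for h in mean_entropy.values())
    weights = {m: math.exp(h) / normalizer for m, h in mean_entropy.items()}
    total = sum(weights[m] * loss for m, loss in modality_losses.items())
    return total, weights


# Usage: video predictions are confident, audio predictions are uniform.
losses = {"video": 0.8, "audio": 1.2}
probs = {"video": [[0.9, 0.1], [0.8, 0.2]], "audio": [[0.5, 0.5], [0.5, 0.5]]}
total, weights = entropy_weighted_loss(losses, probs)
```

Replacing `math.exp(h)` with `math.exp(-h)` would instead up-weight the more confident modality; which direction the paper uses is not stated in this excerpt.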

Paper Structure (20 sections, 8 equations, 7 figures, 19 tables)

Introduction
Related Work
Methodology
Multimodal Open-Set Domain Generalization
Motivation: Self-supervised Pretext Tasks
Generative Task: Masked Cross-modal Translation
Contrastive Task: Multimodal Jigsaw Puzzles
Entropy Weighting and Minimization
Final Loss and Inference
Extension to Multimodal Open-Set Domain Adaptation
Experiments
Experimental Setting
Results
Ablation Studies and Analysis
Conclusion
...and 5 more sections

Figures (7)

Figure 1: Our proposed MOOSA framework for MM-OSDG. EntWei & EntMin: Entropy Weighting and Minimization.
Figure 2: The average $HOS$ with the varying known-unknown split rates on EPIC-Kitchens dataset.
Figure 3: The distribution curves of prediction confidence for unknown and known classes obtained using various methods.
Figure 4: Class splits for different label sets across domains on EPIC-Kitchens dataset.
Figure 5: Parameter sensitivity to the hyperparameters in the self-supervised tasks.
...and 2 more figures
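Figure 2 reports the average $HOS$, which this excerpt does not define. In the open-set domain adaptation literature, HOS is commonly the harmonic mean of the accuracy on known classes and the accuracy on the unknown class; the sketch below computes it under that assumption. Some works average the known-class accuracy per class rather than per sample, so numbers may differ from the paper's exact protocol. The function name `hos` and the convention of a single `unknown_label` are illustrative.

```python
def hos(y_true, y_pred, unknown_label):
    """Harmonic mean of known-class accuracy and unknown-class accuracy.

    Assumes all unknown-class samples share `unknown_label` in both y_true and
    y_pred, and uses per-sample (not per-class) accuracy for known classes.
    Returns (hos, known_acc, unknown_acc).
    """
    pairs = list(zip(y_true, y_pred))
    known = [(t, p) for t, p in pairs if t != unknown_label]
    unknown = [(t, p) for t, p in pairs if t == unknown_label]
    if not known or not unknown:
        raise ValueError("need both known and unknown samples")
    known_acc = sum(t == p for t, p in known) / len(known)
    unknown_acc = sum(t == p for t, p in unknown) / len(unknown)
    if known_acc + unknown_acc == 0:
        return 0.0, known_acc, unknown_acc
    score = 2 * known_acc * unknown_acc / (known_acc + unknown_acc)
    return score, known_acc, unknown_acc


# Usage: classes 0-2 are known, -1 marks the unknown class.
score, known_acc, unknown_acc = hos([0, 1, 2, -1, -1], [0, 1, 0, -1, 1], unknown_label=-1)
# known_acc = 2/3, unknown_acc = 1/2, score = 4/7 (about 0.571)
```

The harmonic mean penalizes trading one side for the other: a classifier that labels everything as unknown reaches an unknown accuracy of 1 but a known accuracy of 0, so its HOS is 0.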
