Towards Multimodal Domain Generalization with Few Labels

Hongzhao Li, Hao Dong, Hualei Wan, Shupan Li, Mingliang Xu, Muhammad Haris Khan

Abstract

Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at https://github.com/lihongzhao99/SSMDG.
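
The consensus idea in the abstract can be illustrated with a small sketch. The abstract does not give the selection rule, so the version below is an assumption: an unlabeled sample receives a pseudo-label only when the fused prediction is confident (its top probability is at least a hypothetical threshold `tau`) and every unimodal prediction picks the same class. All other samples are routed to a non-consensus set, which is the kind of sample Disagreement-Aware Regularization is described as handling. The function names, the modality keys, and `tau = 0.95` are illustrative and are not taken from the paper or its repository.

```python
from typing import Dict, List, Optional, Tuple


def argmax(probs: List[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def consensus_pseudo_label(
    fused: List[float],
    unimodal: Dict[str, List[float]],
    tau: float = 0.95,  # hypothetical confidence threshold (assumption)
) -> Optional[int]:
    """Return a pseudo-label if fused and unimodal predictions agree, else None.

    Assumed rule (not confirmed by the source): the fused prediction must be
    confident, and every unimodal argmax must match the fused argmax.
    """
    fused_cls = argmax(fused)
    confident = fused[fused_cls] >= tau
    agree = all(argmax(p) == fused_cls for p in unimodal.values())
    return fused_cls if (confident and agree) else None


def split_batch(
    batch: List[Tuple[List[float], Dict[str, List[float]]]],
    tau: float = 0.95,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split a batch into (index, pseudo-label) pairs and non-consensus indices."""
    consensus: List[Tuple[int, int]] = []
    non_consensus: List[int] = []
    for idx, (fused, unimodal) in enumerate(batch):
        label = consensus_pseudo_label(fused, unimodal, tau)
        if label is None:
            non_consensus.append(idx)
        else:
            consensus.append((idx, label))
    return consensus, non_consensus
```

Under these assumptions, a sample whose fused prediction is confident but whose audio branch votes for a different class lands in the non-consensus set rather than being discarded, so it remains available to a separate regularizer.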

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Illustration of four related learning settings: Semi-Supervised Multimodal Learning (SSML), Multimodal Domain Generalization (MMDG), Semi-Supervised Domain Generalization (SSDG), and our proposed Semi-Supervised Multimodal Domain Generalization (SSMDG). SSML overlooks domain shifts, MMDG cannot leverage unlabeled data, and SSDG is restricted to single-modality inputs. (b) Performance comparison in the SSMDG setting, highlighting the limitations of existing paradigms and the effectiveness of our proposed method.
  • Figure 2: Overview of the proposed framework for SSMDG. The framework jointly leverages labeled and unlabeled multimodal data from multiple source domains through three key components: Consensus-Driven Consistency Regularization (CDCR), Disagreement-Aware Regularization (DAR), and Cross-Modal Prototype Alignment (CMPA), enabling robust generalization to unseen target domains.
  • Figure 3: Pseudo-label accuracy and unlabeled data utilization on the HAC benchmark. Our method consistently achieves superior accuracy while maintaining a higher utilization rate compared to competitive baselines.
  • Figure 4: Ablation analysis of different modules on pseudo-label accuracy and unlabeled data utilization.
  • Figure 5: Visualization of domain shifts in the experimental benchmarks. (a) Example frames from the EPIC-Kitchens dataset across three environments (D1, D2, D3), highlighting variations in viewpoint and illumination. (b) Samples from the HAC dataset (Human, Animal, Cartoon), illustrating stylistic differences across domains.
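
The two quantities plotted in Figures 3 and 4, pseudo-label accuracy and unlabeled data utilization, can be computed as in the sketch below. The captions do not define them, so the common definitions are assumed here: utilization is the fraction of unlabeled samples that receive a pseudo-label, and accuracy is the fraction of those assigned pseudo-labels that match the held-out ground truth. The function name and the use of `None` for "no pseudo-label assigned" are illustrative choices, not the paper's.

```python
from typing import Optional, Sequence, Tuple


def pseudo_label_stats(
    pseudo_labels: Sequence[Optional[int]],
    true_labels: Sequence[int],
) -> Tuple[float, float]:
    """Return (accuracy, utilization) under the assumed definitions.

    pseudo_labels[i] is None when sample i was not assigned a pseudo-label.
    """
    if len(pseudo_labels) != len(true_labels):
        raise ValueError("pseudo_labels and true_labels must have equal length")

    # Keep only the samples that actually received a pseudo-label.
    selected = [(p, t) for p, t in zip(pseudo_labels, true_labels) if p is not None]

    utilization = len(selected) / len(pseudo_labels) if pseudo_labels else 0.0
    accuracy = (
        sum(1 for p, t in selected if p == t) / len(selected) if selected else 0.0
    )
    return accuracy, utilization
```

The two numbers trade off against each other: a stricter selection rule usually raises accuracy while lowering utilization, which is why the captions report them together.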