Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning

Donghuo Zeng, Hao Niu, Masato Taya

Abstract

Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and co-occurring audio-visual pairs can be spurious. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures that individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and an affinity-weighted soft top-k InfoNCE loss; an EMA teacher operating on unmasked inputs through the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile the competing objectives, and an optional distillation loss transfers the teacher's geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
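
To make the local-level objective concrete, the sketch below shows one plausible form of the teacher-guided, affinity-weighted soft top-k InfoNCE described above. It is a minimal illustration under stated assumptions, not the authors' reference implementation: the function name soft_topk_infonce, the cosine-similarity affinities, the temperature tau, and the choice k=5 are all hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_topk_infonce(student_a, student_v, teacher_a, teacher_v, k=5, tau=0.07):
    """Illustrative soft top-k InfoNCE with teacher-mined affinities.

    For each audio query, the EMA teacher's clean embeddings select the
    k most similar visual instances in the batch; their softmax-normalized
    teacher similarities act as soft multi-positive targets for the student.
    """
    # Teacher affinities are computed on unmasked inputs; no gradient
    # flows back into the teacher.
    with torch.no_grad():
        ta = F.normalize(teacher_a, dim=-1)
        tv = F.normalize(teacher_v, dim=-1)
        aff = ta @ tv.t()                          # (B, B) cosine similarities
        topk_val, topk_idx = aff.topk(k, dim=-1)   # keep the k nearest neighbors
        targets = torch.zeros_like(aff)
        targets.scatter_(1, topk_idx, F.softmax(topk_val / tau, dim=-1))

    # Student similarities are computed on masked inputs.
    sa = F.normalize(student_a, dim=-1)
    sv = F.normalize(student_v, dim=-1)
    log_prob = F.log_softmax(sa @ sv.t() / tau, dim=-1)   # (B, B)

    # Cross-entropy against the teacher's soft targets (audio -> visual).
    return -(targets * log_prob).sum(dim=-1).mean()

# Example: a batch of 8 clips with 256-dim embeddings (k must not exceed B).
B, D = 8, 256
loss = soft_topk_infonce(torch.randn(B, D), torch.randn(B, D),
                         torch.randn(B, D), torch.randn(B, D))
```

Only the audio-to-visual direction is shown; a full implementation would typically average this loss with the symmetric visual-to-audio direction.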

Figures (4)

  • Figure 1: Overview of the HSC-MAE architecture. Pre-extracted audio and visual features are processed by shared encoders and a cross-attention fusion block under two coordinated training modes. The student MAE path applies sample-level value masking and is optimized with a reconstruction loss and a teacher-guided soft top-$k$ InfoNCE loss, producing robust embeddings $(Z_a, Z_v)$. In parallel, the EMA teacher CCA path preserves input values and enforces global cross-modal alignment via DCCA, yielding clean embeddings $(Z'_a, Z'_v)$. The teacher provides stable semantic affinities and geometric targets for neighborhood mining and distillation, while gradients are blocked from flowing into the teacher (a minimal sketch of this update follows the figure list).
  • Figure 2: (Left) Decomposition of training losses; (Right) test mAP over epochs (1-100) on AVE under component-wise ablations.
  • Figure 3: Effect of the mask ratio on the UCMR task on the AVE and VEGAS datasets. Solid curves show the average mAP, while shaded regions indicate the absolute gap between the two retrieval directions.
  • Figure 4: Qualitative audio-visual cross-modal retrieval results on AVE. For each query (audio or visual), the top-10 retrieved results are shown.
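
To make the teacher-student coupling in Figure 1 concrete, the following is a minimal sketch of the EMA teacher update and gradient blocking, assuming the common convention teacher = m * teacher + (1 - m) * student. The momentum value m = 0.999 and the stand-in encoder module are assumptions for illustration, not values taken from the paper.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.999):
    """Exponential-moving-average (EMA) teacher update (illustrative sketch)."""
    # The teacher tracks the student's weights; detach() and the no_grad
    # decorator ensure no gradient ever flows into the teacher.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1.0 - m)

# Typical setup: the teacher starts as a frozen copy of the student.
student = torch.nn.Linear(512, 256)   # hypothetical stand-in for the shared encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

ema_update(teacher, student, m=0.999)  # called once per optimization step
```

Because the update runs under torch.no_grad() and the teacher's parameters have requires_grad set to False, the teacher supplies targets (soft affinities, canonical geometry) without ever receiving gradients, matching the stop-gradient behavior described in the Figure 1 caption.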