Structured-Noise Masked Modeling for Video, Audio and Beyond

Aritra Bhowmik, Fida Mohammad Thoker, Carlos Hinojosa, Bernard Ghanem, Cees G. M. Snoek

TL;DR

The paper tackles the problem that random masking in self-supervised masked modeling ignores modality-specific structure. It introduces structured-noise masking, generating masks by filtering white noise into color noise patterns: Green3D noise for video (spatiotemporal structure), Optim Blue noise for audio (uniform patch distribution in spectrograms), and combined audio-visual masking. Key contributions include three new masking schemes, extensive cross-modal evaluations (video action recognition, video object segmentation, audio classification, and audio-visual classification), and detailed ablations showing the importance of mask color, 3D vs 2D masking, and masking ratio. The findings demonstrate that modality-aware masking improves representation learning with no computational overhead, suggesting broad applicability to self-supervised learning pipelines across vision, audio, and multimodal domains.

Abstract

Masked modeling has emerged as a powerful self-supervised learning framework, but existing methods largely rely on random masking, disregarding the structural properties of different modalities. In this work, we introduce structured noise-based masking, a simple yet effective approach that naturally aligns with the spatial, temporal, and spectral characteristics of video and audio data. By filtering white noise into distinct color noise distributions, we generate structured masks that preserve modality-specific patterns without requiring handcrafted heuristics or access to the data. Our approach improves the performance of masked video and audio modeling frameworks without any computational overhead. Extensive experiments demonstrate that structured noise masking achieves consistent improvement over random masking for standard and advanced masked modeling methods, highlighting the importance of modality-aware masking strategies for representation learning.
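The core idea of filtering white noise into a colored-noise pattern and then turning it into a mask can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an ideal frequency-domain band-pass as the "green" filter, a per-frame ranking of noise values to reach the target masking ratio, and a token grid of 8 x 14 x 14. The function names `band_pass_noise_3d` and `noise_to_mask` and the band limits `low` and `high` are hypothetical.

```python
import numpy as np


def band_pass_noise_3d(shape, low, high, seed=0):
    """Filter 3D white noise with an ideal band-pass in the frequency domain.

    Assumption: an ideal (hard-edged) band-pass stands in for the "green"
    filter; the band limits are illustrative, in cycles per sample.
    """
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(shape)
    # Radial frequency of every FFT bin over (time, height, width).
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    keep = (radius >= low) & (radius <= high)
    # The pass band is symmetric in frequency, so the result is real up to
    # floating-point error.
    return np.fft.ifftn(np.fft.fftn(white) * keep).real


def noise_to_mask(noise, mask_ratio):
    """Mask the patches with the highest noise values in each frame.

    Returns a boolean array of the same shape as `noise` (True = masked).
    Assumption: ranking per frame, so every frame has the same ratio.
    """
    t = noise.shape[0]
    flat = noise.reshape(t, -1)
    n_patches = flat.shape[1]
    n_mask = int(round(mask_ratio * n_patches))
    order = np.argsort(flat, axis=1)
    mask = np.zeros(flat.shape, dtype=bool)
    np.put_along_axis(mask, order[:, n_patches - n_mask:], True, axis=1)
    return mask.reshape(noise.shape)


# Hypothetical token grid: 8 temporal slices of 14 x 14 patches.
noise = band_pass_noise_3d((8, 14, 14), low=0.1, high=0.3)
mask = noise_to_mask(noise, mask_ratio=0.75)
```

Because the noise is filtered jointly over time and space, neighbouring frames share low-frequency structure, which is the property the abstract attributes to structured masks. The mask depends only on the random seed, not on the input data.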

Paper Structure

This paper contains 27 sections, 9 equations, 8 figures, 17 tables, 1 algorithm.

Figures (8)

  • Figure 1: Structured noise masking for video. Traditional random masking disrupts temporal consistency, leading to abrupt masking across frames. In contrast, our Green 3D noise introduces structured masking that evolves smoothly over time, preserving motion continuity. This enables the model to learn richer spatiotemporal representations while maintaining a challenging reconstruction task.
  • Figure 2: Generated masks from 2D random ($n_w$), blue ($n_b$), green ($n_g$), and red ($n_r$) noise, where $\eta$ corresponds to the same masking generator function used in [he2022masked, hinojosa2024colormae]. These masks capture spatial structure but lack temporal consistency, limiting their suitability for video data.
  • Figure 3: Unlike traditional random tube masking, which enforces strict temporal consistency, our proposed Green 3D masking generates structured random masks that evolve smoothly across consecutive frames. This smooth evolution prevents abrupt masking changes, enabling the model to better capture natural temporal dynamics and continuity in video data.
  • Figure 4: (left) Illustration of the metric used to determine the concentration of visible patches in a window $U^i_P$ of the mask $M^i_{x,y}$. (right) Example of the initial mask ($M^i$), with clusters of visible patches, and final mask ($\hat{M}^{i}_{b}$) obtained with our 2D blue noise masking algorithm, with uniformly distributed visible patches. Note the improved uniformity in the final mask, ensuring better coverage and reducing undesirable clustering effects.
  • Figure A.1: Comparison of different masking strategies in VideoMAE pretraining on SSv2 videos (masking ratio 0.75). Standard tube masking struggles to align with video structures, while 2D noise-based masking introduces some spatial coherence but lacks temporal consistency. Our proposed 3D Green masking effectively captures spatiotemporal structures, preserving motion continuity across frames.
  • ...and 3 more figures
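
The window-based concentration measure described in the Figure 4 caption can be read, loosely, as a count of visible patches around each mask position. The sketch below follows that reading only. The square window, the wrap-around at the borders, and the name `visible_concentration` are assumptions, and it does not reproduce the paper's blue noise masking algorithm.

```python
import numpy as np


def visible_concentration(mask, window=3):
    """Count visible patches in a window x window neighbourhood of each patch.

    `mask` is a 2D boolean array (True = masked, False = visible).
    Assumption: neighbourhoods wrap around at the borders.
    """
    visible = (~mask).astype(int)
    r = window // 2
    counts = np.zeros_like(visible)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Shift the visibility map so each neighbour lands on the centre.
            counts += np.roll(visible, (dy, dx), axis=(0, 1))
    return counts


# Toy 6 x 6 mask with a 2 x 2 cluster of visible patches in one corner.
toy_mask = np.ones((6, 6), dtype=bool)
toy_mask[:2, :2] = False
counts = visible_concentration(toy_mask, window=3)
```

High counts flag clusters of visible patches. A mask with uniformly spread visible patches, as in the final mask of Figure 4, would give a flatter count map.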