FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning
Jiaoyang Li, Jun Fang, Tianhao Gao, Xiaohui Zhang, Zhiyuan Liu, Chao Liu, Pengzhang Liu, Qixia Jiang
TL;DR
The paper addresses robust multimodal representation learning by identifying the limitations of static, uniform noise and proposing FANoise, a feature-adaptive noise injection framework guided by the spectral structure of feature representations. It grounds the design in a theoretical analysis of InfoNCE gradients and spectral perturbation, introducing a two-stage process that uses singular value decomposition to adapt noise across principal directions and maintain a robust signal-to-noise ratio. Empirically, FANoise (notably with sublinear scaling) improves performance across multiple backbones on the Massive Multimodal Embedding Benchmark (MMEB), generalizing better to both in-distribution and out-of-distribution tasks while remaining computationally efficient. The work offers a practical, plug-and-play approach to improving robustness and generalization in multimodal contrastive learning, with potential applicability beyond MMEB.
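To make the idea of adapting noise across principal directions concrete, the following is a minimal NumPy sketch, not the authors' implementation. The function name `spectral_adaptive_noise`, the parameters `base_scale` and `power`, and the specific scaling rule (noise standard deviation proportional to the normalized singular value raised to `power`, with `power < 1` standing in for "sublinear scaling") are all assumptions made for illustration; the summary above does not specify the exact rule.

```python
import numpy as np


def spectral_adaptive_noise(features, base_scale=0.1, power=0.5, rng=None):
    """Add Gaussian noise whose scale varies per principal direction.

    Hypothetical sketch: the scaling rule below is an assumption,
    not the rule used by FANoise.

    features: (batch, dim) array of embeddings.
    base_scale: noise standard deviation along the top principal direction.
    power: exponent on the normalized singular values; power < 1 gives a
        sublinear dependence on the singular value.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Stage 1: estimate the spectral structure of the (centered) batch.
    centered = features - features.mean(axis=0, keepdims=True)
    # Thin SVD: centered = U @ diag(s) @ vt; rows of vt are principal directions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)

    # Stage 2: sample noise in the principal basis with per-direction scales.
    s_norm = s / (s.max() + 1e-12)          # normalize so the top direction is 1
    scales = base_scale * s_norm ** power   # shape (k,)
    coeffs = rng.standard_normal((features.shape[0], s.shape[0])) * scales

    # Map the noise back to feature space and add it to the embeddings.
    return features + coeffs @ vt
```

Under this assumed rule, directions with larger singular values receive proportionally more noise in absolute terms, but because the dependence is sublinear, the ratio of signal to noise along those directions still grows with the singular value. Calling `spectral_adaptive_noise(embeddings, power=1.0)` would instead give a linear rule.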
Abstract
Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature-distribution perspectives, using the InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base vision-language models (VLMs).
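For readers unfamiliar with the InfoNCE loss used as the representative objective, a minimal NumPy sketch of its standard in-batch form is given here. The function name `info_nce` and the default `temperature` value are assumptions for illustration; the abstract does not state the exact variant or hyperparameters used in the paper.

```python
import numpy as np


def info_nce(queries, keys, temperature=0.07):
    """Standard in-batch InfoNCE loss (illustrative sketch).

    queries, keys: (batch, dim) arrays; keys[i] is the positive for
    queries[i], and every other row of keys acts as a negative.
    """
    # Cosine similarity: L2-normalize both sides before the dot product.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return -np.mean(np.diag(log_prob))
```

In a noise-injection setting, the perturbation would be applied to the embeddings before they are passed in, for example `info_nce(queries + noise, keys)`, where `noise` is whatever perturbation is being studied.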
