Spectral Disentanglement and Enhancement: A Dual-domain Contrastive Framework for Representation Learning

Jinjin Guo, Yexin Li, Zhichao Huang, Jun Fang, Zhiyuan Liu, Chao Liu, Pengzhang Liu, Qixia Jiang

TL;DR

The paper tackles the bottleneck of spectral imbalance in large-scale multimodal contrastive learning by introducing Spectral Disentanglement and Enhancement (SDE). SDE adaptively partitions feature dimensions via real-time SVD into strong, weak, and noise subspaces, applies curriculum-based enhancement to amplify informative components and suppress noise, and couples this with a dual-domain contrastive loss that enforces both instance-level alignment and spectral-structure alignment. The approach yields robust, generalizable representations, outperforming state-of-the-art baselines on MMEB across classification, VQA, retrieval, and grounding, and demonstrates strong cross-task transfer. By integrating spectral regularization into the training loop, SDE provides a scalable, practical solution for improving multimodal representation learning with large VLM backbones.

Abstract

Large-scale multimodal contrastive learning has recently achieved impressive success in learning rich and transferable representations, yet it remains fundamentally limited by the uniform treatment of feature dimensions and the neglect of the intrinsic spectral structure of the learned features. Empirical evidence indicates that high-dimensional embeddings tend to collapse into narrow cones, concentrating task-relevant semantics in a small subspace, while the majority of dimensions remain occupied by noise and spurious correlations. Such spectral imbalance and entanglement undermine model generalization. We propose Spectral Disentanglement and Enhancement (SDE), a novel framework that bridges the gap between the geometry of the embedded spaces and their spectral properties. Our approach leverages singular value decomposition to adaptively partition feature dimensions into strong signals that capture task-critical semantics, weak signals that reflect ancillary correlations, and noise representing irrelevant perturbations. A curriculum-based spectral enhancement strategy is then applied, selectively amplifying informative components with theoretical guarantees on training stability. Building upon the enhanced features, we further introduce a dual-domain contrastive loss that jointly optimizes alignment in both the feature and spectral spaces, effectively integrating spectral regularization into the training process and encouraging richer, more robust representations. Extensive experiments on large-scale multimodal benchmarks demonstrate that SDE consistently improves representation robustness and generalization, outperforming state-of-the-art methods. SDE integrates seamlessly with existing contrastive pipelines, offering an effective solution for multimodal representation learning.
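
The partition-and-enhance step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the cumulative-energy thresholds (`strong_energy`, `noise_energy`), the gain values, and the function names `partition_spectrum` and `enhance_features` are all assumptions introduced here, since the abstract does not specify how the subspaces are delimited or how strongly each is rescaled.

```python
import numpy as np

def partition_spectrum(s, strong_energy=0.90, noise_energy=0.99):
    """Split sorted singular values into strong / weak / noise groups.

    Assumption: groups are delimited by cumulative spectral energy.
    The thresholds are illustrative, not taken from the paper.
    Returns (k_strong, k_weak): indices [0, k_strong) are strong,
    [k_strong, k_weak) are weak, and [k_weak, len(s)) are noise.
    """
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Number of leading components needed to reach each threshold.
    k_strong = int(np.searchsorted(energy, strong_energy)) + 1
    k_weak = int(np.searchsorted(energy, noise_energy)) + 1
    k_strong = min(k_strong, len(s))
    k_weak = min(max(k_weak, k_strong), len(s))
    return k_strong, k_weak

def enhance_features(Z, alpha, strong_gain=0.5, weak_gain=0.2, noise_damp=0.5):
    """Rescale singular values per group and reconstruct the features.

    Z     : (batch, dim) feature matrix.
    alpha : curriculum factor; alpha = 0 leaves Z unchanged.
    The gain and damping constants are hypothetical.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k_strong, k_weak = partition_spectrum(s)
    scale = np.ones_like(s)
    scale[:k_strong] += alpha * strong_gain        # amplify strong signals
    scale[k_strong:k_weak] += alpha * weak_gain    # mildly amplify weak signals
    scale[k_weak:] -= alpha * noise_damp           # suppress noise components
    Z_enhanced = (U * (s * scale)) @ Vt            # back to feature space
    return Z_enhanced, (k_strong, k_weak)

# Usage: enhance a random batch of 32 embeddings with 16 dimensions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 16))
Z_enhanced, (k_strong, k_weak) = enhance_features(Z, alpha=1.0)
```

Scaling the deviation from identity by `alpha` is one simple way to make the enhancement curriculum-dependent; the paper's actual rule and its stability guarantees may take a different form.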

Paper Structure (29 sections, 12 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of the SDE framework. The VLM backbone jointly encodes the query, consisting of an image and text, and the target text input to produce multimodal feature representations. These features are then partitioned into strong, weak, and noise subspaces via SVD, with each subspace adaptively enhanced and reconstructed back to the feature space. Finally, a dual-domain contrastive loss—comprising instance-level alignment in the feature space and structure-aware alignment in the spectral space—is applied to improve both robustness and generalization.
  • Figure 2: Cross-task generalization performance comparison between VLM2Vec and SDE. Each subplot shows the performance when trained on one meta-task, e.g., VQA, classification, or retrieval, and evaluated on other unseen tasks. Notably, SDE demonstrates superior generalization capabilities across all scenarios. Both models employ Qwen2-VL-2B as the backbone.
  • Figure 3: Qualitative examples of spectral disentanglement and enhancement. (a) Evolution of component proportions shows increasing dominance of strong signals. (b) Singular value distributions before and after enhancement, demonstrating selective amplification of meaningful features. (c) Cumulative energy distribution highlighting the concentration of semantic information in strong components.
  • Figure 4: Hyperparameter scheduling patterns during training: (a) illustrates the decay of the curriculum factor $\alpha(t)$, while (b) shows the dynamically scheduled weighting coefficient $\lambda(t)$.
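
To make the captions of Figures 1 and 4 more concrete, here is a hedged sketch of how a dual-domain loss with scheduled coefficients might be assembled. Everything specific in it is an assumption: the instance-level term is a standard one-directional InfoNCE loss, the spectral term is a stand-in that compares the normalized singular-value profiles of the query and target batches, $\alpha(t)$ is modeled as a cosine decay, and $\lambda(t)$ as a linear warm-up. The captions only state that $\alpha(t)$ decays and that $\lambda(t)$ is dynamically scheduled, so the paper's actual formulas may differ.

```python
import numpy as np

def info_nce(Q, T, temperature=0.05):
    """Instance-level contrastive loss; row i of Q matches row i of T."""
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    logits = Q @ T.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def spectral_alignment(Q, T):
    """Stand-in spectral term (assumption): squared distance between the
    normalized singular-value profiles of the two batches."""
    sq = np.linalg.svd(Q, compute_uv=False)
    st = np.linalg.svd(T, compute_uv=False)
    return float(np.sum((sq / sq.sum() - st / st.sum()) ** 2))

def alpha_schedule(t, total, alpha0=1.0):
    """Hypothetical cosine decay of the curriculum factor alpha(t)."""
    return alpha0 * 0.5 * (1.0 + np.cos(np.pi * t / total))

def lambda_schedule(t, total, lam_max=0.1, warmup=0.1):
    """Hypothetical linear warm-up of the loss weight lambda(t)."""
    return lam_max * min(1.0, t / (warmup * total))

def dual_domain_loss(Q, T, t, total):
    """Feature-space loss plus lambda(t) times the spectral-space loss."""
    return info_nce(Q, T) + lambda_schedule(t, total) * spectral_alignment(Q, T)

# Usage: loss for a batch of 8 query/target pairs halfway through training.
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 16))
T = Q + 0.1 * rng.normal(size=(8, 16))
loss = dual_domain_loss(Q, T, t=500, total=1000)
```

In a full training loop, `alpha_schedule(t, total)` would supply the curriculum factor for the enhancement step, and the loss would be computed on the enhanced features rather than on the raw embeddings.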