Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation

Xuecheng Li, Weikuan Jia, Alisher Kurbonaliev, Qurbonaliev Alisher, Khudzhamkulov Rustam, Ismoilov Shuhratjon, Eshmatov Javhariddin, Yuanjie Zheng

TL;DR

This paper tackles the difficulty of learning robust, interpretable cross-modal representations by addressing modality dominance, redundancy, and misalignment. It introduces DSRSD-Net, a Dual-Stream Residual Semantic Decorrelation Network that splits each modality into shared and private streams via a residual decomposition, and enforces decorrelation and orthogonality to structure the shared space. The framework combines a residual semantic projection, gated fusion, and a hybrid cross-modal alignment objective (contrastive plus regression) with a task-specific predictor, achieving consistent gains on two large educational benchmarks for next-step and final outcome predictions. The results demonstrate improved robustness to missing modalities and better cross-domain generalization, highlighting practical impact for learning analytics and safe deployment of multimodal systems.
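The shared/private split and gated fusion mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear projection `W_shared`, the definition of the private stream as the residual `h - shared`, and the sigmoid gate parameters `w_gate` and `b_gate` are all assumptions introduced for illustration, since the summary does not specify the exact layers.

```python
import numpy as np


def residual_split(h, W_shared):
    """Split features h of shape (batch, d) into shared and private streams.

    Assumption: the shared stream is a linear projection of h and the
    private stream is whatever the projection leaves behind (the residual).
    """
    shared = h @ W_shared          # (batch, d)
    private = h - shared           # residual, so shared + private == h
    return shared, private


def gated_fusion(shared_a, shared_b, w_gate, b_gate=0.0):
    """Blend two shared streams with an element-wise sigmoid gate.

    Assumption: the gate is computed from the concatenated shared features;
    w_gate has shape (2 * d, d).
    """
    z = np.concatenate([shared_a, shared_b], axis=1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-z))                 # values in (0, 1)
    return gate * shared_a + (1.0 - gate) * shared_b


# Hypothetical usage with random features for two modalities.
rng = np.random.default_rng(0)
d = 4
h_a = rng.standard_normal((8, d))
h_b = rng.standard_normal((8, d))
s_a, p_a = residual_split(h_a, 0.1 * rng.standard_normal((d, d)))
s_b, p_b = residual_split(h_b, 0.1 * rng.standard_normal((d, d)))
fused = gated_fusion(s_a, s_b, 0.1 * rng.standard_normal((2 * d, d)))
```

Because the private stream is defined as a residual, the two streams always sum back to the original features, so under this assumption the split itself discards no information; any separation of content has to come from the training objectives.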

Abstract

Cross-modal learning has become a fundamental paradigm for integrating heterogeneous information sources such as images, text, and structured attributes. However, multimodal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations, leading to suboptimal generalization and limited interpretability. In particular, high-variance modalities tend to overshadow weaker but semantically important signals, while naïve fusion strategies entangle modality-shared and modality-specific factors in an uncontrolled manner. This makes it difficult to understand which modality actually drives a prediction and to maintain robustness when some modalities are noisy or missing. To address these challenges, we propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net), a simple yet effective framework that disentangles modality-specific and modality-shared information through residual decomposition and explicit semantic decorrelation constraints. DSRSD-Net introduces: (1) a dual-stream representation learning module that separates intra-modal (private) and inter-modal (shared) latent factors via residual projection; (2) a residual semantic alignment head that maps shared factors from different modalities into a common space using a combination of contrastive and regression-style objectives; and (3) a decorrelation and orthogonality loss that regularizes the covariance structure of the shared space while enforcing orthogonality between shared and private streams, thereby suppressing cross-modal redundancy and preventing feature collapse. Experimental results on two large-scale educational benchmarks demonstrate that DSRSD-Net consistently improves next-step prediction and final outcome prediction over strong single-modality, early-fusion, late-fusion, and co-attention baselines.
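One plausible reading of the decorrelation and orthogonality loss in item (3) can be sketched as follows. The abstract does not give the formulas, so both functions are assumptions: `decorrelation_loss` penalizes off-diagonal entries of the correlation matrix of the shared features, and `orthogonality_loss` penalizes the squared Frobenius norm of the batch cross-product between the shared and private streams. The function names are hypothetical.

```python
import numpy as np


def decorrelation_loss(z, eps=1e-8):
    """Sum of squared off-diagonal correlations of z, shape (batch, d).

    Assumption: features are standardized per dimension, so the diagonal
    is ignored and only redundancy between dimensions is penalized.
    """
    z = z - z.mean(axis=0, keepdims=True)
    z = z / (z.std(axis=0, keepdims=True) + eps)
    corr = (z.T @ z) / z.shape[0]                  # (d, d)
    off_diag = corr - np.diag(np.diag(corr))
    return float((off_diag ** 2).sum())


def orthogonality_loss(shared, private):
    """Squared Frobenius norm of shared^T private, averaged over the batch.

    Assumption: a value of zero means every shared dimension is orthogonal
    to every private dimension across the batch.
    """
    cross = shared.T @ private / shared.shape[0]   # (d_shared, d_private)
    return float((cross ** 2).sum())


# Hypothetical usage: independent features score near zero,
# duplicated features are penalized.
rng = np.random.default_rng(0)
independent = rng.standard_normal((5000, 4))
col = rng.standard_normal((5000, 1))
redundant = np.concatenate([col, col], axis=1)

loss_independent = decorrelation_loss(independent)
loss_redundant = decorrelation_loss(redundant)
```

In a setup like this, the two penalties would typically be added to the task and alignment losses with weighting coefficients; the summary does not state which weights the authors use.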

Paper Structure

This paper contains 39 sections, 16 equations, and 5 tables.