Countering Multi-modal Representation Collapse through Rank-targeted Fusion

Seulgi Kim, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

TL;DR

This work tackles representation collapse in multi-modal fusion by introducing a rank-targeted approach that uses effective rank as a unifying objective. The Rank-enhancing Token Fuser (RTF) selectively blends less informative channels with complementary signals from another modality, provably increasing the fused representation's effective rank while preserving the dominant subspace. Building on this, the depth-informed R3D architecture demonstrates that depth is highly complementary to RGB for action anticipation, achieving state-of-the-art results on NTURGBD, UTKinect, and DARai and exhibiting robustness to noise. The method offers practical, scalable improvements for multi-modal perception tasks by fostering mutual rank gains and balanced cross-modal information exchange.
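
Since effective rank is the central quantity here, a small sketch may help make it concrete. It assumes one commonly used definition, the exponential of the Shannon entropy of the normalized singular values; the exact definition used in the paper is not given in this summary, and the function name `effective_rank` is hypothetical.

```python
import math


def effective_rank(singular_values):
    """Effective rank under an assumed definition: exp of the Shannon
    entropy of the singular values normalized to sum to one."""
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0  # an all-zero spectrum carries no directions at all
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A flat spectrum uses every direction equally; a collapsed one uses mostly one.
flat = [1.0, 1.0, 1.0, 1.0]
collapsed = [1.0, 1e-3, 1e-3, 1e-3]

print(effective_rank(flat))       # close to 4
print(effective_rank(collapsed))  # slightly above 1
```

Under this definition, a flatter spectrum gives a higher effective rank, which is why a fused representation with a flatter eigenvalue spectrum is read as less collapsed.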

Abstract

Multi-modal fusion methods often suffer from two types of representation collapse: feature collapse, where individual dimensions lose their discriminative power (as measured by eigenspectra), and modality collapse, where one dominant modality overwhelms the other. Applications such as human action anticipation, which require fusing data from diverse sensors, are hindered by both. However, existing methods counter feature collapse and modality collapse separately, because no unifying framework efficiently addresses the two in conjunction. In this paper, we posit effective rank as an informative measure that can both quantify and counter the two forms of representation collapse. We propose the Rank-enhancing Token Fuser, a theoretically grounded fusion framework that selectively blends less informative features from one modality with complementary features from another modality. We show that our method increases the effective rank of the fused representation. To address modality collapse, we evaluate modality combinations that mutually increase each other's effective rank. We show that depth maintains representational balance when fused with RGB, avoiding modality collapse. We validate our method on action anticipation, where we present R3D, a depth-informed fusion framework. Extensive experiments on NTURGBD, UTKinect, and DARai demonstrate that our approach significantly outperforms prior state-of-the-art methods by up to 3.74%. Our code is available at: https://github.com/olivesgatech/R3D.
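
The idea of selectively blending less informative features with features from another modality can be illustrated with a toy sketch. This is not the authors' Rank-enhancing Token Fuser: the informativeness score (per-channel variance over time), the blend weight `alpha`, the exchange proportion `ratio`, and the function name are all assumptions made for illustration only.

```python
from statistics import pvariance


def blend_low_variance_channels(x, y, ratio=0.2, alpha=0.5):
    """Blend the weakest channels of x with the matching channels of y.

    x, y: lists of T rows, each holding C channel values (same shape).
    Returns the fused copy of x and the sorted indices of blended channels.
    """
    num_channels = len(x[0])
    # Assumed informativeness score: variance of each channel over time.
    scores = [pvariance([row[c] for row in x]) for c in range(num_channels)]
    # Number of channels to blend, at least one.
    k = max(1, round(ratio * num_channels))
    weakest = sorted(range(num_channels), key=lambda c: scores[c])[:k]
    fused = [list(row) for row in x]
    for t, row in enumerate(fused):
        for c in weakest:
            row[c] = alpha * x[t][c] + (1 - alpha) * y[t][c]
    return fused, sorted(weakest)


# Toy data: channel 1 of x is constant, so it carries no information over time.
x = [[1.0, 0.0, 5.0, 2.0],
     [2.0, 0.0, 1.0, 4.0],
     [3.0, 0.0, 3.0, 6.0]]
y = [[0.0, 2.0, 0.0, 0.0],
     [0.0, 4.0, 0.0, 0.0],
     [0.0, 6.0, 0.0, 0.0]]

fused, blended = blend_low_variance_channels(x, y, ratio=0.25, alpha=0.5)
print(blended)  # indices of the channels that were blended
print(fused)
```

In this toy case only the constant channel is touched, and the remaining channels of `x` pass through unchanged, which mirrors the stated goal of adding complementary signal while leaving the dominant content intact.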

Paper Structure

This paper contains 36 sections, 2 theorems, 48 equations, 11 figures, 9 tables.

Key Result

Theorem 3.1

Let $u_1, \dots, u_k \in \mathbb{R}^T$ denote the top-$k$ left singular vectors of $X$, and define $\delta_k := \sigma_k - \sigma_{k+1}$ as the singular value gap, which quantifies the separation between the dominant subspace (top-$k$) and the residual space. Assume: Then $y_c$ introduces novel directions in the feature space of $X$, hence the effective rank satisfies $\mathrm{ERank}(X') > \mathr

Figures (11)

  • Figure 1: This is a toy figure describing feature and modality collapse using spectral decomposition. (e): The ideal fused representation preserves complementary eigenvectors from both data modalities. (f): In contrast, feature collapse occurs when the fused representation varies along a subset of eigenvectors. (g): Modality collapse occurs when one modality dominates and suppresses the contribution of the eigenvectors of the other modality.
  • Figure 2: This figure compares the eigenvalue spectra of each modality before (Depth - blue, RGB - red) and after fusion (green) using the formulation in Theorem 3.1. The left column shows the spectrum for the Depth modality, and the right column for RGB. Across all datasets and label granularities, the fused modality consistently exhibits a flatter spectrum in mid-to-lower components as well as the dominant ones.
  • Figure 3: This figure compares the Harmonic Mean of effective rank gain across four modalities: Multi-view RGB, Text, IMU, and Depth. Depth consistently achieves the highest harmonic mean across all observation rates, indicating a more balanced interaction with RGB compared to other modalities.
  • Figure 4: The detailed architecture of R3D. It comprises three components: the Rank-Enhancing Token Fuser (RTF), the Temporal Fuser, and the Action Anticipation Module. The RTF compensates for less informative channels in each modality by blending complementary information, while the Temporal Fuser captures continuous temporal dependencies and segments each timestamp. Finally, the Action Anticipation Module predicts future actions based on the integrated multi-modal information.
  • Figure 5: Ablation study on the UTKinect and DARai datasets examining the impact of the proportion of exchanged channels (10%, 20%, 30%) in the Token Fuser.
  • ...and 6 more figures
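
The Figure 3 caption relies on the harmonic mean of effective rank gain as a balance measure. The sketch below shows why a harmonic mean rewards balanced gains; how the paper defines each "gain" is not stated in this summary, so the inputs are hypothetical per-modality gains, and treating non-positive gains as zero is an assumption.

```python
def harmonic_mean_gain(gain_a, gain_b):
    """Harmonic mean of two effective-rank gains.

    Assumption: a non-positive gain means no mutual benefit, so return 0.
    """
    if gain_a <= 0 or gain_b <= 0:
        return 0.0
    return 2 * gain_a * gain_b / (gain_a + gain_b)


# Both pairs have the same arithmetic mean (0.5), but only one is balanced.
balanced = harmonic_mean_gain(0.5, 0.5)
lopsided = harmonic_mean_gain(0.9, 0.1)

print(balanced)  # equals the shared gain
print(lopsided)  # pulled toward the smaller gain
```

Because the harmonic mean is dominated by the smaller value, a pairing in which only one modality benefits scores low, while a pairing in which both modalities gain scores high.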

Theorems & Definitions (3)

  • Theorem 3.1: Channel Fusion Increases Effective Rank
  • Lemma A.1: Stability of Dominant Subspace; see Theorem 1 in o2023matrices
  • proof