Structured Spectral Reasoning for Frequency-Adaptive Multimodal Recommendation

Wei Yang, Rui Zhong, Yiqun Chen, Chi Lu, Peng Jiang

TL;DR

Graph-based multimodal recommenders struggle with modality noise and semantic misalignment, especially when interactions are sparse. The authors propose Structured Spectral Reasoning (SSR), a four-stage, frequency-aware framework that fuses modalities through spectral decomposition, spectral band masking, a graph-compatible hyperspectral operator, and spectral contrastive regularization. SSR combines a graph Fourier transform with equal-energy band construction, a low-rank CP-decomposed operator for cross-band interactions, and training-time perturbations that encourage the model to spread evidence across bands; the authors report improved performance and robustness, notably in cold-start scenarios. Results on three Amazon datasets show consistent gains over strong baselines, and ablations and diagnostics support the contribution of each component and give interpretable insight into band-level modality interactions.
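To make the "graph Fourier transform with equal-energy band construction" more concrete, here is a minimal NumPy sketch. It assumes that "equal-energy" means splitting the Laplacian spectrum into contiguous bands that each carry roughly the same share of the signal's spectral energy; the summary does not spell out the exact rule, so the function names (`normalized_laplacian`, `equal_energy_bands`) and the thresholding scheme are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5  # isolated nodes keep weight 0
    return np.eye(adj.shape[0]) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]

def equal_energy_bands(adj: np.ndarray, x: np.ndarray, num_bands: int):
    """Split node features x (n, d) into num_bands spectral components.

    Assumption: a frequency belongs to band b when the cumulative energy
    fraction up to and including it falls in (b / M, (b + 1) / M].
    Assumes x is not identically zero.
    """
    # eigh returns eigenvalues in ascending order (low to high frequency).
    _, eigvecs = np.linalg.eigh(normalized_laplacian(adj))
    coeffs = eigvecs.T @ x                      # graph Fourier transform
    energy = (coeffs ** 2).sum(axis=1)          # energy per frequency
    cum = np.cumsum(energy) / energy.sum()      # cumulative energy fraction
    band_of_freq = np.clip(
        np.floor(cum * num_bands - 1e-9), 0, num_bands - 1
    ).astype(int)
    # Inverse transform restricted to each band's frequencies.
    bands = np.stack([
        eigvecs[:, band_of_freq == b] @ coeffs[band_of_freq == b]
        for b in range(num_bands)
    ])
    return bands, band_of_freq

# Usage on a small ring graph with random 4-dimensional node features.
rng = np.random.default_rng(0)
n = 8
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
x = rng.normal(size=(n, 4))
bands, band_of_freq = equal_energy_bands(adj, x, num_bands=3)
```

Because the eigenvectors form an orthonormal basis, the band components sum back to the original signal, so the decomposition itself loses no information. A full eigendecomposition is only practical for small graphs; a real user-item graph would likely need a polynomial or wavelet approximation instead.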

Abstract

Multimodal recommendation aims to integrate collaborative signals with heterogeneous content such as visual and textual information, but remains challenged by modality-specific noise, semantic inconsistency, and unstable propagation over user-item graphs. These issues are often exacerbated by naive fusion or shallow modeling strategies, leading to degraded generalization and poor robustness. While recent work has explored the frequency domain as a lens to separate stable from noisy signals, most methods rely on static filtering or reweighting and lack the ability to reason over spectral structure or adapt to modality-specific reliability. To address these challenges, we propose a Structured Spectral Reasoning (SSR) framework for frequency-aware multimodal recommendation. Our method follows a four-stage pipeline: (i) Decompose graph-based multimodal signals into spectral bands via graph-guided transformations to isolate semantic granularity; (ii) Modulate band-level reliability with spectral band masking, a training-time masking scheme with a prediction-consistency objective that suppresses brittle frequency components; (iii) Fuse complementary frequency cues using hyperspectral reasoning with low-rank cross-band interaction; and (iv) Align modality-specific spectral features via contrastive regularization to promote semantic and structural consistency. Experiments on three real-world benchmarks show consistent gains over strong baselines, particularly under sparse and cold-start settings. Additional analyses indicate that structured spectral modeling improves robustness and provides clearer diagnostics of how different bands contribute to performance.
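The "low-rank cross-band interaction" in stage (iii) can be illustrated with a CP (canonical polyadic) factorization. The sketch below assumes a mixing tensor of shape (output band, input band, channel) written as a sum of R rank-one terms; this tensor layout, the factor names `A`, `B`, `C`, and both function names are hypothetical choices made for illustration, not the operator defined in the paper.

```python
import numpy as np

def cp_cross_band_mix(bands, A, B, C):
    """Mix information across bands with a CP-factorized weight tensor.

    bands: (M, n, d) band-wise node features
    A: (M, R) output-band factors
    B: (M, R) input-band factors
    C: (d, R) channel factors
    Implicit tensor: T[m, k, c] = sum_r A[m, r] * B[k, r] * C[c, r]
    Output:          out[m, i, c] = sum_k T[m, k, c] * bands[k, i, c]
    The dense tensor T is never materialized.
    """
    return np.einsum("mr,kr,cr,knc->mnc", A, B, C, bands)

def dense_cross_band_mix(bands, A, B, C):
    """Reference version that builds the full (M, M, d) tensor first."""
    T = np.einsum("mr,kr,cr->mkc", A, B, C)
    return np.einsum("mkc,knc->mnc", T, bands)

# Usage with M=4 bands, n=6 nodes, d=5 channels, rank R=2.
rng = np.random.default_rng(1)
M, n, d, R = 4, 6, 5, 2
feats = rng.normal(size=(M, n, d))
A = rng.normal(size=(M, R))
B = rng.normal(size=(M, R))
C = rng.normal(size=(d, R))
mixed = cp_cross_band_mix(feats, A, B, C)
reference = dense_cross_band_mix(feats, A, B, C)
```

Under this layout the factorized form stores R(2M + d) parameters instead of the M·M·d entries of the dense tensor, which is the usual motivation for a low-rank parameterization.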

Paper Structure

This paper contains 38 sections, 10 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall architecture of our proposed framework. The model follows a structured four-stage pipeline: (i) Decomposition performs modality-specific graph wavelet transformation to disentangle multi-frequency components; (ii) Modulation applies Spectral Band Masking (SBM) to perturb and down-weight unreliable bands in a task-adaptive manner; (iii) Fusion leverages a low-rank Graph HyperSpectral Neural Operator (G-HSNO) to reason over cross-band and cross-modal dependencies; and (iv) Alignment introduces Spectral Contrastive Regularization (SCR) to enforce semantic consistency and spectral robustness across modalities.
  • Figure 2: Ablation and sensitivity analysis. The left plot shows the impact of removing key components from SSR, validating the effectiveness of each module. The right three plots illustrate the influence of the information bottleneck weight $\lambda$, contrastive loss weight $\eta$, and the number of frequency bands $M$, confirming the stability of SSR under a range of hyperparameter settings.
  • Figure 3: t-SNE visualization of user embeddings across frequency bands. Each subfigure illustrates ID, Visual, and Textual embedding distributions under a specific band.
  • Figure 4: Left: Cross-modality center distances across frequency bands, highlighting increasing alignment in mid-frequency and modality-agnostic fusion in high-frequency regions. Right: KDE plots of frequency gating weights for cold-start vs. non-cold-start users.
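
The cross-modality center distances in Figure 4 (left) can be approximated with a simple diagnostic: average each modality's embeddings within a band and measure how far apart the resulting centers are. The sketch below assumes Euclidean distance and a dictionary of per-modality arrays of shape (bands, users, dim); the caption does not state the metric, so both choices, and the name `cross_modal_center_distances`, are assumptions.

```python
import numpy as np
from itertools import combinations

def cross_modal_center_distances(band_embs):
    """Distance between modality centroids, computed separately per band.

    band_embs: dict mapping modality name -> array of shape (M, n, d)
    Returns:   dict mapping (modality_a, modality_b) -> array of shape (M,)
    Assumption: Euclidean distance between per-band mean embeddings.
    """
    centers = {name: emb.mean(axis=1) for name, emb in band_embs.items()}
    return {
        (a, b): np.linalg.norm(centers[a] - centers[b], axis=1)
        for a, b in combinations(sorted(centers), 2)
    }

# Usage with toy embeddings: 2 bands, 5 users, 3 dimensions.
toy = {
    "id": np.zeros((2, 5, 3)),
    "visual": np.ones((2, 5, 3)),
    "textual": 2.0 * np.ones((2, 5, 3)),
}
dists = cross_modal_center_distances(toy)
```

A smaller distance in a given band suggests the modalities are more closely aligned there, which is the kind of reading the figure caption describes for mid-frequency bands.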