Modality-Guided Mixture of Graph Experts with Entropy-Triggered Routing for Multimodal Recommendation

Ji Dai, Quan Fang, Dengsheng Cai

TL;DR

MAGNET couples interaction-conditioned expert routing with structure-aware graph augmentation, so that both what to fuse and how to fuse are explicitly controlled and interpretable in multimodal fusion.

Abstract

Multimodal recommendation enhances ranking by integrating user-item interactions with item content, which is particularly effective under sparse feedback and long-tail distributions. However, multimodal signals are inherently heterogeneous and can conflict in specific contexts, making effective fusion both crucial and challenging. Existing approaches often rely on shared fusion pathways, leading to entangled representations and modality imbalance. To address these issues, we propose MAGNET, a Modality-Guided Mixture of Adaptive Graph Experts Network with Progressive Entropy-Triggered Routing for Multimodal Recommendation, designed to enhance controllability, stability, and interpretability in multimodal fusion. MAGNET couples interaction-conditioned expert routing with structure-aware graph augmentation, so that both what to fuse and how to fuse are explicitly controlled and interpretable. At the representation level, a dual-view graph learning module augments the interaction graph with content-induced edges, improving coverage for sparse and long-tail items while preserving collaborative structure via parallel encoding and lightweight fusion. At the fusion level, MAGNET employs structured experts with explicit modality roles (dominant, balanced, and complementary), enabling a more interpretable and adaptive combination of behavioral, visual, and textual cues. To further stabilize sparse routing and prevent expert collapse, we introduce a two-stage entropy-weighting mechanism that monitors routing entropy. This mechanism automatically transitions training from an early coverage-oriented regime to a later specialization-oriented regime, progressively balancing expert utilization and routing confidence. Extensive experiments on public benchmarks demonstrate consistent improvements over strong baselines.
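
The abstract describes sparse expert routing regulated by a two-stage mechanism that monitors routing entropy. The following is a minimal sketch of that idea, not the authors' implementation: the class name `EntropyTriggeredSchedule`, the `trigger` threshold on normalised entropy, the `coef` strength, and the choice of a one-way switch from rewarding entropy (coverage) to penalising it (specialization) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Keep the k highest-scoring experts and renormalise their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))


def routing_entropy(probs):
    """Shannon entropy (in nats) of one routing distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyTriggeredSchedule:
    """Hypothetical two-stage schedule driven by mean routing entropy."""

    def __init__(self, num_experts, trigger=0.9, coef=0.01):
        # Assumes num_experts >= 2 so that the maximum entropy is non-zero.
        self.max_entropy = math.log(num_experts)  # entropy of uniform routing
        self.trigger = trigger  # assumed threshold on normalised entropy
        self.coef = coef        # assumed regulariser strength
        self.stage = 1          # 1 = coverage, 2 = specialization

    def loss_term(self, batch_probs):
        """Return the entropy term to add to the training loss for one batch."""
        mean_h = sum(routing_entropy(p) for p in batch_probs) / len(batch_probs)
        # One-way switch: once coverage is reached, stay in stage 2.
        if self.stage == 1 and mean_h / self.max_entropy >= self.trigger:
            self.stage = 2
        # Stage 1 rewards high entropy; stage 2 penalises it.
        sign = -1.0 if self.stage == 1 else 1.0
        return sign * self.coef * mean_h
```

In this sketch, the term is negative during stage 1, so minimising the loss spreads traffic across experts, and positive after the trigger, so minimising it sharpens routing. The paper's actual trigger condition and weighting may differ.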

Paper Structure

This paper contains 65 sections, 34 equations, 14 figures, 7 tables, 2 algorithms.

Figures (14)

  • Figure 1: Illustration of how consumers integrate multiple signals to make purchasing decisions: the dress looks visually appealing, but negative reviews mention discomfort, and her past experience with the same brand was neutral. The final choice reflects a balance of these factors.
  • Figure 2: Overview of our proposed MAGNET framework. The first row presents the end-to-end pipeline: (I) inputs user-item interactions and item-side visual/text features; (II) constructs content-induced edges via similarity and KNN retrieval to augment the graph; (III) performs dual-view encoding on observed and augmented views and fuses them into unified user/item representations; (IV) applies a routing-based sparse triplet MoE as the prediction head, routing each query to a sparse set of experts and aggregating their outputs under a unified training objective. The second row provides complementary details: (A) illustrates the triplet-template expert pool covering behavior/appearance/semantics patterns, and (B) shows the progressive entropy-guided routing schedule that transitions from exploration to specialization during training.
  • Figure 3: Hyper-parameter sensitivity of MAGNET-DV with respect to the triplet-template controls $(\alpha,\beta,\delta)$. Each subplot adopts a zoomed y-axis range to reveal subtle yet consistent performance variations. Hollow markers indicate the default setting used in all main experiments. We sweep each control over a discrete set.
  • Figure 4: Sensitivity of MAGNET to expert capacity $E$ and Top-$K$ routing (metric: $N@20$). Upper: family-combination under $E\le 9$ (E1-E5). Lower: expert-splitting with $E{=}9p$ (E1-E5). K1-K6 denote Top-$K$ routing with $K\in\{1,\dots,6\}$; invalid cells with $K>E$ are marked as "$\backslash$".
  • Figure 5: Analysis of 9-expert routing and usage patterns across domains and modalities. Left: Expert usage radar over a 9-expert pool. Middle: Modality reliance and coverage statistics. Right: Fusion regime composition and diversity metrics.
  • ...and 9 more figures
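
Step (II) of the Figure 2 pipeline, building content-induced edges via similarity and KNN retrieval, can be sketched roughly as follows. This is an illustrative assumption rather than the paper's procedure: it uses cosine similarity over toy item feature vectors and emits directed item-to-item edges, and the function names `cosine` and `knn_content_edges` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_content_edges(features, k):
    """Link each item to its k most similar items by content features.

    Returns directed (source, neighbour) index pairs. Ties are broken by
    the lower item index so the output is deterministic.
    """
    edges = []
    for i, feat_i in enumerate(features):
        scored = [
            (cosine(feat_i, feat_j), j)
            for j, feat_j in enumerate(features)
            if j != i  # no self-loops
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        edges.extend((i, j) for _, j in scored[:k])
    return edges


# Toy content features (assumed): items 0 and 1 are alike, as are 2 and 3.
item_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
content_edges = knn_content_edges(item_features, k=1)
```

The resulting edges would then be added alongside the observed interactions to form the augmented view. This brute-force search is quadratic in the number of items, so a real catalogue would likely need an approximate nearest-neighbour index.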