User-Aware Conditional Generative Total Correlation Learning for Multi-Modal Recommendation

Jing Du, Zesheng Ye, Congbo Ma, Feng Liu, Flora D. Salim

Abstract

Multi-modal recommendation (MMR) enriches item representations with item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet the dominant practice of disentangling modality-invariant, preference-driving signals from modality-specific, preference-irrelevant noise is flawed in two ways. First, it assumes a one-size-fits-all relevance of item content to preferences across all users, which contradicts the user-conditional nature of preferences. Second, it optimizes pairwise contrastive losses separately for cross-modal alignment, systematically ignoring the higher-order dependencies that arise when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction-guided diffusion model to perform user-aware content feature filtering, preserving only the personalized features relevant to each individual user. Furthermore, to capture complete cross-modal dependencies, we optimize a tractable lower bound of the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show that GTC consistently outperforms state-of-the-art methods, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference-driven feature filtering and total correlation optimization, confirming the ability of GTC to model user-conditional relationships in MMR tasks. The code is available at: https://github.com/jingdu-cs/GTC.
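
To illustrate why pairwise objectives can miss higher-order dependencies, recall the standard definition of total correlation for variables X_1, ..., X_n: TC(X_1, ..., X_n) = sum_i H(X_i) - H(X_1, ..., X_n), which reduces to mutual information when n = 2. The sketch below is not the GTC objective (the paper optimizes a tractable lower bound on learned representations). It only computes the exact quantity on a toy discrete distribution, assuming three hypothetical binary "modalities" where the third is the XOR of the first two. All function and variable names are illustrative.

```python
import math
from collections import Counter
from itertools import product


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def total_correlation(rows):
    """Sum of marginal entropies minus the joint entropy."""
    columns = list(zip(*rows))
    return sum(entropy(col) for col in columns) - entropy(rows)


def mutual_information(rows, i, j):
    """Pairwise mutual information: total correlation of two variables."""
    return total_correlation([(r[i], r[j]) for r in rows])


# Toy distribution (an assumption for illustration): two fair bits and their XOR.
rows = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]

tc = total_correlation(rows)
pairwise = [mutual_information(rows, i, j) for i, j in [(0, 1), (0, 2), (1, 2)]]

print("total correlation (bits):", tc)
print("pairwise mutual information (bits):", pairwise)
```

In this toy case every pair of variables is independent, so each pairwise mutual information is zero, while the total correlation is one bit. The dependency exists only among all three variables jointly, which is the kind of structure a purely pairwise alignment loss cannot see.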

Paper Structure

This paper contains 39 sections, 12 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The representations from user-item interactions exhibit low cosine similarity (left) and high Euclidean distance (right) to representations from visual and textual content modalities on the Amazon Sports dataset. In contrast, visual and textual representations are more aligned, implying a gap between latent user preference and explicit item attributes.
  • Figure 2: The user-conditional nature of "appealing" feature relevance. The item on the right appeals to a user who likes "orange" and "cotton", while a different user is more interested in the brand ("AW Apparel") and style, for whom color and fabric are less important. Thus, what constitutes a preference-driving signal is not only determined by the item attributes themselves, but also by individual user preference.
  • Figure 3: Overall illustration of the proposed GTC framework.
  • Figure 4: Impact of content features on the Sports (top), Baby (middle), and Cell (bottom) datasets.
  • Figure 5: Modality balance trend during GTC training.
  • ...and 2 more figures
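
The two measures named in the Figure 1 caption, cosine similarity and Euclidean distance, can be sketched as follows. This is a minimal illustration under stated assumptions: the three vectors are hypothetical toy values chosen by hand, not embeddings or results from the paper, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy embeddings of one item (made up for illustration only).
interaction = [1.0, 0.0]
visual = [0.0, 1.0]
textual = [0.0, 2.0]

cos_content = cosine_similarity(visual, textual)
cos_cross = cosine_similarity(interaction, visual)
dist_content = euclidean_distance(visual, textual)
dist_cross = euclidean_distance(interaction, visual)

print("visual vs textual:", cos_content, dist_content)
print("interaction vs visual:", cos_cross, dist_cross)
```

With these toy values the two content vectors point in the same direction and lie close together, while the interaction vector is orthogonal to them and farther away, mirroring the qualitative pattern the caption describes.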