Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

Luyi Ma, Xiaohan Li, Zezhong Fan, Kai Zhao, Jianpeng Xu, Jason Cho, Praveen Kanumala, Kaushiki Nag, Sushant Kumar, Kannan Achan

TL;DR

The paper tackles multi-behavior recommendations by fusing visual, textual, and graph modalities into a single LLM-based recommender. It introduces the Triple Modality Fusion (TMF) framework, featuring All-Modality Self-Attention (AMSA) and Cross-Modality Attention (CMA) to align multi-source signals in a shared embedding space, along with modality-aware prompts and instruction tuning via LoRA. Empirical results on Walmart datasets show TMF achieves significant improvements over traditional, graph-based, and prior LLM-based baselines, with extensive ablations and human evaluations validating the design. The work demonstrates the practical viability and effectiveness of integrating triple modalities for enhanced personalization and behavior modeling in production settings.

Abstract

Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on a single data source, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations that fuses three modalities (visual, textual, and graph data) through alignment with large language models (LLMs). Visual information captures contextual and aesthetic item characteristics; textual data provides detailed insights into user interests and item features; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, called Triple Modality Fusion (TMF), uses LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user's interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural language-based prompts. We then devise a modality fusion module based on cross-attention and self-attention mechanisms that maps the modality embeddings produced by other models into the same embedding space and incorporates them into the LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate our model design and the benefits of TMF.
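
As a rough illustration of the fusion step described in the abstract, the NumPy sketch below projects one item's image, text, and graph embeddings into a shared space, mixes them with self-attention, and then applies cross-attention to produce a single fused vector. It is not the authors' implementation. The function names, the dimensions, the random projection matrices, the omission of learned query/key/value weights, and the choice of the text token as the cross-attention query are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention without learned weights (a simplification).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def fuse(image_vec, text_vec, graph_vec, projections):
    # Map each modality into the shared embedding space (hypothetical linear maps).
    tokens = np.stack([
        image_vec @ projections["image"],
        text_vec @ projections["text"],
        graph_vec @ projections["graph"],
    ])  # shape: (3, shared_dim)

    # Self-attention over all three modality tokens.
    mixed = attention(tokens, tokens, tokens)  # shape: (3, shared_dim)

    # Cross-attention: the text token (assumed query) attends to image and graph.
    query = mixed[1:2]          # shape: (1, shared_dim)
    context = mixed[[0, 2]]     # shape: (2, shared_dim)
    fused = attention(query, context, context)
    return fused[0]             # shape: (shared_dim,)

rng = np.random.default_rng(0)
shared_dim = 8
dims = {"image": 16, "text": 12, "graph": 6}  # illustrative encoder output sizes
projections = {
    name: rng.normal(size=(d, shared_dim)) / np.sqrt(d)
    for name, d in dims.items()
}
item_token = fuse(
    rng.normal(size=dims["image"]),
    rng.normal(size=dims["text"]),
    rng.normal(size=dims["graph"]),
    projections,
)
print(item_token.shape)  # (8,)
```

In a setup like the one the abstract describes, a fused vector of this kind would stand in for an item inside the LLM prompt. How the paper actually injects such vectors is not specified in this summary.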

Paper Structure

This paper contains 22 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The Triple Modality Fusion (TMF) framework for multi-behavior recommendation. Blue squares denote frozen models; red ones are updated during the training steps.
  • Figure 2: The Modality Fusion module in Figure 1.
  • Figure 3: Task complexity and prompt examples: (a) the text-only task uses only the text prompt, with item names and behaviors as user context. (b) and (c) are more challenging tasks that gradually introduce more modalities and tokens into the prompts.
  • Figure 4: Case study on the Sports and Electronics datasets demonstrating TMF's capacity for context understanding (shopping topics, user age and gender requirements, design, etc.) and reasoning in next-purchase prediction.
  • Figure 5: Distribution of Human Rating Scores in a bubble chart.