Collaborative Cross-modal Fusion with Large Language Model for Recommendation

Zhongzhou Liu; Hao Zhang; Kuicai Dong; Yuan Fang

TL;DR

The paper tackles the gap in LLM-based recommendations where semantic knowledge from textual attributes and collaborative signals from user-item interactions are not fully exploited together. It introduces CCF-LLM, which translates interactions into a hybrid prompt and uses an attentive cross-modal fusion (mapping via ALG and fusion via GATE) to align CF embeddings with LLM space, trained in a two-stage LoRA-tuned regime. Across two public datasets, CCF-LLM consistently outperforms conventional CF baselines and existing LLM4Rec methods, with ablations confirming the importance of dimension-wise fusion and staged training. The approach demonstrates that jointly modeling semantic and collaborative cues yields significant CTR improvements and provides a robust framework for integrating multiple modalities in recommendation systems.
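The mapping-then-fusion idea can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single linear mapping layer, the sigmoid gate over the concatenated embeddings, the toy dimensions, and every function and variable name below are assumptions made for illustration only.

```python
import numpy as np

def map_cf_to_llm(cf_emb, W, b):
    # Hypothetical mapping step: one linear layer projecting a CF embedding
    # into the (larger) LLM embedding dimension.
    return cf_emb @ W + b

def gated_fusion(aligned_cf, llm_emb, Wg, bg):
    # Hypothetical dimension-wise gate: one weight in [0, 1] per embedding
    # dimension, computed from the concatenation of both modalities.
    gate_in = np.concatenate([aligned_cf, llm_emb], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ Wg + bg)))
    # Each dimension is a convex combination of the two modalities.
    fused = gate * aligned_cf + (1.0 - gate) * llm_emb
    return fused, gate

# Toy sizes and random parameters (assumed; real models use learned weights).
rng = np.random.default_rng(0)
d_cf, d_llm = 4, 8
cf_emb = rng.normal(size=d_cf)        # embedding from a CF model
llm_emb = rng.normal(size=d_llm)      # token embedding from the LLM
W, b = rng.normal(size=(d_cf, d_llm)), np.zeros(d_llm)
Wg, bg = rng.normal(size=(2 * d_llm, d_llm)), np.zeros(d_llm)

aligned = map_cf_to_llm(cf_emb, W, b)
fused, gate = gated_fusion(aligned, llm_emb, Wg, bg)
```

Because the gate is a vector rather than a scalar, each embedding dimension can lean toward the collaborative or the semantic signal independently, which is one plausible reading of "dimension-wise fusion".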

Abstract

Despite the success of conventional collaborative filtering (CF) approaches for recommendation systems, they exhibit limitations in leveraging semantic knowledge within the textual attributes of users and items. Recent focus on the application of large language models for recommendation (LLM4Rec) has highlighted their capability for effective semantic knowledge capture. However, these methods often overlook the collaborative signals in user behaviors. Some simply instruct-tune a language model, while others directly inject the embeddings of a CF-based model, lacking a synergistic fusion of different modalities. To address these issues, we propose a framework of Collaborative Cross-modal Fusion with Large Language Models, termed CCF-LLM, for recommendation. In this framework, we translate the user-item interactions into a hybrid prompt to encode both semantic knowledge and collaborative signals, and then employ an attentive cross-modal fusion strategy to effectively fuse latent embeddings of both modalities. Extensive experiments demonstrate that CCF-LLM outperforms existing methods by effectively utilizing semantic and collaborative signals in the LLM4Rec context.
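As a rough illustration of what a hybrid prompt might look like, the sketch below mixes textual attributes with placeholder tokens whose embeddings would later be replaced by collaborative embeddings. The template wording, the placeholder tokens `<USER_EMB>` and `<ITEM_EMB>`, and the function name are hypothetical and are not taken from the paper.

```python
def build_hybrid_prompt(history_titles, target_title,
                        user_token="<USER_EMB>", item_token="<ITEM_EMB>"):
    # Textual attributes (item titles) carry the semantic knowledge.
    history = ", ".join(history_titles)
    # Placeholder tokens mark positions where embeddings derived from a
    # CF model would be injected (assumed mechanism, for illustration).
    return (
        f"The user {user_token} has interacted with: {history}. "
        f'Will the user enjoy the item "{target_title}" {item_token}? '
        "Answer Yes or No."
    )

prompt = build_hybrid_prompt(["Toy Story", "Up"], "Coco")
```

Keeping the placeholders as ordinary strings means the prompt can be tokenized as usual, with the placeholder positions located afterwards and their embeddings swapped out.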

Paper Structure (39 sections, 7 equations, 4 figures, 5 tables)

Introduction
Related Work
CF-based Recommendation System
LLM4Rec Approaches
Methodology
Preliminaries
Task Formulation
Conventional CF-based recommendation.
Hybrid Prompt Translation
Attentive Cross-modal Fusion Strategy
Mapping Phase
Fusion Phase
Training
Learning Objectives
Two-stage Training
...and 24 more sections

Figures (4)

Figure 1: An illustration of heterogeneous characteristics between the semantic knowledge from LLMs and the collaborative signals from conventional recommendation systems.
Figure 2: The overall framework of the proposed Collaborative Cross-modal Fusion with Large Language Model (CCF-LLM).
Figure 3: Ablation study on cross-modal fusion strategies.
Figure 4: Visualization of the item embeddings in different modalities with t-SNE. Green: Aligned CF embeddings; Yellow: LLM embeddings; Purple: fused embeddings.
