Federated Vision-Language-Recommendation with Personalized Fusion
Zhiwei Li, Guodong Long, Jing Jiang, Chengqi Zhang, Qiang Yang
TL;DR
FedVLR tackles personalized multimodal fusion for on-device Vision-Language-Recommendation in a federated setting. It decouples server-side view generation from client-side refinement via a Bi-Level Fusion Mechanism that uses diverse fusion operators and a Mixture-of-Experts router to tailor representations to individual user histories. The approach is supported by convergence and complexity analyses and validated across seven diverse datasets, where it outperforms standard federated baselines and rivals centralized models in low-data regimes, while preserving privacy. This framework enables privacy-preserving, content-aware recommendations by leveraging visual, textual, and collaborative signals with efficient on-device personalization.
Abstract
Applying large pre-trained Vision-Language Models to recommendation is a burgeoning field, a direction we term Vision-Language-Recommendation (VLR). Bringing VLR to user-oriented on-device intelligence within a federated learning framework is a crucial step for enhancing user privacy and delivering personalized experiences. This paper introduces FedVLR, a federated VLR framework designed for user-specific personalized fusion of vision-language representations. At its core is a novel bi-level fusion mechanism: the server-side multi-view fusion module first generates a diverse set of pre-fused multimodal views, and each client then employs a user-specific mixture-of-experts mechanism to adaptively integrate these views based on the individual user's interaction history. This lightweight personalized fusion module offers an efficient way to implement a federated VLR system. The effectiveness of FedVLR has been validated on seven benchmark datasets.
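
To make the bi-level fusion idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes, purely for illustration, three fusion operators on the server (sum, element-wise product, element-wise max) and a linear softmax gate on the client. The names server_views, client_fuse, user_vec, and router_weights are hypothetical, and federated training of the router is omitted.

```python
import math

def server_views(visual, textual):
    """Server side: build several pre-fused views of one item.
    The three operators are illustrative assumptions, not the paper's set."""
    return [
        [v + t for v, t in zip(visual, textual)],      # additive fusion
        [v * t for v, t in zip(visual, textual)],      # element-wise product
        [max(v, t) for v, t in zip(visual, textual)],  # element-wise max
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def client_fuse(views, user_vec, router_weights):
    """Client side: mix the server views with a user-specific gate.
    router_weights holds one row per view (hypothetical linear router);
    user_vec stands in for an embedding of the user's interaction history."""
    logits = [sum(w * u for w, u in zip(row, user_vec)) for row in router_weights]
    gates = softmax(logits)
    dim = len(views[0])
    fused = [sum(g * view[i] for g, view in zip(gates, views)) for i in range(dim)]
    return fused, gates

# Toy usage: one item with 2-d visual and textual features.
views = server_views([1.0, 2.0], [3.0, 4.0])
# A router favouring the first view for this particular user.
fused, gates = client_fuse(views, [1.0, 0.0], [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
```

In this sketch only the small router would need to live on the device, while the views are computed once on the server, which is one way to read the abstract's description of the client-side module as lightweight.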
