Federated Vision-Language-Recommendation with Personalized Fusion

Zhiwei Li, Guodong Long, Jing Jiang, Chengqi Zhang, Qiang Yang

TL;DR

FedVLR tackles personalized multimodal fusion for on-device Vision-Language-Recommendation in a federated setting. It decouples server-side view generation from client-side refinement via a Bi-Level Fusion Mechanism that uses diverse fusion operators and a Mixture-of-Experts router to tailor representations to individual user histories. The approach is supported by convergence and complexity analyses and validated across seven diverse datasets, where it outperforms standard federated baselines and rivals centralized models in low-data regimes, while preserving privacy. This framework enables privacy-preserving, content-aware recommendations by leveraging visual, textual, and collaborative signals with efficient on-device personalization.

Abstract

Applying large pre-trained Vision-Language Models to recommendation is a burgeoning field, a direction we term Vision-Language-Recommendation (VLR). Bringing VLR to user-oriented on-device intelligence within a federated learning framework is a crucial step toward enhancing user privacy and delivering personalized experiences. This paper introduces FedVLR, a federated VLR framework designed specifically for user-specific personalized fusion of vision-language representations. At its core is a novel bi-level fusion mechanism: the server-side multi-view fusion module first generates a diverse set of pre-fused multimodal views. Subsequently, each client employs a user-specific mixture-of-experts mechanism to adaptively integrate these views based on the individual user's interaction history. This lightweight personalized fusion module offers an efficient way to implement a federated VLR system. We validate the effectiveness of FedVLR on seven benchmark datasets.
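
As a rough illustration of the bi-level idea described in the abstract, the following Python sketch builds a few pre-fused views on the "server" and mixes them with a softmax gate on the "client". The fusion operators (sum, element-wise product, element-wise max), the linear router, and every function and variable name here are assumptions made for illustration. The abstract does not specify them, and this is not the authors' implementation.

```python
import math


def server_views(visual, text, item_id):
    """Server side: build several pre-fused views of one item.

    The three operators below are hypothetical stand-ins for the
    "diverse fusion operators" mentioned in the summary.
    """
    triples = list(zip(visual, text, item_id))
    return [
        [v + t + d for v, t, d in triples],     # additive fusion
        [v * t * d for v, t, d in triples],     # multiplicative fusion
        [max(v, t, d) for v, t, d in triples],  # max-pooling fusion
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def client_fuse(views, user_vec, router):
    """Client side: gate the server's views with a user-specific router.

    `user_vec` stands in for a representation of the user's private
    interaction history; `router` holds one weight row per view and is
    assumed to stay on the device.
    """
    logits = [sum(w * u for w, u in zip(row, user_vec)) for row in router]
    gates = softmax(logits)
    dim = len(views[0])
    fused = [sum(g * view[i] for g, view in zip(gates, views)) for i in range(dim)]
    return fused, gates


if __name__ == "__main__":
    views = server_views([1.0, 2.0], [0.5, -1.0], [2.0, 0.0])
    # A router that strongly favors the first view for this user.
    router = [[10.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    fused, gates = client_fuse(views, [1.0, 0.0], router)
    print(gates, fused)
```

In this sketch only the gate weights differ between users, which is one simple way to keep the client-side module lightweight while the heavier view construction stays on the server.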

Paper Structure

This paper contains 37 sections, 4 theorems, 29 equations, 10 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Let the number of participating clients per round be $n_s$. After $T$ communication rounds with learning rate $\eta$, the algorithm satisfies:

Figures (10)

  • Figure 1: Paradigm shift from a monolithic generic fusion (Left) to our personalized fusion (Right), enabling fine-grained on-device personalization by decoupling server-side view generation from client-side refinement.
  • Figure 2: The framework of FedVLR. It comprises two components: (1) Server-Side Multi-View Fusion, which generates diverse pre-fused feature views from visual-language content, and (2) Client-Side Personalized Refinement, which dynamically combines these views based on the user's private interaction history for on-device VLR.
  • Figure 3: Analysis of user characteristics learned by FedRAP enhanced with FedVLR on KU: (a) User distribution by interaction count; (b) User modality preference; (c) User activity heterogeneity distribution; (d) Performance trend across user groups.
  • Figure 4: Impact on performance of removing visual ($\mathbf{V}$), textual ($\mathbf{C}$), or collaborative ID ($\mathbf{D}$) features on KU. The degree of performance degradation varies across frameworks, showing that there is no universal hierarchy of modality importance and motivating the need for a personalized fusion mechanism in federated VLR tasks.
  • Figure 5: Algorithmic workflow of FedVLR, illustrating a clear division of labor: the server offloads heavy computation by generating diverse feature views, while clients perform lightweight on-device personalization. The returned gradients form a collaborative loop that continually refines the global model.
  • ...and 5 more figures
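
The collaborative loop in the Figure 5 caption (clients return gradients, the server refines the global model) can be sketched as a generic federated SGD round with partial participation. This is a toy under stated assumptions, not the paper's Algorithm 1: the quadratic client loss, the plain gradient averaging, and the names `client_gradient` and `run_round` are all hypothetical. Only the symbols $n_s$ (clients per round) and $\eta$ (learning rate) come from the statement of Theorem 1.

```python
import random


def client_gradient(theta, private_target):
    """Gradient of the toy local loss 0.5 * ||theta - target||^2.

    `private_target` stands in for whatever a client derives from its
    private interaction history; it never leaves the client.
    """
    return [t - y for t, y in zip(theta, private_target)]


def run_round(theta, clients, n_s, eta, rng):
    """One communication round: sample n_s clients, average their
    gradients, and take a step of size eta on the global parameters."""
    sampled = rng.sample(sorted(clients), n_s)
    grads = [client_gradient(theta, clients[c]) for c in sampled]
    avg = [sum(g[i] for g in grads) / n_s for i in range(len(theta))]
    return [t - eta * a for t, a in zip(theta, avg)]


if __name__ == "__main__":
    rng = random.Random(0)
    clients = {"u1": [1.0, 1.0], "u2": [3.0, 5.0], "u3": [2.0, 0.0]}
    theta = [0.0, 0.0]
    for _ in range(50):  # T = 50 rounds in this toy run
        theta = run_round(theta, clients, n_s=2, eta=0.1, rng=rng)
    print(theta)
```

With all clients participating and $\eta = 1$, a single round of this toy moves the global parameters to the mean of the client targets, which makes the averaging step easy to check by hand.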

Theorems & Definitions (8)

  • Theorem 1: Convergence of FedVLR
  • Lemma 1: Local Parameter Drift
  • Proof of Lemma 1: Local Update Bound
  • Lemma 2: Gradient Bias Bound
  • Proof of Lemma 2: Gradient Bias Bound
  • Lemma 3: Aggregated Gradient Second Moment Bound
  • Proof of Lemma 3: Aggregated Gradient Second Moment
  • Proof of Theorem 1: Convergence Rate