Personalized Parameter-Efficient Fine-Tuning of Foundation Models for Multimodal Recommendation

Sunwoo Kim; Hyunjin Hwang; Kijung Shin

TL;DR

This paper tackles the problem that item embeddings produced by multimodal foundation models are not conditioned on user interests, which limits personalization in multimodal recommendation. It introduces PerPEFT, a personalized PEFT framework that partitions users into interest-based groups and assigns a distinct PEFT module to each group, enabling group-specific emphasis on item aspects while keeping the foundation model itself fixed. A specialized training strategy using group-specific negative sampling further encourages each module to learn purchase-relevant, group-aligned item features. Across four real-world datasets and multiple PEFT variants, PerPEFT consistently outperforms baselines (e.g., Global PEFT), with gains of up to 15.3% in NDCG@20, while the added parameters amount to only about 1.3% of the foundation model's parameter count. The work demonstrates that personalized, group-wise PEFT modules can improve multimodal recommendation with modest computational overhead, and it provides code and datasets for reproducibility.

Abstract

In recent years, substantial research has integrated multimodal item metadata into recommender systems, often by using pre-trained multimodal foundation models to encode such data. Since these models are not originally trained for recommendation tasks, recent works efficiently adapt them via parameter-efficient fine-tuning (PEFT). However, even with PEFT, item embeddings from multimodal foundation models remain user-blind: they are not conditioned on user interests, even though users with diverse interests attend to different item aspects. To address this limitation, we propose PerPEFT, a personalized PEFT strategy for multimodal recommendation. Specifically, PerPEFT groups users by interest and assigns a distinct PEFT module to each group, enabling each module to capture the fine-grained item aspects most predictive of that group's purchase decisions. We further introduce a specialized training technique that strengthens this user-group conditioning. Notably, PerPEFT is PEFT-agnostic and can be paired with any PEFT method applicable to multimodal foundation models. Through extensive experiments, we show that PerPEFT (1) outperforms the strongest baseline by up to 15.3% (NDCG@20) and (2) delivers consistent gains across diverse PEFT variants. Even with personalization, PEFT remains lightweight, adding parameters that amount to only 1.3% of the foundation model's parameter count. We provide our code and datasets at https://github.com/kswoo97/PerPEFT.
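The group-wise routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the per-dimension scale vectors stand in for real PEFT modules, and the names `BASE_ITEM_EMB`, `GROUP_ADAPTER`, `USER_GROUP`, and `encode_item_for_user`, along with all numeric values and the precomputed user-to-group assignment, are assumptions made for illustration only.

```python
from typing import Dict, List

# Frozen "foundation model" output per item (toy 3-d vectors, hypothetical values).
BASE_ITEM_EMB: Dict[str, List[float]] = {
    "golf_travel_bag": [1.0, 1.0, 1.0],
    "tent": [0.0, 2.0, 1.0],
}

# One lightweight module per interest group. Each "module" here is only a
# per-dimension scale vector, a stand-in for a real PEFT module.
GROUP_ADAPTER: Dict[int, List[float]] = {
    0: [2.0, 1.0, 1.0],  # e.g., a golf-interest group
    1: [1.0, 2.0, 0.5],  # e.g., a camping-interest group
}

# Hypothetical, precomputed user -> interest-group assignment.
USER_GROUP: Dict[str, int] = {"u1": 0, "u2": 1}


def encode_item_for_user(user: str, item: str) -> List[float]:
    """Encode an item with the module of the user's group; the base stays frozen."""
    scale = GROUP_ADAPTER[USER_GROUP[user]]
    base = BASE_ITEM_EMB[item]
    return [b * s for b, s in zip(base, scale)]


# The same item receives different embeddings for users in different groups.
emb_u1 = encode_item_for_user("u1", "golf_travel_bag")
emb_u2 = encode_item_for_user("u2", "golf_travel_bag")
```

In this sketch only the small per-group modules differ between users, which mirrors the idea that the shared foundation model is kept fixed while each group gets its own lightweight parameters.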

Paper Structure (71 sections, 2 equations, 8 figures, 9 tables)

Introduction
Related work and preliminary
Related work
Multimodal recommendation
Parameter-efficient fine-tuning (PEFT)
Preliminary
Proposed method
Naive approach: Global PEFT
Item encoding
Recommendation process
Proposed method: PerPEFT
High-level idea.
User grouping
Multimodal information encoding.
Final item embeddings and recommendation process
...and 56 more sections

Figures (8)

Figure 1: PerPEFT guides CLIP (Radford et al., 2021), a multimodal foundation model, to focus on item aspects aligned with each user group's interests. For a golf travel bag, CLIP personalized to a golf-interest group attends to 'Golf', while CLIP personalized to a camping-interest group focuses on 'Travel' and 'Bag'.
Figure 2: An example case of (A) Global PEFT and (B) PerPEFT, our personalized PEFT for multimodal recommendation. Each item’s multimodal information is encoded by CLIP, a multimodal foundation model, with an attached PEFT module. Unlike Global PEFT, which uses the same PEFT module for users $u_{1}$ and $u_{2}$, PerPEFT employs different PEFT modules according to their inferred interest groups. Subsequently, we form each item embedding by summing the multimodal and transductive embeddings. To generate recommendations for a user, we construct the sequence of item embeddings in the same order as the user's purchase history and feed it into SASRec, the backbone recommender system.
Figure 3: An example case of our training technique for PerPEFT. When training for Group A, we draw negative samples only from items appearing in Group A users’ purchase histories, instead of the full item set. The same holds for Group B.
Figure 4: (RQ2) Scalability analysis of PerPEFT. Even with additional PEFT modules, PerPEFT incurs only a slight increase in training time relative to Global PEFT. Moreover, the parameters introduced by PerPEFT are only 1.3% of those of the multimodal foundation model (CLIP; Radford et al., 2021).
Figure 6: (RQ4) Achievement of PerPEFT's design objective. PEFT modules specialized for different groups guide CLIP (Radford et al., 2021), a multimodal foundation model, to focus on different aspects of the same item.
...and 3 more figures
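
The group-specific negative sampling described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `build_group_item_pools` and `sample_negatives`, the toy purchase histories, and the choice to exclude a user's own purchased items from the candidates are all assumptions.

```python
import random
from typing import Dict, List, Set


def build_group_item_pools(
    histories: Dict[str, List[str]], user_group: Dict[str, int]
) -> Dict[int, Set[str]]:
    """Collect, per group, every item appearing in its users' purchase histories."""
    pools: Dict[int, Set[str]] = {}
    for user, items in histories.items():
        pools.setdefault(user_group[user], set()).update(items)
    return pools


def sample_negatives(
    user: str,
    histories: Dict[str, List[str]],
    user_group: Dict[str, int],
    pools: Dict[int, Set[str]],
    k: int,
    rng: random.Random,
) -> List[str]:
    """Draw up to k negatives from the user's group pool, not the full item set."""
    # Assumption: items the user already purchased are not valid negatives.
    candidates = sorted(pools[user_group[user]] - set(histories[user]))
    return rng.sample(candidates, min(k, len(candidates)))


# Toy data (hypothetical): u1 and u2 form Group 0, u3 forms Group 1.
histories = {"u1": ["a", "b"], "u2": ["b", "c"], "u3": ["x", "y"]}
user_group = {"u1": 0, "u2": 0, "u3": 1}
pools = build_group_item_pools(histories, user_group)
negatives_u1 = sample_negatives("u1", histories, user_group, pools, 5, random.Random(0))
```

Restricting candidates to the group's pool means the negatives for `u1` can never include `x` or `y`, which only appear in another group's histories.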
