Bridging Domain Gaps between Pretrained Multimodal Models and Recommendations
Wenyu Zhang, Jie Luo, Xinming Zhang, Yuan Fang
TL;DR
This work tackles the domain gap between pretrained multimodal encoders and personalized recommendation by proposing PTMRec, a two-stage parameter-efficient framework that avoids costly additional pretraining. In the first stage, a frozen CLIP backbone extracts multimodal item features, and a lightweight recommender is trained with the BPR objective to capture user preferences. The second stage introduces knowledge-guided prompts into the CLIP encoders and applies in-batch knowledge transfer (KT) with a KL-divergence loss to align modal features with personalized interaction patterns, enabling effective domain adaptation at minimal training cost. Experiments on three Amazon domains (Baby, Sports, Clothing) show that PTMRec improves Recall and NDCG, especially for simpler base models, and that both the two-stage design and the KT loss are essential for bridging the domain gap while maintaining efficiency.
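The two training signals named above, the BPR objective and a KL-based in-batch knowledge transfer loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the exact formulation, so the use of in-batch dot-product similarity distributions, the teacher-to-student direction of the KL term, the `temperature` parameter, and all function names below are assumptions made purely for illustration.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean BPR loss: -log(sigmoid(pos - neg)) averaged over (pos, neg) pairs."""
    total = 0.0
    for p, n in zip(pos_scores, neg_scores):
        x = n - p
        # softplus(x) = log(1 + exp(x)) equals -log(sigmoid(p - n)); stable form.
        total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(pos_scores)


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [v / temperature for v in row]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def in_batch_similarity(features):
    """Pairwise dot-product similarity between all items in the batch."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def knowledge_transfer_loss(teacher_feats, student_feats, temperature=1.0):
    """Assumed KT loss: mean row-wise KL(teacher || student) over in-batch
    similarity distributions. The teacher stands in for stage-1 recommender
    embeddings, the student for prompt-tuned encoder features."""
    t_sim = in_batch_similarity(teacher_feats)
    s_sim = in_batch_similarity(student_feats)
    total = 0.0
    for t_row, s_row in zip(t_sim, s_sim):
        total += kl_divergence(softmax(t_row, temperature),
                               softmax(s_row, temperature))
    return total / len(t_sim)


# Toy batch of two items with 2-dimensional features (made-up values).
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0]]
stage1 = bpr_loss(pos_scores=[2.0, 1.5], neg_scores=[0.5, 1.0])
stage2 = knowledge_transfer_loss(teacher, student)
```

In this sketch the KT term is zero when the student's in-batch similarity structure matches the teacher's and grows as the two diverge, which is one plausible reading of "aligning modal features with personalized interaction patterns".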
Abstract
With the explosive growth of multimodal content online, pre-trained vision-language models have shown great potential for multimodal recommendation. However, although these models perform reasonably well when used as frozen feature extractors, jointly training them with the recommender surprisingly performs worse than the baseline, owing to significant domain gaps (e.g., feature distribution discrepancy and task objective misalignment) between pre-training and personalized recommendation. Existing approaches either rely on simple feature extraction or require computationally expensive full-model fine-tuning, and thus struggle to balance effectiveness and efficiency. To tackle these challenges, we propose \textbf{P}arameter-efficient \textbf{T}uning for \textbf{M}ultimodal \textbf{Rec}ommendation (\textbf{PTMRec}), a novel framework that bridges the domain gap between pre-trained models and recommendation systems through a knowledge-guided dual-stage parameter-efficient training strategy. This framework not only eliminates the need for costly additional pre-training but also flexibly accommodates various parameter-efficient tuning methods.
