Structural and Disentangled Adaptation of Large Vision Language Models for Multimodal Recommendation

Zhongtao Rao, Peilin Zhou, Dading Chong, Zhiwei Chen, Shoujin Wang, Nan Tang

TL;DR

This work tackles two key challenges in adapting Large Vision-Language Models (LVLMs) for multimodal recommendation: representation misalignment between item-domain data and LVLM pre-training, and gradient conflicts from shared adapters during fine-tuning. It introduces SDA, which combines Cross-Modal Structural Alignment (CMSA) to align embeddings and Modality-Disentangled Adaptation (MoDA) to disentangle gradient flows, within a lightweight two-stage pipeline. Experiments across three Amazon datasets show SDA consistently improves Hit@10 and NDCG@10 across multiple backbones, with notable gains on long-tail items and minimal inference overhead. The results demonstrate that targeted structural alignment and modality-aware low-rank adaptation can unlock LVLMs’ potential for practical, scalable multimodal recommendation.
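
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Hit@K and NDCG@K are commonly computed. It assumes a leave-one-out protocol with exactly one held-out target item per user, which is an assumption made here for illustration and is not stated in this summary. The function names `hit_at_k`, `ndcg_at_k`, and `evaluate` are hypothetical and are not taken from the authors' code.

```python
import math


def hit_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target is among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k for a single relevant item.

    Assumption: one ground-truth item per user, so the ideal DCG is 1 and
    the score reduces to 1 / log2(rank + 1) with a 1-based rank.
    """
    for position, item in enumerate(ranked_items[:k]):  # position is 0-based
        if item == target:
            return 1.0 / math.log2(position + 2)
    return 0.0


def evaluate(rankings, targets, k=10):
    """Average Hit@k and NDCG@k over all users."""
    n = len(targets)
    hit = sum(hit_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    return hit, ndcg


if __name__ == "__main__":
    # Toy data: two users, each with a ranked list and one held-out item.
    rankings = [["a", "b", "c"], ["x", "y", "z"]]
    targets = ["a", "z"]
    print(evaluate(rankings, targets, k=3))
```

Hit@K only records whether the target appears in the top K, while NDCG@K also rewards placing it closer to the top, which is why the two metrics can move by different amounts for the same model change.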

Abstract

Multimodal recommendation enhances accuracy by leveraging visual and textual signals, and its success largely depends on learning high-quality cross-modal representations. Recent advances in Large Vision-Language Models (LVLMs) offer unified multimodal representation learning, making them a promising backbone. However, applying LVLMs to recommendation remains challenging due to (i) representation misalignment, where domain gaps between item data and general pre-training lead to unaligned embedding spaces, and (ii) gradient conflicts during fine-tuning, where shared adapters cause interference and a lack of discriminative power. To address these challenges, we propose SDA, a lightweight framework for Structural and Disentangled Adaptation, which integrates two components: Cross-Modal Structural Alignment (CMSA) and Modality-Disentangled Adaptation (MoDA). CMSA aligns embeddings using intra-modal structures as a soft teacher, while MoDA mitigates gradient conflicts via expertized, gated low-rank paths to disentangle gradient flows. Experiments on three public Amazon datasets show SDA integrates seamlessly with existing multimodal and sequential recommenders, yielding average gains of 6.15% in Hit@10 and 8.64% in NDCG@10. It also achieves up to 12.83% and 18.70% gains on long-tail items with minimal inference overhead. Our code and full experimental results are available at https://github.com/RaoZhongtao/SDA.
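
To make the phrase "gated low-rank paths" more concrete, here is a generic sketch of a frozen linear layer with several LoRA-style low-rank update paths mixed by a softmax gate. It is not the authors' MoDA implementation: the class names `LowRankPath` and `GatedLowRankAdapter`, the softmax gate, the zero initialization, and the choice of two paths are all assumptions made for illustration. The authors' repository linked in the abstract is the place to look for the actual design.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LowRankPath:
    """One low-rank path: delta = B (A x), with rank much smaller than dim."""

    def __init__(self, dim, rank, rng):
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Zero-initialized B (assumption): the path starts as a no-op.
        self.B = [[0.0] * rank for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class GatedLowRankAdapter:
    """Frozen square weight W plus a gate-weighted sum of low-rank paths.

    Each path owns its parameters, so the update reaching a path is scaled
    by its gate weight instead of being forced through one shared adapter.
    """

    def __init__(self, frozen_weight, rank=2, num_paths=2, seed=0):
        rng = random.Random(seed)
        dim = len(frozen_weight)
        self.W = frozen_weight  # kept fixed; only paths and gate would be trained
        self.paths = [LowRankPath(dim, rank, rng) for _ in range(num_paths)]
        self.gate = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_paths)]

    def gate_weights(self, x):
        return softmax(matvec(self.gate, x))

    def __call__(self, x):
        out = matvec(self.W, x)
        for weight, path in zip(self.gate_weights(x), self.paths):
            out = [o + weight * d for o, d in zip(out, path(x))]
        return out


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    adapter = GatedLowRankAdapter(identity)
    print(adapter([1.0, 2.0, 3.0]))  # equals the frozen output before any training
```

In this sketch the gate decides per input how much each path contributes, so in principle one path could specialize on visual tokens and another on textual tokens. Whether and how MoDA routes by modality is not specified in this summary.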

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Illustration of two key challenges when applying LVLMs to recommendation.
  • Figure 2: Overview of the proposed SDA framework.
  • Figure 3: Impact of individual and combined modalities on the Toys dataset with SLMRec as backbone.