Multimodal Quantitative Language for Generative Recommendation

Jianyang Zhai, Zi-Feng Mai, Chang-Dong Wang, Feidiao Yang, Xiawu Zheng, Hui Li, Yonghong Tian

TL;DR

This work tackles the mismatch between generative recommenders driven by pre-trained language models (PLMs) and the rich multimodal content of items. It introduces a unified quantitative language that represents text and image content with a shared vocabulary built from modality-specific codebooks. It proposes a two-stage learning framework, featuring frozen encoders, a residual-quantized VAE translator, and a suite of quantitative language generation tasks (Next Item Generation, Asymmetric Generation, and Alignment), to transfer recommendation knowledge across domains and modalities through pre-training and fine-tuning. Empirical results on three Amazon-based datasets show consistent NDCG gains over strong baselines, supporting cross-domain and cross-modal knowledge transfer and pointing to the method's scalability and potential for universal recommendation models. The paper also discusses practical considerations, such as collision handling and computational cost, and points to future work on missing item content and further efficiency improvements.
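
To make the idea of a residual-quantized translator more concrete, the following is a minimal sketch of the codebook lookup step only, not the paper's implementation. It assumes the codebooks are already learned and that an embedding from a frozen encoder is available. The toy codebook values, the function names, and the lowercase/uppercase token prefixes used to keep text and image tokens apart are all illustrative assumptions.

```python
import math

def residual_quantize(embedding, codebooks):
    """Greedy residual quantization: pick one codeword index per level."""
    residual = list(embedding)
    indices = []
    for codebook in codebooks:
        # Nearest codeword (Euclidean distance) to the current residual.
        idx = min(range(len(codebook)),
                  key=lambda k: math.dist(codebook[k], residual))
        indices.append(idx)
        # Subtract the chosen codeword; the next level quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def to_tokens(indices, level_prefixes):
    """Turn per-level indices into tokens such as <a_1>, one prefix per level."""
    return [f"<{prefix}_{i}>" for prefix, i in zip(level_prefixes, indices)]

# Hypothetical two-level codebooks, one set per modality (toy values).
text_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]],
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],
]
image_codebooks = [
    [[0.0, 0.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, 0.25]],
]

# Stand-ins for the outputs of frozen text and image encoders.
text_embedding = [1.4, 0.6]
image_embedding = [2.2, 2.3]

text_codes = residual_quantize(text_embedding, text_codebooks)
image_codes = residual_quantize(image_embedding, image_codebooks)

# Assumed convention: lowercase prefixes for text, uppercase for image.
text_tokens = to_tokens(text_codes, ["a", "b"])
image_tokens = to_tokens(image_codes, ["A", "B"])
print(text_tokens, image_tokens)
```

Because every item, whatever its domain, is mapped to tokens drawn from the same small set per modality, a single sequence model can be trained on items from many domains. A trained RQ-VAE would additionally include the encoder, decoder, and codebook learning, which this sketch omits.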

Abstract

Generative recommendation has emerged as a promising paradigm aiming at directly generating the identifiers of the target candidates. Most existing methods attempt to leverage prior knowledge embedded in Pre-trained Language Models (PLMs) to improve the recommendation performance. However, they often fail to accommodate the differences between the general linguistic knowledge of PLMs and the specific needs of recommendation systems. Moreover, they rarely consider the complementary knowledge between the multimodal information of items, which represents the multi-faceted preferences of users. To facilitate efficient recommendation knowledge transfer, we propose a novel approach called Multimodal Quantitative Language for Generative Recommendation (MQL4GRec). Our key idea is to transform items from different domains and modalities into a unified language, which can serve as a bridge for transferring recommendation knowledge. Specifically, we first introduce quantitative translators to convert the text and image content of items from various domains into a new and concise language, known as quantitative language, with all items sharing the same vocabulary. Then, we design a series of quantitative language generation tasks to enrich quantitative language with semantic information and prior knowledge. Finally, we achieve the transfer of recommendation knowledge from different domains and modalities to the recommendation task through pre-training and fine-tuning. We evaluate the effectiveness of MQL4GRec through extensive experiments and comparisons with existing methods, achieving improvements over the baseline by 11.18%, 14.82%, and 7.95% on the NDCG metric across three different datasets, respectively.
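
For readers unfamiliar with the metric, the sketch below shows how NDCG@K is commonly computed when each test case has a single ground-truth next item, and how a relative improvement over a baseline is derived. It is an illustration under assumptions: this section does not state the cutoff K or the evaluation protocol, and it is assumed here that the reported percentages are relative improvements. The item IDs and scores are made up.

```python
import math

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K with one relevant item: 1 / log2(rank + 1) if the target
    appears in the top K, otherwise 0. The ideal DCG is 1 in this setting."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the target
    return 1.0 / math.log2(rank + 1)

def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score

# Toy ranking produced by a hypothetical recommender.
ranking = ["i3", "i7", "i1", "i9"]
hit_at_rank_2 = ndcg_at_k(ranking, "i7", k=3)   # target ranked second
miss = ndcg_at_k(ranking, "i9", k=3)            # target outside the top 3

# Toy scores, not taken from the paper.
gain = relative_improvement(0.055, 0.05)
print(hit_at_rank_2, miss, gain)
```

The dataset-level score is the mean of this per-user value, so a gain in NDCG reflects the target item being ranked higher on average.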

Paper Structure

This paper contains 38 sections, 7 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Illustration of our MQL4GRec. We translate items from different domains and modalities into a new unified language, which can then serve as a bridge for transferring recommendation knowledge.
  • Figure 2: The overall framework of MQL4GRec. We regard the quantizer as a translator, converting item content from different domains and modalities into a unified quantitative language, thus bridging the gap between them (left). Subsequently, we design a series of quantitative language generation tasks to facilitate the transfer of recommendation knowledge through pre-training and fine-tuning (right).
  • Figure 3: The impact of varying amounts of pre-training datasets on recommendation performance.
  • Figure 4: The impact of different pre-training epochs on recommendation performance.