CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling

Yunshan Ma, Yingzhi He, Wenjun Zhong, Xiang Wang, Roger Zimmermann, Tat-Seng Chua

TL;DR

CIRP addresses the need for high-quality, relation-aware multimodal item representations for product bundling by jointly modeling intra-item multimodal semantics and cross-item relations. It introduces a cross-item relational pre-training framework with ITC and CIC losses, coupled with a relation pruning module to denoise the graph and accelerate training. The method yields superior performance over various baselines across three large e-commerce datasets and demonstrates robustness to cold-start items, while maintaining efficiency gains from pruning. This approach offers practical benefits for bundle construction and can extend to broader downstream tasks that require cross-item relational understanding in multimodal settings.

Abstract

Product bundling has been a prevailing marketing strategy that is beneficial in the online shopping scenario. Effective product bundling methods depend on high-quality item representations, which need to capture both the individual items' semantics and cross-item relations. However, previous item representation learning methods, whether based on feature fusion or graph learning, suffer from inadequate cross-modal alignment and struggle to capture cross-item relations for cold-start items. Multimodal pre-trained models are a potential solution given their promising performance on various multimodal downstream tasks. However, cross-item relations have been under-explored in current multimodal pre-trained models. To bridge this gap, we propose a novel and simple framework, Cross-Item Relational Pre-training (CIRP), for item representation learning in product bundling. Specifically, we employ a multimodal encoder to generate image and text representations. Then we leverage both the cross-item contrastive loss (CIC) and each individual item's image-text contrastive loss (ITC) as the pre-training objectives. Our method seeks to integrate cross-item relation modeling capability into the multimodal encoder while preserving the in-depth aligned multimodal semantics. Therefore, even for cold-start items that have no relations, their representations are still relation-aware. Furthermore, to eliminate potential noise and reduce the computational cost, we harness a relation pruning module to remove noisy and redundant relations. We apply the item representations extracted by CIRP to the product bundling model ItemKNN, and experiments on three e-commerce datasets demonstrate that CIRP outperforms various leading representation learning methods.
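
To make the two pre-training objectives concrete, the following is a minimal NumPy sketch of how an image-text contrastive term (ITC) and a cross-item contrastive term (CIC) could be combined. It is an illustration under stated assumptions, not the authors' implementation: both terms are written as a symmetric InfoNCE loss, an item's representation in the CIC term is assumed to be the mean of its image and text embeddings, and the names `info_nce`, `cirp_style_loss`, `temperature`, and `cic_weight` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE: row i of `a` is the positive for row i of `b`;
    all other rows in the batch act as negatives."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def cirp_style_loss(img, txt, rel_img, rel_txt, cic_weight=1.0):
    """Sum of an ITC term and a CIC term.

    img, txt         : image/text embeddings of a batch of items, shape (n, d)
    rel_img, rel_txt : embeddings of one related item per batch item, shape (n, d)
    """
    # ITC: align each item's image with its own text.
    itc = info_nce(img, txt)
    # CIC (assumed form): align each item with its related item, using the
    # mean of image and text embeddings as the item representation.
    item = 0.5 * (l2_normalize(img) + l2_normalize(txt))
    related = 0.5 * (l2_normalize(rel_img) + l2_normalize(rel_txt))
    cic = info_nce(item, related)
    return itc + cic_weight * cic

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
rel_img, rel_txt = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = cirp_style_loss(img, txt, rel_img, rel_txt)
```

Because both terms pass gradients through the same encoder in an actual training setup, an item with no observed relations would still be embedded by an encoder that has been shaped by the relational term, which is the intuition behind the cold-start claim in the abstract.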

Paper Structure (30 sections, 7 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison of three item representation learning paradigms for incorporating semantic and relational data. Our method integrates the relational information into the multimodal encoder.
  • Figure 2: Illustration of the overall pre-training framework (CIRP) and the downstream task of product bundling. CIRP takes relational and multimodal semantic inputs, leverages a multimodal encoder, and is optimized by the CIC and ITC losses. For the downstream task, we leverage the ItemKNN model and CIRP-extracted item representations for product bundling.
  • Figure 3: Analysis of how varying the relation pruning rate affects the pre-training efficiency and product bundling performance.
  • Figure 4: Qualitative visualization of how the item representation space shifts after applying the semantic pre-training (BLIP-FT) and the cross-item relational pre-training (CIRP).
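
The downstream step shown in Figure 2 can be sketched as a nearest-neighbor lookup over pre-computed item representations. The snippet below is a hedged illustration only: it assumes the query is the mean of the partial bundle's normalized embeddings and that candidates are ranked by cosine similarity, which may differ from the ItemKNN configuration used in the paper. The function name `knn_complete_bundle` and its parameters are hypothetical.

```python
import numpy as np

def knn_complete_bundle(item_emb, partial_bundle, k=3):
    """Rank candidate items for a partial bundle by cosine similarity.

    item_emb       : array of shape (num_items, d) with item representations
    partial_bundle : list of item indices already in the bundle
    k              : number of items to return
    """
    emb = item_emb / (np.linalg.norm(item_emb, axis=1, keepdims=True) + 1e-12)
    # Assumption: the bundle query is the mean of its items' embeddings.
    query = emb[partial_bundle].mean(axis=0)
    scores = emb @ query
    # Items already in the bundle are excluded from the candidates.
    scores[partial_bundle] = -np.inf
    return np.argsort(-scores)[:k].tolist()

toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
suggested = knn_complete_bundle(toy_emb, partial_bundle=[0], k=2)
```

In this toy example, item 1 points in nearly the same direction as item 0, so it is ranked first, followed by the orthogonal item 2.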