MOON2.0: Dynamic Modality-balanced Multimodal Representation Learning for E-commerce Product Understanding

Zhanheng Nie, Chenghan Fu, Daoze Zhang, Junxian Wu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

TL;DR

MOON2.0 tackles modality imbalance, intra-product alignment, and noisy data in e-commerce multimodal understanding by integrating a dynamic Modality-driven MoE for Multimodal Joint Learning, a Dual-level Alignment objective (inter- and intra-product), and an MLLM-based image-text co-augmentation strategy with Dynamic Sample Filtering. It also introduces MBE2.0, a large-scale co-augmented benchmark for retrieval, classification, and attribute prediction. Empirical results demonstrate state-of-the-art zero-shot performance on MBE2.0 and public datasets, supported by qualitative heatmaps showing improved image-text alignment. Together, these contributions yield robust, modality-balanced representations for diverse e-commerce tasks and suggest broader applicability to multimodal learning challenges.

Abstract

The rapid growth of e-commerce calls for multimodal models that comprehend rich visual and textual product information. Although recent multimodal large language models (MLLMs) for product understanding exhibit strong capability in representation learning for e-commerce, they still face three challenges: (i) the modality imbalance induced by modality-mixed training; (ii) underutilization of the intrinsic alignment relationships among visual and textual information within a product; and (iii) limited handling of noise in e-commerce multimodal data. To address these, we propose MOON2.0, a dynamic modality-balanced multimodal representation learning framework for e-commerce product understanding. MOON2.0 comprises: (1) a Modality-driven Mixture-of-Experts (MoE) module that adaptively processes input samples by their modality composition, enabling Multimodal Joint Learning to mitigate the modality imbalance; (2) a Dual-level Alignment method to better leverage semantic alignment properties inside individual products; and (3) an MLLM-based Image-text Co-augmentation strategy that integrates textual enrichment with visual expansion, coupled with Dynamic Sample Filtering to improve training data quality. We further introduce MBE2.0, a co-augmented multimodal representation benchmark for e-commerce representation learning and evaluation. Experiments show that MOON2.0 delivers state-of-the-art zero-shot performance on MBE2.0 and multiple public datasets. Furthermore, attention-based heatmap visualization provides qualitative evidence of improved multimodal alignment in MOON2.0.

Paper Structure

This paper contains 17 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Overall results on all downstream tasks. The arrows in the diagram indicate the retrieval direction of each retrieval task.
  • Figure 2: Modality imbalance on the training set under the mixed training strategy. Image-based and text-based retrieval refer to the image-to-multimodal and text-to-multimodal retrieval tasks, respectively, and the dashed line corresponds to the model performance after multi-objective joint training.
  • Figure 3: Pipeline of our MOON2.0. Given a training triplet consisting of a query, a positive item, and a negative item, the model processes each element into three input modalities: multimodal ($x^{mm}$, combining both image and text), image-only ($x^{i}$), and text-only ($x^{t}$). In addition, the positive and negative items further include an enriched title and augmented images.
  • Figure 4: (a) Modality-driven MoE. We adopt the MoE module for the feed-forward layers of the LLM backbone. Each of $\{\hat{h}_{\text{q}},\hat{h}_{\text{p}},\hat{h}_{\text{n}}\}$ includes hidden states for $\text{t},\text{i},\text{mm}$ input modalities. (b) Dual-level Alignment. Besides inter-product alignment, we introduce intra-product alignment to further leverage the semantic consistency within the e-commerce products. The arrow symbol ($\rightarrow$) denotes the alignment relationship. Specifically, $\text{q},\text{p},\text{n}$ represent the query, positive item, and negative item, while $\text{t},\text{i},\text{mm}$ denote text-only, image-only, and multimodal input modalities.
  • Figure 5: The MLLM-based Image-text Co-augmentation pipeline.
  • ...and 2 more figures
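
The input construction described in the Figure 3 and Figure 4 captions can be illustrated with a minimal Python sketch. It is an illustration only, not the authors' implementation: the `Item` container, the function names, and the assumption that every query and item carries both an image and a text are hypothetical. The set of intra-product alignment pairs shown here (all ordered pairs of distinct modalities) is likewise an assumed reading of the caption, since the listing above does not specify which pairs are aligned.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Optional, Tuple

# Modality tags from the captions: text-only, image-only, multimodal.
MODALITIES = ("t", "i", "mm")


@dataclass(frozen=True)
class Item:
    """Hypothetical container for a query or a product (not from the paper)."""
    text: Optional[str]
    image: Optional[str]  # placeholder for an image path or ID


def modality_views(item: Item) -> Dict[str, Item]:
    """Build the three input views x^t, x^i, x^mm of one element."""
    return {
        "t": Item(text=item.text, image=None),         # text-only
        "i": Item(text=None, image=item.image),        # image-only
        "mm": Item(text=item.text, image=item.image),  # image + text
    }


def expand_triplet(query: Item, positive: Item, negative: Item) -> Dict[str, Dict[str, Item]]:
    """Map a (q, p, n) triplet to its 3 x 3 grid of modality views."""
    roles = (("q", query), ("p", positive), ("n", negative))
    return {role: modality_views(item) for role, item in roles}


def intra_product_pairs() -> List[Tuple[str, str]]:
    """Assumed pairing: every ordered pair of distinct modalities in one product."""
    return list(permutations(MODALITIES, 2))


views = expand_triplet(
    Item(text="red running shoes", image="query.jpg"),
    Item(text="lightweight red trainers", image="pos.jpg"),
    Item(text="blue denim jacket", image="neg.jpg"),
)
```

Here `views["p"]["i"]` is the image-only view of the positive item, and `intra_product_pairs()` enumerates candidate modality pairs to align within a single product. How these views are encoded and which pairs enter the loss are defined in the paper itself.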