A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation

Yunxin Li; Baotian Hu; Wenhan Luo; Lin Ma; Yuxin Ding; Min Zhang

TL;DR

ModICT is a simple and effective Multimodal In-Context Tuning approach that introduces a similar product sample as a reference and uses the in-context learning capability of language models to generate the product description.

Abstract

In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods use visual and textual encoders to encode the image and keywords and employ a language-model-based decoder to generate the product description. However, the generated description is often inaccurate and generic: same-category products have similar copywriting, and optimizing the overall framework on large-scale samples pushes models to concentrate on common words and ignore product-specific features. To alleviate this issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as a reference and uses the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen and optimize only the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation ability of large language models (LLMs) and substantially increases description diversity. To assess the effectiveness of ModICT across language model scales and types, we collect data from three distinct product categories within the E-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for the automatic generation of product descriptions in a wide range of applications. Code is at: https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning
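
The abstract reports results on Rouge-L and D-5 without defining them. As a rough illustration only, the sketch below shows how such metrics are commonly computed, assuming that D-5 denotes Distinct-5 (the ratio of unique 5-grams to all 5-grams) and that Rouge-L is scored as an F1 over the longest common subsequence. The function names, the F1 weighting, and the pre-tokenized inputs are assumptions made for this example and are not taken from the authors' code.

```python
from typing import List, Sequence


def distinct_n(texts: Sequence[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over tokenized texts.

    Assumption: "D-5" in the abstract corresponds to distinct_n(texts, 5).
    """
    total = 0
    unique = set()
    for tokens in texts:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: List[str], reference: List[str]) -> float:
    """LCS-based F1 between a generated and a reference description.

    Assumption: equal weighting of precision and recall; official ROUGE
    toolkits may use a different weighting.
    """
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a higher Distinct-n means less repeated phrasing across generated descriptions, while a higher Rouge-L means closer overlap with the reference copy. How the text is tokenized (for example, by word or by character) changes both scores, so that choice would need to match the paper's evaluation setup.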

Paper Structure (14 sections, 2 equations, 3 figures, 7 tables)

Introduction
Related Work
Methodology
Overview
In-Context Reference Construction
Efficient Multimodal In-Context Tuning
Training and Inference
Experiment
Dataset: MD2T
Experimental Settings
Quantitative Analysis
Ablation Study
Qualitative Analysis
Conclusion

Figures (3)

Figure 1: Illustration of conventional approaches and our method for E-commerce product description generation.
Figure 2: The overall workflow of ModICT. The left part depicts the process of in-context reference construction. The right part shows efficient multimodal in-context tuning for (1) a sequence-to-sequence language model and (2) an autoregressive language model. Blocks outlined in red are learnable.
Figure 3: An illustration of descriptions generated by several models. Blue words are keyphrases related to the marketing keywords, red words mark inaccurate expressions, and green sentences are eye-catching statements.
