E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

TL;DR

This work introduces E-VAds, the first benchmark tailored to conversion-oriented e-commerce short videos, and shows that these videos carry high multimodal information density across vision, audio, and text. It proposes a multi-modal information density framework with three metrics, $V_{\mathrm{den}}$, $A_{\mathrm{den}}$, and $O_{\mathrm{den}}$, under which E-VAds is denser than existing benchmarks, making it a challenging testbed for MLLMs. To tackle open-ended commercial reasoning, the paper develops E-VAds-R1, an RL-based model trained with MG-GRPO, a multi-grained reward design that grounds answers in multi-modal evidence. With only a few hundred training samples, E-VAds-R1 achieves a 109.2% relative improvement in commercial-intent reasoning over strong baselines, highlighting data-efficient, evidence-grounded learning for modality-dense domains. The dataset, evaluation protocol, and reward design offer a pathway toward practical, industry-relevant multimodal reasoning in e-commerce video understanding.

Abstract

E-commerce short videos represent a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities than mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, (1) Perception and (2) Cognition and Reasoning, which together comprise five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
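The abstract does not spell out how MG-GRPO turns rewards into a training signal. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO-style training: several responses to the same question are scored, and each score is normalized against the group's mean and standard deviation. The function name, the use of population standard deviation, the `eps` constant, and the example scores are all assumptions made for illustration; the multi-grained reward itself is not reproduced here.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same question.

    Assumption: population standard deviation and a small eps for stability;
    real implementations may differ in both choices.
    """
    if not rewards:
        raise ValueError("need at least one reward")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # Responses scoring above the group mean get a positive advantage,
    # responses below it get a negative one.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward-model scores for four sampled responses to one question.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within each group, a response is rewarded for being better than its siblings rather than for its absolute score, which is one reason this style of training needs no separate value model.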

Paper Structure

This paper contains 37 sections, 9 equations, 14 figures, and 5 tables.

Figures (14)

  • Figure 1: Overview of E-VAds benchmark.
  • Figure 2: Statistics of E-VAds benchmark.
  • Figure 3: Dataset Construction Pipeline.
  • Figure 4: The E-VAds-R1 framework. Given a question, the policy model produces multiple responses, each containing a think segment and an answer; a reward model scores them, and the resulting rewards guide policy updates through policy gradient optimization.
  • Figure 5: Detailed distributions of multi-modal information density metrics ($V_{\mathrm{den}}$, $A_{\mathrm{den}}$, and $O_{\mathrm{den}}$) across datasets.
  • ...and 9 more figures