E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng
TL;DR
This work introduces E-VAds, the first benchmark tailored to conversion-oriented e-commerce short videos, and shows that these videos carry high multi-modal information density across vision, audio, and text. It proposes a multi-modal information density framework with three measures, $V_{\mathrm{den}}$, $A_{\mathrm{den}}$, and $O_{\mathrm{den}}$, under which E-VAds is denser than existing benchmarks and therefore a more challenging testbed for MLLMs. To tackle open-ended commercial reasoning, the paper develops E-VAds-R1, an RL-based model using MG-GRPO, a multi-grained reward design that grounds answers in multi-modal evidence. With only a few hundred training samples, E-VAds-R1 achieves a 109.2% relative improvement in commercial-intent reasoning over strong baselines, which points to data-efficient, evidence-grounded learning for modality-dense domains. The dataset, evaluation protocol, and reward design offer a path toward practical, industry-relevant multi-modal reasoning in e-commerce video understanding.
Abstract
E-commerce short videos are a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across the visual, audio, and textual modalities than mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, (1) Perception and (2) Cognition and Reasoning, which together comprise five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
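
As a reading aid for the 109.2% figure: assuming it is a relative gain over a baseline score (the abstract does not give the underlying scores, and the symbols $S_{\mathrm{base}}$ and $S_{\mathrm{R1}}$ below are hypothetical placeholders, not notation from the paper), it would be computed as

$$
\text{gain} = \frac{S_{\mathrm{R1}} - S_{\mathrm{base}}}{S_{\mathrm{base}}} \times 100\%,
\qquad
\text{gain} = 109.2\% \;\Longrightarrow\; S_{\mathrm{R1}} = 2.092\, S_{\mathrm{base}}.
$$

Under that assumption, a gain above 100% means the commercial intent reasoning score slightly more than doubles relative to the baseline.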
