D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching

Jingyu Liu, Minquan Wang, Ye Ma, Bo Wang, Aozhu Chen, Quan Chen, Peng Jiang, Xirong Li

TL;DR

This work tackles enriching e-commerce videos with moment-specific sound effects by introducing Video Decoration with SFX (VDSFX) and a dedicated dataset, SFX-Moment. It presents D&M, a DETR-based model that jointly detects key moments and performs moment-to-SFX matching, facilitated by multi-modal video and SFX embeddings. The training utilizes MSM pre-training and Tag-aware Negative Sampling to align cross-modal representations and balance negatives, achieving superior results over strong baselines. The approach demonstrates practical potential for enhancing user engagement in online shopping videos, while revealing avenues for fine-grained visual cues and interactive editing as future work.

Abstract

Videos showcasing specific products are increasingly important for E-commerce. Key moments arise naturally in such videos, for example the first appearance of a product, the presentation of its distinctive features, or the presence of a buying link. Adding proper sound effects (SFX) to these key moments, or video decoration with SFX (VDSFX), is crucial for enhancing user engagement. Previous studies on adding SFX to videos perform video-to-SFX matching at a holistic level and therefore cannot add SFX to a specific moment. Meanwhile, previous studies on video highlight detection or video moment retrieval consider only moment localization, leaving moment-to-SFX matching untouched. By contrast, we propose D&M, a unified method that performs key moment detection and moment-to-SFX matching simultaneously. Moreover, for the new VDSFX task we build a large-scale dataset, SFX-Moment, from an E-commerce platform. For a fair comparison, we build competitive baselines by extending a number of current video moment detection methods to the new task. Extensive experiments on SFX-Moment show that the proposed method outperforms these baselines.

Paper Structure (32 sections, 4 equations, 4 figures, 7 tables)


Figures (4)

  • Figure 1: Illustration of video decoration with sound effects (VDSFX), which aims to automatically add proper SFX to key moments, themselves auto-detected, in a given E-commerce video. Moment-DETR+ and $\text{R}^2\text{-Tuning}$+ are baselines we implement by repurposing Moment-DETR [lei2021detecting] and $\text{R}^2\text{-Tuning}$ [liu2024r] for the new task, with their detected moments used for moment-to-SFX matching. Best viewed digitally.
  • Figure 2: Diagram of our proposed D&M method for VDSFX. The example input video consists of $30$ frames with $9$ subtitles. Each sound effect, indexed by $k$, is jointly represented by an audio clip $a_k$, a manually written short description $d_k$ and a categorical tag $y_k$. SFX0 is a special token indicating "no SFX". The ASR module and the visual / textual / audio backbones, i.e. ViT / RoBERTa / AST, are all frozen. Non-trainable blocks are shown in gray. Best viewed on screen.
  • Figure 3: Visualization of SFX-Moment. (a) Video duration. (b) Centers of key moments normalized by video length. (c) Snapshots of video samples.
  • Figure 4: Some qualitative results. Best viewed digitally.
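
The SFX representation described in the Figure 2 caption (an audio clip $a_k$, a short description $d_k$, a categorical tag $y_k$, plus a special SFX0 entry meaning "no SFX") can be pictured with a small data structure. The sketch below is only an illustration under stated assumptions: the class name `SoundEffect`, the helper `match_moment`, the file names, and the toy 2-D embeddings are all hypothetical, and the nearest-neighbour cosine matching is a stand-in, not the learned DETR-based matching that D&M performs.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class SoundEffect:
    """One SFX entry, mirroring the (a_k, d_k, y_k) triple in the Figure 2 caption."""
    audio_path: Optional[str]  # a_k: audio clip (None for the "no SFX" entry)
    description: str           # d_k: manually written short description
    tag: str                   # y_k: categorical tag


# Index 0 plays the role of SFX0, the special "no SFX" token.
# All entries below are made-up examples, not items from SFX-Moment.
LIBRARY: List[SoundEffect] = [
    SoundEffect(None, "no SFX", "none"),
    SoundEffect("ding.wav", "short bell chime", "notification"),
    SoundEffect("whoosh.wav", "fast air swoosh", "transition"),
]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_moment(moment_emb: Sequence[float],
                 sfx_embs: Sequence[Sequence[float]]) -> int:
    """Return the index of the SFX embedding closest to the moment embedding.

    Assumption: moment and SFX embeddings already live in a shared space.
    """
    scores = [cosine(moment_emb, e) for e in sfx_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings, one per LIBRARY entry (hypothetical values).
SFX_EMBS = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]

best = match_moment([1.0, 0.0], SFX_EMBS)
print(LIBRARY[best].description)
```

Placing the "no SFX" entry at index 0 lets a single argmax decide both whether a moment gets a sound effect and which one it gets, which is one simple way to read the role of SFX0 in the caption.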