Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark

Yongliang Wu, Wenbo Zhu, Jiawang Cao, Yi Lu, Bozheng Li, Weiheng Chi, Zihan Qiu, Lirian Su, Haolin Zheng, Jay Wu, Xu Yang

TL;DR

This work tackles long-to-short video repurposing by introducing Repurpose-10K, a large-scale, real-UGC dataset with over 10k videos and 120k annotated clips, collected through a two-stage annotation pipeline that leverages LLM-assisted coarse edits, user likes/dislikes, and expert refinement to produce high-quality clip boundaries. It proposes an end-to-end transformer baseline that jointly performs classification and regression over multi-modal segments, augmented by a Caption Enhancement Encoder and a Multi-Modal Align Guider to fuse audio, visual, and caption information and enforce cross-modal consistency with focal and KL-divergence losses; a 1-D IoU-based loss further refines temporal predictions. Through extensive experiments and ablations, the method outperforms state-of-the-art temporal grounding models on Repurpose-10K and demonstrates the critical role of caption signals and alignment in achieving coherent, engaging repurposed clips. The dataset and model establish a practical, scalable benchmark for research into video repurposing, with potential impact on content creation pipelines for social media platforms.
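The 1-D IoU-based loss mentioned above can be illustrated with a minimal sketch. The summary does not give the paper's exact formulation, so this assumes the common `1 - IoU` form over `(start, end)` segments in seconds. The function names `temporal_iou` and `iou_loss` are hypothetical and not taken from the authors' code.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two 1-D segments given as (start, end)."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Overlap length; clamped at zero when the segments are disjoint.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, gt):
    """Assumed loss form: 0 for a perfect match, 1 for no overlap."""
    return 1.0 - temporal_iou(pred, gt)
```

For example, a predicted clip of 0 to 10 s against a ground-truth clip of 5 to 15 s overlaps for 5 s out of a 15 s union, giving an IoU of 1/3 and a loss of 2/3 under this assumed form.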

Abstract

Demand for short-form videos to share on social media platforms has grown significantly in recent years. Video summarization and highlight detection have advanced notably and can create partially usable short clips from raw videos, but these approaches are often domain-specific and require an in-depth understanding of real-world video content. To address this, we propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips, aimed at the video long-to-short task. Because untrained human annotators can produce inaccurate annotations for repurposed videos, we propose a two-stage solution to obtain annotations from real-world user-generated content. Furthermore, we offer a baseline model for this challenging task that integrates audio, visual, and caption information through a cross-modal fusion and alignment framework. We hope our work will stimulate research in the less-explored area of video repurposing.

Paper Structure

This paper contains 32 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The distinction between Video Repurposing and other similar tasks. From top to bottom: Video Repurposing, Highlight Detection, Temporal Event Localization, Video Chaptering, and Video Summarization.
  • Figure 2: Histograms of Repurpose-10K video statistics. From left to right: duration of collected videos, duration of repurposed clips (y-axis: number of videos), and log-scale distribution of view counts for collected videos.
  • Figure 3: The overall architecture of our proposed baseline model consists of two main components: the Caption Enhancement Encoder and the Multi-Modal Align Guider. Q: Query. K: Key. V: Value. FC: Fully Connected Layer.
  • Figure 4: Two visual examples of video repurposing results. Blue: Ground Truth. Yellow: Predictions.