Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

Xiao Wang; Jianlong Wu; Zijia Lin; Fuzheng Zhang; Di Zhang; Liqiang Nie

TL;DR

This study quantitatively reveals an “impossible trinity” among data quantity, diversity, and quality in pre-training datasets, and introduces the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.

Abstract

Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully leverage useful information in multimodal video content (frames, tags, ASR transcripts, etc.) to refine the original annotations. Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. Then, we pre-train on it and fine-tune on human refinement examples for a stronger model. These processes are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel noise control method that requires weaker assumptions on noise distribution, thereby proving more effective in large datasets with theoretical guarantees. The combination of iterative refinement and AdaTaiLr can achieve better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
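The iterative refinement loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `run_flywheel`, `toy_initial`, and `toy_train` are hypothetical names, and the `train` callable stands in for the whole "pre-train on the refined dataset, then fine-tune on human refinement examples" step, including any noise control.

```python
from typing import Callable, List, Tuple

Video = str
Caption = str
Dataset = List[Tuple[Video, Caption]]
Annotator = Callable[[Video], Caption]
# Hypothetical trainer: (refined dataset, human refinement examples) -> stronger annotator.
Trainer = Callable[[Dataset, Dataset], Annotator]


def run_flywheel(
    videos: List[Video],
    initial_annotator: Annotator,
    train: Trainer,
    human_examples: Dataset,
    num_iterations: int,
) -> Tuple[Annotator, List[Dataset]]:
    """Alternate between annotating the videos and training a stronger annotator."""
    annotator = initial_annotator
    history: List[Dataset] = []
    for _ in range(num_iterations):
        # Step 1: generate synthetic annotations with the current model.
        refined = [(video, annotator(video)) for video in videos]
        history.append(refined)
        # Step 2: pre-train on the refined data and fine-tune on human examples.
        annotator = train(refined, human_examples)
    return annotator, history


# Toy stand-ins so the loop can be run end to end (purely illustrative).
def toy_initial(video: Video) -> Caption:
    return f"v0:{video}"


def toy_train(refined: Dataset, human_examples: Dataset) -> Annotator:
    # Read the version tag of the incoming captions and emit the next version.
    version = int(refined[0][1].split(":")[0][1:]) + 1
    return lambda video: f"v{version}:{video}"


final_annotator, history = run_flywheel(["a", "b"], toy_initial, toy_train, [], 2)
```

In the toy run, each round's captions carry the version of the model that produced them, which makes it easy to see that round k is annotated by the model trained in round k-1.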

Paper Structure (41 sections, 9 theorems, 61 equations, 12 figures, 9 tables, 1 algorithm)

Introduction
Related work
Video-Language Datasets
Refining Video-Language Datasets from Web
Video Large Language Models
Data Flywheel for video-language understanding
Overview of VidDF
Annotation Refinement
Annotation Refinement at the Initial Stage
Annotation Refinement at the Iterative Stage
AdaTaiLr: Noise Control for Pre-training
Preliminaries for KLD, TVD, and TaiLr
AdaTaiLr
Pre-training and Supervised Fine-tuning
Experiments
...and 26 more sections

Key Result

Theorem 3.1

Given a VideoLLM $p_\theta^{<t}(y_t|\textbf{y}_{<t}, \textbf{x})$ parameterized by $\theta$ and the real data distribution $p_o^{<t}(y_t|\textbf{y}_{<t}, \textbf{x})$, the following function: where $\mathbbm{1}[z]$ is the indicator function: minimizes the upper bound of the TaiLr estimation error $\epsilon$:
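For context on what $\gamma$ controls, the sketch below assumes the token-level weighting used by TaiLr in prior work, where the log-likelihood of each target token is scaled by $p / (\gamma + (1-\gamma)\,p)$, with $p$ the model's probability of that token. It uses a fixed $\gamma$; the adaptive choice of $\gamma$ stated in Theorem 3.1 is not reproduced here. The function names `tailr_weight` and `tailr_loss` are hypothetical.

```python
import math
from typing import Sequence


def tailr_weight(p: float, gamma: float) -> float:
    """Assumed TaiLr weight for a target token with model probability p.

    gamma = 0 gives weight 1 (plain maximum likelihood);
    gamma = 1 gives weight p (strongest down-weighting of unlikely tokens).
    Requires 0 < p <= 1 and 0 <= gamma <= 1.
    """
    return p / (gamma + (1.0 - gamma) * p)


def tailr_loss(token_probs: Sequence[float], gamma: float) -> float:
    """Mean weighted negative log-likelihood over the target tokens.

    In actual training the weight would be treated as a constant
    (no gradient flows through it); this plain-Python sketch only
    evaluates the loss value.
    """
    total = sum(tailr_weight(p, gamma) * -math.log(p) for p in token_probs)
    return total / len(token_probs)


# A token the model finds unlikely contributes less when gamma > 0,
# which is how the weighting limits the influence of noisy targets.
low = tailr_weight(0.01, 0.5)
high = tailr_weight(0.9, 0.5)
```

Under this assumed form, a larger $\gamma$ suppresses low-probability (potentially noisy) targets more aggressively, while a smaller $\gamma$ stays closer to standard likelihood training.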

Figures (12)

Figure 1: For the impossible data trinity (a) among video-language pre-training datasets, we propose the Video DataFlywheel (b) for data refinement. It achieves better trinity (c) and scalability (d) in large data.
Figure 2: Unified framework of existing dataset refinement methods, consisting of three procedures in diamond boxes.
Figure 3: Method overview. (a) Our Video DataFlywheel framework comprises two stages. The initial refinement stage refines the ASR dataset by prompting an LLM and an ILM, since no VideoLLM is available at this stage. The iterative refinement stage refines the dataset using the VideoLLM trained in the previous stage. AdaTaiLr is applied for noise control during pre-training at both stages. (b) During initial refinement, an LLM summarizes the image captions generated from frames. (c) In iterative refinement, a VideoLLM generates annotations based on multimodal video content.
Figure 4: Ablation studies of the Video DataFlywheel framework.
Figure 5: Sensitivity analysis of $\lambda$ controlling the smoothness of approximation in AdaTaiLr.
...and 7 more figures

Theorems & Definitions (16)

Theorem 3.1: Optimal $\gamma$
Theorem 3.2: Approximation of Optimal $\gamma$
Theorem C.1: Optimal $\gamma$
proof
Lemma C.1
proof
Lemma C.2
proof
Lemma C.3
proof
...and 6 more
