VDCook: DIY video data cook your MLLMs

Chengwei Wu

TL;DR

This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm.

Abstract

We introduce VDCook, a self-evolving video data operating system: a configurable video data construction platform for researchers and vertical-domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically optimizes the query and runs the real-video retrieval and controlled synthesis modules concurrently. It then produces in-domain data packages with complete provenance and metadata, along with reproducible notebooks. Unlike traditional static datasets that are built once, VDCook supports continuous updates and domain expansion through an automated data ingestion mechanism based on MCP (Model Context Protocol)\cite{mcp2024anthropic}, turning datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible downstream data 'cooking' and indexing\cite{vlogger}. The platform aims to lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: https://screenapp.io/app/v/WP0SvffgsH
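
To make the request parameters concrete, the following is a minimal Python sketch of what a data request with a scale, a retrieval-synthesis ratio, and a quality threshold might look like. It is not VDCook's actual API: the class `DataRequest`, its field names, the helpers `split_budget` and `filter_by_quality`, and the `quality` metadata key are all hypothetical, and the ratio is assumed here to be expressed as the fraction of clips to synthesize.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataRequest:
    """Hypothetical request shape; names are illustrative, not VDCook's API."""
    query: str                 # natural language description of the target domain
    scale: int                 # total number of clips requested
    synthesis_fraction: float  # assumed encoding of the retrieval-synthesis ratio
    quality_threshold: float   # minimum quality score a clip must reach

    def __post_init__(self) -> None:
        if self.scale < 0:
            raise ValueError("scale must be non-negative")
        if not 0.0 <= self.synthesis_fraction <= 1.0:
            raise ValueError("synthesis_fraction must lie in [0, 1]")


def split_budget(request: DataRequest) -> Tuple[int, int]:
    """Return (retrieved, synthesized) clip counts that sum to request.scale."""
    synthesized = round(request.scale * request.synthesis_fraction)
    return request.scale - synthesized, synthesized


def filter_by_quality(clips: List[Dict], threshold: float) -> List[Dict]:
    """Keep clips whose (hypothetical) 'quality' metadata meets the threshold."""
    return [clip for clip in clips if clip["quality"] >= threshold]


request = DataRequest(
    query="road waterlogging",
    scale=1000,
    synthesis_fraction=0.25,
    quality_threshold=0.6,
)
retrieved, synthesized = split_budget(request)

candidates = [
    {"id": "a", "quality": 0.9},
    {"id": "b", "quality": 0.4},
]
kept = filter_by_quality(candidates, request.quality_threshold)
```

With these assumed values, the budget splits into 750 retrieved and 250 synthesized clips, and only clip `a` passes the quality filter. How the real system encodes the ratio or scores quality is not specified in the abstract.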

Paper Structure (32 sections, 14 figures, 2 tables)

Figures (14)

  • Figure 1: High-level architecture of VDCook. The system comprises automated ingestion (MCP), metadata enrichment modules, an index & retrieval service, a controllable synthesis engine, quality & provenance tracking, and a contribution/governance layer.
  • Figure 2: Analysis of our corpus: (a) Resolution distribution showing high-fidelity content; (b) Clip duration distribution primarily within 5–60 s.
  • Figure 3: Representative samples of road waterlogging events. Such scenarios are rare in generic datasets but critical for urban risk monitoring and autonomous systems.
  • Figure 4: Examples of dump trucks in construction environments. These scenes involve heavy machinery and dynamic urban contexts, often underrepresented in general-purpose datasets.
  • Figure 5: Clips of road snow accumulation under various lighting and weather conditions. Such data are important for transportation safety and seasonal robustness.
  • ...and 9 more figures