Tiger200K: Manually Curated High Visual Quality Video Dataset from UGC Platform
Xianpan Zhou
TL;DR
This paper addresses the shortage of open data with high visual quality for text-to-video models by introducing Tiger200K, a manually curated dataset sourced from UGC platforms and selected for visual fidelity. It presents a practical data-construction pipeline that combines manual curation, TransNetV2-based scene segmentation, safe-zone cropping via OCR and border analysis, motion filtering, and bilingual captions generated by a visual LLM. The authors report 85k scene segments and 170k video clips from 4151 videos, a substantial portion of them at 4K+ resolution, with quality controls that include safe-zone retention and bilingual captioning. The work aims to enable more effective post-training and quality-tuning of video generation models, and the authors plan ongoing expansion and open-source releases to accelerate research in open video generation.
Abstract
The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures. The dataset is built with a simple but effective pipeline comprising shot boundary detection, OCR, border detection, motion filtering, and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: https://tinytigerpan.github.io/tiger200k/
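
To make two of the pipeline stages named above more concrete, the following is a minimal sketch of border detection and motion filtering on grayscale frames. It is not the authors' implementation: the function names, the mean-brightness rule for finding borders, the frame-difference motion score, and all thresholds (`dark_thresh`, `min_motion`, `max_motion`) are hypothetical choices made for illustration only.

```python
import numpy as np


def detect_content_box(frame, dark_thresh=16.0):
    """Find the non-dark region of a grayscale frame (H x W array).

    Returns (top, bottom, left, right) with bottom/right exclusive,
    or None if the whole frame is dark. The brightness rule and the
    threshold are illustrative assumptions, not the paper's method.
    """
    frame = np.asarray(frame, dtype=np.float32)
    rows = np.where(frame.mean(axis=1) > dark_thresh)[0]
    cols = np.where(frame.mean(axis=0) > dark_thresh)[0]
    if rows.size == 0 or cols.size == 0:
        return None
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1


def motion_score(frames):
    """Mean absolute difference between consecutive frames (T x H x W)."""
    frames = np.asarray(frames, dtype=np.float32)
    if frames.shape[0] < 2:
        return 0.0
    return float(np.abs(np.diff(frames, axis=0)).mean())


def keep_clip(frames, min_motion=1.0, max_motion=40.0):
    """Keep clips whose motion score lies in a hypothetical accepted range.

    Assumption: near-static clips and extremely jittery clips are both
    undesirable; the source does not specify the actual criterion.
    """
    return min_motion <= motion_score(frames) <= max_motion


# Usage: a letterboxed frame and two toy clips.
frame = np.zeros((10, 12), dtype=np.float32)
frame[2:8, 3:9] = 200.0  # bright content surrounded by black borders
box = detect_content_box(frame)

static_clip = np.zeros((3, 4, 4), dtype=np.float32)
moving_clip = np.stack([np.full((4, 4), v, dtype=np.float32) for v in (0, 10, 20)])
```

In practice, a frame-difference score is only a cheap proxy for motion; an optical-flow-based measure would be a natural alternative, at higher cost.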
