Sekai: A Video Dataset towards World Exploration

Zhen Li; Chuanhao Li; Xiaofeng Mao; Shaoheng Lin; Ming Li; Shitian Zhao; Zhaopan Xu; Xinyue Li; Yukang Feng; Jianwen Sun; Zizhen Li; Fanrui Zhang; Jiaxin Ai; Zhixiang Wang; Yuwei Wu; Tong He; Jiangmiao Pang; Yu Qiao; Yunde Jia; Kaipeng Zhang

TL;DR

Sekai introduces a large-scale, long-form, worldwide egocentric video dataset for world exploration, combining YouTube and game footage across 101 countries with rich annotations including location, scene, weather, crowd density, captions, and camera trajectories. A dedicated curation pipeline (collection, pre-processing, annotation, and sampling) produces Sekai-Real and Sekai-Game, and a high-quality subset, Sekai-Real-HQ, supports model training. The authors validate annotation quality and demonstrate that fine-tuning video-generation models on Sekai-Real-HQ improves text-to-video and image-to-video generation, while the camera-trajectory annotations improve interactive video generation. The work positions Sekai as a resource for training world-exploration models and for applications in video understanding, navigation, and multimodal co-generation.

Abstract

Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training, as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset's scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.
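
To make the annotation types listed above concrete, the following is a minimal sketch of how a single annotated clip might be represented and filtered. The class name `ClipAnnotation`, its field names, the helper functions, and the label values ("sunny", "sparse", and so on) are all hypothetical placeholders chosen for illustration; they are not taken from the Sekai release, whose actual schema may differ.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical record layout: field names are illustrative only,
# not the dataset's actual schema.
@dataclass
class ClipAnnotation:
    video_id: str
    location: str
    scene: str
    weather: str
    crowd_density: str
    caption: str
    # Assumed form: one (x, y, z) camera position per sampled frame.
    camera_trajectory: List[Tuple[float, float, float]] = field(default_factory=list)


def filter_clips(clips: List[ClipAnnotation],
                 weather: Optional[str] = None,
                 crowd_density: Optional[str] = None) -> List[ClipAnnotation]:
    """Return the clips that match every attribute that is not None."""
    selected = []
    for clip in clips:
        if weather is not None and clip.weather != weather:
            continue
        if crowd_density is not None and clip.crowd_density != crowd_density:
            continue
        selected.append(clip)
    return selected


def path_length(trajectory: List[Tuple[float, float, float]]) -> float:
    """Sum of Euclidean distances between consecutive camera positions."""
    total = 0.0
    for (x0, y0, z0), (x1, y1, z1) in zip(trajectory, trajectory[1:]):
        total += math.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2 + (z1 - z0) ** 2)
    return total


# Placeholder records with made-up values, for illustration only.
clips = [
    ClipAnnotation("a", "City A", "street", "sunny", "sparse", "A walk.",
                   [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]),
    ClipAnnotation("b", "City B", "park", "rainy", "dense", "A flight."),
]
sunny = filter_clips(clips, weather="sunny")
```

A flat record like this keeps categorical attributes easy to filter on, while the per-frame trajectory is the kind of signal an interactive, camera-controlled generation model could condition on.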
