Sekai: A Video Dataset towards World Exploration

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, Zizhen Li, Fanrui Zhang, Jiaxin Ai, Zhixiang Wang, Yuwei Wu, Tong He, Jiangmiao Pang, Yu Qiao, Yunde Jia, Kaipeng Zhang

TL;DR

Sekai is a large-scale, long-form egocentric video dataset for world exploration. It combines YouTube and game footage across 101 countries with annotations for location, scene, weather, crowd density, captions, and camera trajectories. A dedicated curation pipeline (collection, pre-processing, annotation, and sampling) produces Sekai-Real and Sekai-Game, and a high-quality subset (Sekai-Real-HQ) supports model training. The authors validate annotation quality and show that fine-tuning video-generation models on Sekai-Real-HQ improves text-to-video and image-to-video generation, while the camera-trajectory annotations improve interactive video generation. The work positions Sekai as a resource for training world-exploration models and for applications in video understanding, navigation, and multimodal co-generation.

Abstract

Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well-suited for world exploration training as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning "world" in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Comprehensive analyses and experiments demonstrate the dataset's scale, diversity, annotation quality, and effectiveness for training video generation models. We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications. The project page is https://lixsp11.github.io/sekai-project/.
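
To make the annotation types listed above concrete, the sketch below shows one way a per-clip record could be represented and filtered. It is illustrative only: the class name `ClipAnnotation`, the helper `filter_clips`, every field name, and the trajectory representation (a list of per-frame camera positions) are assumptions made for this example, not the dataset's released schema or the authors' toolbox.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record; field names are illustrative assumptions."""
    video_id: str
    source: str                      # assumed values: "real" (YouTube) or "game"
    view: str                        # assumed values: "walking" or "drone"
    location: Optional[str] = None
    scene: Optional[str] = None
    weather: Optional[str] = None
    crowd_density: Optional[str] = None
    caption: Optional[str] = None
    # Assumed representation: one (x, y, z) camera position per annotated frame.
    trajectory: List[Tuple[float, float, float]] = field(default_factory=list)


def filter_clips(
    clips: Iterable[ClipAnnotation],
    *,
    view: Optional[str] = None,
    weather: Optional[str] = None,
    require_trajectory: bool = False,
) -> List[ClipAnnotation]:
    """Return clips matching every criterion that was provided."""
    selected = []
    for clip in clips:
        if view is not None and clip.view != view:
            continue
        if weather is not None and clip.weather != weather:
            continue
        if require_trajectory and not clip.trajectory:
            continue
        selected.append(clip)
    return selected


if __name__ == "__main__":
    clips = [
        ClipAnnotation("a", "real", "walking", weather="sunny",
                       trajectory=[(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]),
        ClipAnnotation("b", "real", "drone", weather="cloudy"),
        ClipAnnotation("c", "game", "walking", weather="sunny"),
    ]
    sunny_with_camera = filter_clips(clips, weather="sunny", require_trajectory=True)
    print([clip.video_id for clip in sunny_with_camera])
```

Keeping the trajectory optional in this sketch reflects that camera-trajectory annotations may not be available for every clip; a selection step for interactive generation would then filter on `require_trajectory=True`.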

Paper Structure

This paper contains 22 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Sekai is collected from YouTube and a video game. It consists of walking and drone-view egocentric videos with recorded audio. We provide rich annotations of camera trajectories, location, crowd density, scene, weather, time of day, and captions.
  • Figure 2: An overview of the Sekai dataset. Sekai-Real is collected from YouTube with high-quality annotations, while Sekai-Game is collected from a game with ground-truth annotations.
  • Figure 3: The dataset curation pipeline. * indicates that the statistics were derived from a subset of trajectory annotations.
  • Figure 4: Statistical information on five dimensions of the Sekai-Real dataset.
  • Figure 5: Statistics of the proposed Sekai-Real and Sekai-Real-HQ datasets.