Captain Safari: A World Engine

Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, Junfei Xiao

TL;DR

Captain Safari tackles long-horizon, 3D-consistent FPV video generation under aggressive 6-DoF motion. It introduces a pose-conditioned world memory with a local, pose-aware retrieval mechanism that supplies world tokens to condition a diffusion-based generator, maintaining coherent geometry along user-defined trajectories. To benchmark this setting, the authors release OpenSafari, a large-scale in-the-wild FPV drone dataset with verified camera poses. Results show state-of-the-art 3D consistency and trajectory following, with strong perceptual quality and a majority of human preferences, supporting the approach and establishing the dataset as a challenging benchmark for future world-engine research.

Abstract

World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing high-dynamic drone videos with verified camera trajectories, constructed through a multi-stage geometric and kinematic validation pipeline. Across video quality, 3D consistency, and trajectory following, Captain Safari substantially outperforms state-of-the-art camera-controlled generators. It reduces MEt3R from 0.3703 to 0.3690, improves AUC@30 from 0.181 to 0.200, and yields substantially lower FVD than all camera-controlled baselines. More importantly, in a 50-participant, 5-way human study where annotators select the best result among five anonymized models, 67.6% of preferences favor our method across all axes. Our results demonstrate that pose-conditioned world memory is a powerful mechanism for long-horizon, controllable video generation and provide OpenSafari as a challenging new benchmark for future world-engine research.
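The retrieval step described in the abstract (fetching pose-aligned world tokens from a local memory for a query camera pose) can be illustrated with a small sketch. This is not the authors' retriever, whose architecture is not described here; it swaps in a plain nearest-neighbor lookup with softmax blending. The names `pose_distance`, `retrieve_world_tokens`, `rot_weight`, and `temperature` are hypothetical, as is the representation of a pose as a `(rotation_matrix, translation)` pair and of a world token as a flat list of floats.

```python
import math

def rotation_angle(R_a, R_b):
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the sum of elementwise products.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)

def pose_distance(pose_a, pose_b, rot_weight=1.0):
    """Assumed 6-DoF distance: translation gap plus weighted rotation gap."""
    (R_a, t_a), (R_b, t_b) = pose_a, pose_b
    return math.dist(t_a, t_b) + rot_weight * rotation_angle(R_a, R_b)

def retrieve_world_tokens(memory, query_pose, k=2, temperature=1.0):
    """Blend the tokens of the k memory entries closest to query_pose.

    memory is a list of (pose, token) pairs. Closer entries receive
    larger softmax weights; this stands in for a learned retriever.
    """
    nearest = sorted(memory, key=lambda e: pose_distance(e[0], query_pose))[:k]
    weights = [math.exp(-pose_distance(p, query_pose) / temperature)
               for p, _ in nearest]
    total = sum(weights)
    dim = len(nearest[0][1])
    return [sum(w * tok[i] for w, (_, tok) in zip(weights, nearest)) / total
            for i in range(dim)]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy memory: three camera poses along the x-axis, each with a 2-d token.
memory = [
    ((IDENTITY, (0.0, 0.0, 0.0)), [1.0, 0.0]),
    ((IDENTITY, (2.0, 0.0, 0.0)), [0.0, 1.0]),
    ((IDENTITY, (10.0, 0.0, 0.0)), [5.0, 5.0]),
]

# A query halfway between the first two entries blends their tokens.
blended = retrieve_world_tokens(memory, (IDENTITY, (1.0, 0.0, 0.0)), k=2)
```

In a full system the blended tokens would be passed to the video generator as conditioning. Restricting the lookup to the `k` nearest poses mirrors the "local" aspect of the memory: distant views, such as the third entry above, do not influence the result.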

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Captain Safari is a pose-aware world engine that generates long-horizon, 3D-consistent FPV videos from any user-specified camera trajectory. By retrieving pose-aligned world memory, it keeps geometry stable across large viewpoint changes and reconstructs crisp, well-formed structures while faithfully tracking aggressive 6-DoF motion.
  • Figure 2: Method overview. Captain Safari builds a local world memory and, given a query camera pose, retrieves pose-aligned tokens that summarize the scene. These tokens then condition video generation along the user-specified trajectory, preserving a stable 3D layout.
  • Figure 3: OpenSafari. A new in-the-wild FPV dataset with rigorously verified camera trajectories, designed to stress-test geometry-consistent, camera-controllable video generation. We curate clips through a compact, multi-stage pipeline that filters, reconstructs, and verifies trajectories, yielding clean, motion-rich videos with reliable camera paths.
  • Figure 4: Qualitative comparisons. Left: Baselines, including the memory-removed variant, exhibit abrupt popping/vanishing of the school bus, and GF is low-quality. Captain Safari alone renders the bus smoothly exiting the frame. Right: Baselines distort or lose field markings, with Wan2.2 collapsing under large camera motion, affirming the challenge of 3D consistency under rapid trajectories. Captain Safari preserves crisp markings and coherent layout while following the fast 6-DoF path.
  • Figure 5: Scene reconstruction and camera trajectory. With pose-aligned memory, Captain Safari reconstructs a well-structured building façade (the memory-removed variant blurs/warps it), demonstrating the benefit of memory. It also preserves fine details—parked cars and the tree on their roofs—that Wan2.2-5B fails to retain. Meanwhile, Real-CamI2V follows only a short path, whereas Captain Safari covers the full trajectory with stable 3D structure, highlighting the challenge of maintaining 3D consistency under fast motion.