AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes

Yu Li, Menghan Xia, Gongye Liu, Jianhong Bai, Xintao Wang, Conglang Zhang, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Yujiu Yang

TL;DR

AdaViewPlanner tackles automatic viewpoint planning for 4D scenes by repurposing pre-trained text-to-video diffusion models. It introduces a two-stage pipeline: Stage I uses an adaptive-learning branch to inject 4D content into a T2V model and generate cinematically informed video with implicit camera motion, while Stage II adds a dedicated camera extrinsic diffusion branch to extract absolute camera poses aligned to the 4D scene. Training leverages synthetic Unreal Engine data and GVHMR-based alignment, with a curriculum-guided learning scheme to stabilize camera motion generation. Experimental results show substantial improvements over baselines across rule-based, MLLM-based, and user evaluations, highlighting the potential of video-generation models as world priors for 4D interaction and automated cinematography.

Abstract

Recent Text-to-Video (T2V) models have demonstrated a powerful capability for visually simulating real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning in given 4D scenes, since videos inherently pair dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm that adapts pre-trained T2V models for viewpoint prediction in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided camera extrinsic denoising process. Specifically, a camera extrinsic diffusion branch is added to the pre-trained T2V model, taking the generated video and the 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work demonstrates the potential of video generation models for 4D interaction in the real world.

Paper Structure

This paper contains 24 sections, 4 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Showcasing AdaViewPlanner: Given 4D contents and text prompts that depict the scene context and desired camera movements, we adapt pre-trained video diffusion models to generate a coordinate-aligned camera pose sequence as well as a corresponding video visualization.
  • Figure 2: (a) Stage I model for motion-conditioned cinematic video generation: a pose encoder processes human motion data (M) from 4D scenes and integrates it with video tokens via spatial motion attention to produce videos with cinematic camera movements. Camera parameters used for guidance are denoted as C. (b) Stage II model: three branches for video, camera, and human motion are combined in an MMDiT framework to extract camera pose.
  • Figure 3: Visualization of results. (Left) Human motion conditions; (Middle) Stage I generated videos; (Right) Stage II generated camera trajectories. AdaViewPlanner demonstrates the ability to design diverse, instruction-consistent, and human-centered camera trajectories.
  • Figure 4: Camera generation results with varying random seeds (top), scene context prompts (middle), and camera movement prompts (bottom).
  • Figure 5: Compared with other methods, our model generates smoother trajectories that better follow instructions, while also exhibiting a cinematographic style centered on human actions.
  • ...and 9 more figures