OmniCam: Unified Multimodal Video Generation via Camera Control

Xiaoda Yang, Jiayang Xu, Kaixuan Luan, Xinyu Zhan, Hongshun Qiu, Shijun Shi, Hao Li, Shuai Yang, Li Zhang, Checheng Yu, Cewu Lu, Lixin Yang

TL;DR

OmniCam addresses the challenge of flexible, high-quality camera-controlled video generation by unifying multimodal trajectory guidance and content references. It introduces a three-stage pipeline that converts text or video instructions into discrete motion representations, plans 6DoF camera trajectories, and renders videos through monocular reconstruction plus diffusion-based completion, guided by reinforcement-learning–style optimization. The work introduces OmniTr, the first multimodal camera-control dataset, and demonstrates state-of-the-art performance across quantitative metrics and human evaluations. This framework enables long, complex camera motions with robust cross-modal inputs, offering practical impact for advanced video editing, virtual production, and simulation scenarios.

Abstract

Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.

Paper Structure

This paper contains 28 sections, 10 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: An overview of OmniCam. Given diverse modalities of content references and trajectory guidance, OmniCam generates high-quality video sequences by camera motion control. Specifically, OmniCam integrates various combinations of content (e.g., image or video) and trajectory (e.g., text instructions or camera motion from video) references. This approach allows OmniCam to accurately synthesize videos consistent with user-specified inputs.
  • Figure 2: The OmniTr dataset consists of four key components: trajectory description, discrete motion representation, trajectory, and corresponding video sequence. The discrete motion representations are visualized, and the pie chart on the right shows the distribution of the motion attributes, indicating that the dataset covers all of them.
  • Figure 3: An overview of the OmniCam model pipeline. After receiving the trajectory reference, OmniCam first converts it into discrete motion representations through an LLM. Subsequently, OmniCam uses a trajectory planning algorithm to calculate the camera pose for each frame based on these motions. Combined with the content reference, OmniCam renders the initial view for each frame. Finally, it uses a diffusion model to complete unknown regions in the new viewpoints and stitches all frames together to generate a coherent video.
  • Figure 4: Text description for camera control. Each set of results demonstrates the generation effects of different types of camera motion combinations, including directional movements at specified angles, rotations, and other complex movements.
  • Figure 5: Video trajectory for camera control. OmniCam transfers the trajectory extracted from the input video to the output video. The first row shows the input and the second row shows the output.
  • ...and 5 more figures
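
The Figure 3 caption describes a planning step that turns discrete motion representations into a camera pose for each frame. The sketch below illustrates one way such a step could look. It is not the paper's algorithm: the `MotionSegment` fields, the `DIRECTION_DELTAS` vocabulary, and the `plan_trajectory` function are all hypothetical names chosen for illustration, and the pose is simplified to an additive 6DoF tuple (x, y, z, yaw, pitch, roll) in a fixed world frame, so translations are not rotated by the current heading.

```python
from dataclasses import dataclass

# Hypothetical motion vocabulary (an assumption, not taken from the paper).
# Each entry is a per-frame delta (dx, dy, dz, dyaw, dpitch, droll) at unit speed.
DIRECTION_DELTAS = {
    "left": (-1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    "right": (1.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    "up": (0.0, 1.0, 0.0, 0.0, 0.0, 0.0),
    "down": (0.0, -1.0, 0.0, 0.0, 0.0, 0.0),
    "forward": (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
    "backward": (0.0, 0.0, -1.0, 0.0, 0.0, 0.0),
    "yaw_left": (0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
    "yaw_right": (0.0, 0.0, 0.0, -1.0, 0.0, 0.0),
}


@dataclass(frozen=True)
class MotionSegment:
    """One discrete motion: a direction held over a frame interval (hypothetical schema)."""
    start_frame: int  # inclusive
    end_frame: int    # exclusive
    direction: str    # key into DIRECTION_DELTAS
    speed: float      # scene units (or degrees) per frame


def plan_trajectory(segments, num_frames):
    """Return one (x, y, z, yaw, pitch, roll) pose per frame.

    Frame 0 is the origin pose. A segment active during frame f moves
    the camera between frame f and frame f + 1. Overlapping segments
    are summed, which allows combined motions such as "move right
    while rotating left".
    """
    pose = [0.0] * 6
    poses = []
    for frame in range(num_frames):
        poses.append(tuple(pose))
        for seg in segments:
            if seg.start_frame <= frame < seg.end_frame:
                delta = DIRECTION_DELTAS[seg.direction]
                pose = [p + seg.speed * d for p, d in zip(pose, delta)]
    return poses


if __name__ == "__main__":
    # Move right for four frames; rotate left during the last two of them.
    segments = [
        MotionSegment(0, 4, "right", 0.5),
        MotionSegment(2, 4, "yaw_left", 10.0),
    ]
    for index, pose in enumerate(plan_trajectory(segments, 5)):
        print(index, pose)
```

In a fuller pipeline, each pose would be converted to a camera extrinsic matrix before rendering the initial view for that frame; that conversion and the rendering itself are outside the scope of this sketch.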