OmniCam: Unified Multimodal Video Generation via Camera Control
Xiaoda Yang, Jiayang Xu, Kaixuan Luan, Xinyu Zhan, Hongshun Qiu, Shijun Shi, Hao Li, Shuai Yang, Li Zhang, Checheng Yu, Cewu Lu, Lixin Yang
TL;DR
OmniCam addresses the challenge of flexible, high-quality camera-controlled video generation by unifying multimodal trajectory guidance and content references. It introduces a three-stage pipeline that converts text or video instructions into discrete motion representations, plans 6DoF camera trajectories, and renders videos through monocular reconstruction plus diffusion-based completion, guided by reinforcement-learning–style optimization. The work introduces OmniTr, the first multimodal camera-control dataset, and demonstrates state-of-the-art performance across quantitative metrics and human evaluations. This framework enables long, complex camera motions with robust cross-modal inputs, offering practical impact for advanced video editing, virtual production, and simulation scenarios.
Abstract
Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or a video exhibiting the desired trajectory as camera path guidance, and an image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.
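
To make the idea of turning discrete motion instructions into a 6DoF camera trajectory more concrete, the following is a minimal illustrative sketch, not OmniCam's implementation. The motion vocabulary (`MOTIONS`), the `Pose` layout, the `(token, amount, n_frames)` segment format, and the linear interpolation are all assumptions made for this example; the paper's actual representation and planner may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose:
    """6DoF camera pose: position (x, y, z) plus yaw/pitch/roll in degrees."""
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0


# Hypothetical motion vocabulary: token -> (pose field it changes, sign).
MOTIONS = {
    "move_right": ("x", 1.0),
    "move_left": ("x", -1.0),
    "move_up": ("y", 1.0),
    "move_down": ("y", -1.0),
    "move_forward": ("z", 1.0),
    "move_backward": ("z", -1.0),
    "pan_right": ("yaw", 1.0),
    "pan_left": ("yaw", -1.0),
    "tilt_up": ("pitch", 1.0),
    "tilt_down": ("pitch", -1.0),
    "roll_cw": ("roll", 1.0),
    "roll_ccw": ("roll", -1.0),
}


def plan_trajectory(segments, start=Pose()):
    """Expand (token, amount, n_frames) segments into one pose per frame.

    Each segment changes a single degree of freedom by `amount`, spread
    linearly over `n_frames` frames (an assumption of this sketch).
    The returned list begins with the start pose.
    """
    poses = [start]
    for token, amount, n_frames in segments:
        if n_frames < 1:
            raise ValueError("n_frames must be >= 1")
        field, sign = MOTIONS[token]
        base = poses[-1]
        origin = getattr(base, field)
        for i in range(1, n_frames + 1):
            value = origin + sign * amount * i / n_frames
            poses.append(replace(base, **{field: value}))
    return poses


# Usage: move forward 2 units over 4 frames, then pan left 30 degrees over 3.
trajectory = plan_trajectory([("move_forward", 2.0, 4), ("pan_left", 30.0, 3)])
for frame, pose in enumerate(trajectory):
    print(frame, pose)
```

In this sketch each segment alters only one degree of freedom at a time; compound motions (for example, moving forward while panning) would need segments that update several fields at once, which is omitted here for brevity.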
