OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video Generation

Weiqi Li, Shijie Zhao, Chong Mou, Xuhan Sheng, Zhenyu Zhang, Qian Wang, Junlin Li, Li Zhang, Jian Zhang

TL;DR

OmniDrag tackles controllable omnidirectional image-to-video generation by introducing an omnidirectional controller and a spherical motion estimator (SME) that together enable drag-style scene- and object-level control. The controller is jointly fine-tuned with the temporal attention layers of the base diffusion model and learns spherical motion patterns from the new Move360 dataset, which features large motions. The SME provides accurate training signals and intuitive inference-time control via spherical interpolation. Quantitative and qualitative results demonstrate superior performance over state-of-the-art text- or 2D-control-based methods in FID, FVD, and motion-consistency metrics, as well as in human evaluations. This work advances practical, high-quality omnidirectional video (ODV) generation with user-friendly motion control, supported by the Move360 data resource for future research.

Abstract

As virtual reality gains popularity, the demand for controllable creation of immersive and dynamic omnidirectional videos (ODVs) is increasing. While previous text-to-ODV generation methods achieve impressive results, they struggle with content inaccuracies and inconsistencies due to reliance solely on textual inputs. Although recent motion control techniques provide fine-grained control for video generation, directly applying these methods to ODVs often results in spatial distortion and unsatisfactory performance, especially with complex spherical motions. To tackle these challenges, we propose OmniDrag, the first approach enabling both scene- and object-level motion control for accurate, high-quality omnidirectional image-to-video generation. Building on pretrained video diffusion models, we introduce an omnidirectional control module, which is jointly fine-tuned with temporal attention layers to effectively handle complex spherical motion. In addition, we develop a novel spherical motion estimator that accurately extracts motion-control signals and allows users to perform drag-style ODV generation by simply drawing handle and target points. We also present a new dataset, named Move360, addressing the scarcity of ODV data with large scene and object motions. Experiments demonstrate the significant superiority of OmniDrag in achieving holistic scene-level and fine-grained object-level control for ODV generation. The project page is available at https://lwq20020127.github.io/OmniDrag.

Paper Structure

This paper contains 18 sections, 10 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Overall pipeline of the proposed OmniDrag. (a) During training, spherical motion is extracted by the proposed spherical motion estimator. The Omni Controller and temporal attention layers in the UNet denoiser are jointly fine-tuned. (b) During inference, OmniDrag allows users to simply select handle and target points on the reference image and generates ODVs with the corresponding motion.
  • Figure 2: Illustration of our spherical motion estimator (SME). In the training stage, given the input video $\mathbf{V}$, $\mathbf{P}^0$ is first initialized through equal-area iso-latitude pixelation. Trajectories $\mathcal{T}$ are then tracked and finally filtered into $\mathcal{T}'$ according to spherical distance (the paper's tracking and filtering equations). During inference, given point pairs supplied by users, the trajectories are estimated through spherical interpolation.
  • Figure 3: Our Move360 dataset. (a) We mount an Insta360 Titan on a filming car, enabling its movement along four degrees of freedom. (b) Sample frames from the Move360 dataset showcasing a wide range of scenes, including indoor spaces, green landscapes, urban environments, and nighttime settings. This diversity in motion and environments offers a rich dataset for the community.
  • Figure 4: Visual comparisons between DragNUWA [yin2023dragnuwa], MotionCtrl [wang2024motionctrl], DragAnything [wu2025draganything], and our OmniDrag. Our SME estimates reasonable trajectories on the sphere, and OmniDrag achieves precise and stable control under both scene-level (the top case: go forward on the road) and object-level (the bottom case: make the car move along the road) motion conditions, outperforming other methods.
  • Figure 5: Ablation study on jointly fine-tuning temporal attention layers and training with the proposed Move360 dataset. For each ERP image, we show a corresponding viewport at a specific perspective.
  • ...and 5 more figures
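
The inference-time step in the Figure 2 caption, where a trajectory is estimated from a user's handle and target points through spherical interpolation, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a standard equirectangular (ERP) pixel-to-sphere mapping and plain great-circle interpolation (slerp), and every function name below (`erp_to_unit`, `unit_to_erp`, `great_circle_angle`, `slerp`, `drag_trajectory`) is hypothetical.

```python
import math


def erp_to_unit(u, v, width, height):
    """Map an ERP pixel (u, v) to a unit vector (assumed convention:
    u spans longitude [-pi, pi), v spans latitude from +pi/2 down to -pi/2)."""
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def unit_to_erp(p, width, height):
    """Inverse of erp_to_unit: unit vector back to ERP pixel coordinates."""
    x, y, z = p
    lon = math.atan2(y, x)
    lat = math.asin(max(-1.0, min(1.0, z)))  # clamp against rounding error
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    return (u % width, v)


def great_circle_angle(p, q):
    """Spherical (great-circle) distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))


def slerp(p, q, t):
    """Interpolate along the great circle from p (t=0) to q (t=1)."""
    omega = great_circle_angle(p, q)
    if omega < 1e-8:
        return p  # the two points coincide
    if math.pi - omega < 1e-8:
        raise ValueError("antipodal points: the great circle is not unique")
    s = math.sin(omega)
    a = math.sin((1.0 - t) * omega) / s
    b = math.sin(t * omega) / s
    return tuple(a * pi + b * qi for pi, qi in zip(p, q))


def drag_trajectory(handle, target, num_frames, width, height):
    """One ERP point per frame, moving from handle to target on the sphere."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    p = erp_to_unit(handle[0], handle[1], width, height)
    q = erp_to_unit(target[0], target[1], width, height)
    return [unit_to_erp(slerp(p, q, i / (num_frames - 1)), width, height)
            for i in range(num_frames)]


if __name__ == "__main__":
    # Drag along the equator of a 1024x512 ERP image over 5 frames.
    for point in drag_trajectory((100, 256), (300, 256), 5, 1024, 512):
        print(point)
```

Along the equator the result coincides with straight-line interpolation in the ERP image, but away from it the two diverge: dragging between two points at 45° latitude that are 180° apart in longitude passes over the pole instead of staying on one image row. The same `great_circle_angle` helper is the kind of spherical distance that a trajectory filter like the one in the caption could threshold, though the paper's actual criterion is not given here.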