I2V3D: Controllable image-to-video generation with 3D guidance

Zhiyuan Zhang, Dongdong Chen, Jing Liao

TL;DR

I2V3D combines traditional computer graphics with diffusion-based synthesis to enable precise 3D control in image-to-video generation from a single image. It introduces a 3D reconstruction and rendering stage and a 3D-guided two-stage video generation pipeline: keyframe generation with LoRA customization, multi-view augmentation, and geometric guidance, followed by training-free 3D-guided interpolation. Ablations show gains in temporal coherence and 3D controllability. The approach supports arbitrary starting frames, extended sequences, and 3D scene editing (add/copy/replace/edit objects), and it outperforms the compared baselines both quantitatively and qualitatively. The framework lowers the barrier to entry for CG-quality video creation and offers a flexible path from static imagery to controllable, photorealistic animations.

Abstract

We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.
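
The two-stage structure described above can be sketched in a few lines of Python. This is only an illustration of the scheduling idea (refine sparse keyframes first, then fill the frames between each consecutive pair), not the authors' implementation. The function names, the fixed keyframe stride, and the linear weighting of the two bounding keyframes are all assumptions made here; the abstract does not specify how keyframes are chosen or how the bidirectional guidance is combined.

```python
from typing import List, Tuple


def select_keyframes(num_frames: int, stride: int) -> List[int]:
    """Pick evenly spaced keyframe indices (assumed scheme), always keeping the last frame."""
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def interpolation_segments(keyframes: List[int]) -> List[Tuple[int, int]]:
    """Pair consecutive keyframes; each pair bounds one interpolation segment."""
    return list(zip(keyframes[:-1], keyframes[1:]))


def bidirectional_weights(start: int, end: int) -> List[Tuple[int, float, float]]:
    """For each in-between frame, return (frame, weight_start, weight_end).

    Linear weighting is a hypothetical stand-in for bidirectional guidance:
    frames closer to a keyframe lean more heavily on that keyframe.
    """
    span = end - start
    weights = []
    for frame in range(start + 1, end):
        t = (frame - start) / span
        weights.append((frame, 1.0 - t, t))
    return weights


if __name__ == "__main__":
    keyframes = select_keyframes(num_frames=10, stride=4)
    for start, end in interpolation_segments(keyframes):
        print(start, end, bidirectional_weights(start, end))
```

In this sketch, stage one would run once per keyframe index and stage two once per segment, so the sequence length can grow by adding segments rather than by enlarging a single generation pass.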

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Starting with a single image, our method reconstructs the complete scene geometry and uses the CG pipeline to enable precise control of character animation (e.g., keyframe animation or skeleton control) and camera movement (e.g., the camera rotation in the 2nd and 3rd rows, and the camera panning and zooming in the 1st and 4th rows). We then apply geometric guidance, based on the coarse rendering results, to generate high-quality, controllable videos.
  • Figure 2: Our framework consists of three parts. First, we extract meshes from a single input image and use a 3D engine to create and preview a coarse animation. Next, we generate the keyframes using a 3D-guided process with an image diffusion model customized for the input image, incorporating multi-view augmentation and extended attention. Finally, we perform 3D-guided interpolation between generated keyframes to produce a high-quality, consistent video.
  • Figure 3: Qualitative comparison with baselines: (1st panel) human-like characters, (2nd panel) non-human objects. For human-like characters, MagicPose [chang2023magicpose] struggles with pose control (blue), and AnimateAnyone [hu2024animate] fails to preserve appearance (red). For non-human objects, MotionBooth [wu2024motionbooth] shows overfitting (blue), and DragAnything [wu2025draganything] shows error accumulation (red). ISculpting [yenphraphai2024image] exhibits frame inconsistency (yellow) for both categories. Our method outperforms them: it follows the geometry guidance of the coarse renderings while resolving their artifacts (pink).
  • Figure 4: Ablation on LoRA customization with multi-view image augmentation. The red boxes highlight overfitting to the frontal view.
  • Figure 5: Ablation on extended attention for consistency enhancement. The red boxes highlight inconsistencies between individual generated frames.
  • ...and 2 more figures