3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding

Yupeng Zhu, Xiongzhen Zhang, Ye Chen, Bingbing Ni

TL;DR

The paper tackles single-image 3D animation, where traditional pipelines and video-based methods force a trade-off between rendering quality and 3D controllability. It introduces a lightweight 3D animation framework built on a 2D–3D aligned proxy embedding: sparse 3D proxy nodes carry learnable texture features and are rendered by an implicit neural renderer, with diffusion priors guiding multi-view consistency. The method supports interactive animation through position-based dynamics rigging and generative animation via Puppeteer and AnyTop, and it completes the background coherently through foreground–background disentanglement. Experiments show efficient animation on low-power GPUs and better identity preservation, geometric and textural consistency, and controllable interactivity than video-based methods. This proxy-based paradigm offers a scalable path toward accessible, high-quality 3D animation from a single image, with potential extensions to multi-object scenes and complex backgrounds.

Abstract

3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.

Paper Structure

This paper contains 12 sections, 10 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: The pipeline of the framework. Given an image, our framework first integrates 3D reconstruction and generative models to obtain a geometrically coherent representation with spatially aligned 2D-3D shapes. Subsequently, through implicit neural rendering and SDS optimization, we derive a 3D asset exhibiting high-fidelity texture and multi-view consistency, which is then used to guide the 3D-aware editing and animation of the 2D image.
  • Figure 2: Visual comparison of different methods. Please zoom in for details.
  • Figure 3: Qualitative comparison between Sora and our proposed method across diverse animation tasks. Sora often struggles with precise motion control and physical plausibility in complex scenarios. In contrast, our framework supports rig-driven control (e.g., the breakdancing felt doll) and physical simulation (e.g., the swimming fish and swaying snake), ensuring superior temporal consistency and structural integrity. Our method demonstrates a higher degree of interpretability and controllability by explicitly modeling the underlying 3D skeletal or physical constraints.
  • Figure 4: Ablation study on the effect of our framework.
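
The TL;DR and Figure 3 mention position-based dynamics (PBD) as the basis for interactive, physically driven motion such as the swaying snake. The summary does not give the paper's actual formulation, so the following is only a minimal sketch of generic PBD with distance constraints, applied to a chain of nodes that stands in for sparse proxy nodes. All names (`pbd_step`, `simulate_chain`) and all parameter values (time step, gravity, iteration count, node spacing) are illustrative assumptions, not taken from the paper.

```python
import math


def pbd_step(positions, velocities, inv_masses, edges, rest_lengths,
             dt=1.0 / 60.0, gravity=(0.0, -9.8, 0.0), iterations=50):
    """Advance particles by one generic PBD step (hypothetical helper)."""
    # 1. Predict new positions from velocities and gravity.
    #    Particles with inverse mass 0 are pinned and never move.
    predicted = []
    for p, v, w in zip(positions, velocities, inv_masses):
        if w == 0.0:
            predicted.append(list(p))
        else:
            predicted.append([p[k] + dt * (v[k] + dt * gravity[k])
                              for k in range(3)])

    # 2. Repeatedly project each distance constraint |p_i - p_j| = rest.
    for _ in range(iterations):
        for (i, j), rest in zip(edges, rest_lengths):
            wi, wj = inv_masses[i], inv_masses[j]
            if wi + wj == 0.0:
                continue  # both endpoints pinned
            delta = [predicted[i][k] - predicted[j][k] for k in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist < 1e-12:
                continue  # degenerate edge, direction undefined
            # Correction magnitude shared between endpoints by inverse mass.
            corr = (dist - rest) / (dist * (wi + wj))
            for k in range(3):
                predicted[i][k] -= wi * corr * delta[k]
                predicted[j][k] += wj * corr * delta[k]

    # 3. Derive velocities from the corrected positions.
    new_velocities = [[(q[k] - p[k]) / dt for k in range(3)]
                      for p, q in zip(positions, predicted)]
    return predicted, new_velocities


def simulate_chain(num_nodes=5, spacing=0.25, steps=30):
    """Drop a horizontal chain whose first node is pinned (illustrative)."""
    positions = [[i * spacing, 0.0, 0.0] for i in range(num_nodes)]
    velocities = [[0.0, 0.0, 0.0] for _ in range(num_nodes)]
    inv_masses = [0.0] + [1.0] * (num_nodes - 1)  # node 0 is the anchor
    edges = [(i, i + 1) for i in range(num_nodes - 1)]
    rest_lengths = [spacing] * len(edges)
    for _ in range(steps):
        positions, velocities = pbd_step(
            positions, velocities, inv_masses, edges, rest_lengths)
    return positions, edges, rest_lengths


if __name__ == "__main__":
    final_positions, _, _ = simulate_chain()
    for node in final_positions:
        print([round(c, 3) for c in node])
```

In a sketch like this, a user drag would be modeled by pinning a node (inverse mass 0) and moving it directly; the constraint projection then pulls the remaining nodes along while keeping edge lengths near their rest values.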