3DProxyImg: Controllable 3D-Aware Animation Synthesis from Single Image via 2D-3D Aligned Proxy Embedding

Yupeng Zhu; Xiongzhen Zhang; Ye Chen; Bingbing Ni

TL;DR

The paper tackles single-image 3D animation by addressing the trade-off between rendering quality and 3D controllability found in traditional pipelines and video-based methods. It introduces a lightweight 3D animation framework based on a 2D–3D aligned proxy embedding, where sparse 3D proxy nodes carry learnable texture features and are rendered via an implicit neural renderer guided by diffusion priors for multi-view consistency. The method supports both interactive animation through a position-based dynamics rigging approach and generative animation via Puppeteer and AnyTop, while ensuring coherent background completion through foreground–background disentanglement. Experiments show the approach achieves efficient animation on low-power GPUs and outperforms video-based methods in identity preservation, geometry, and texture consistency, as well as in controllable interactivity. This proxy-based paradigm offers a scalable path toward accessible, high-quality 3D animation from a single image with potential extensions to multi-object scenes and complex backgrounds.

Abstract

3D animation is central to modern visual media, yet traditional production pipelines remain labor-intensive, expertise-demanding, and computationally expensive. Recent AIGC-based approaches partially automate asset creation and rigging, but they either inherit the heavy costs of full 3D pipelines or rely on video-synthesis paradigms that sacrifice 3D controllability and interactivity. We focus on single-image 3D animation generation and argue that progress is fundamentally constrained by a trade-off between rendering quality and 3D control. To address this limitation, we propose a lightweight 3D animation framework that decouples geometric control from appearance synthesis. The core idea is a 2D-3D aligned proxy representation that uses a coarse 3D estimate as a structural carrier, while delegating high-fidelity appearance and view synthesis to learned image-space generative priors. This proxy formulation enables 3D-aware motion control and interaction comparable to classical pipelines, without requiring accurate geometry or expensive optimization, and naturally extends to coherent background animation. Extensive experiments demonstrate that our method achieves efficient animation generation on low-power platforms and outperforms video-based 3D animation generation in identity preservation, geometric and textural consistency, and the level of precise, interactive control it offers to users.
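
The TL;DR mentions a position-based dynamics (PBD) rigging approach for interactive animation of the sparse proxy nodes. The source gives no implementation details, so the following is only a minimal sketch of one generic PBD step with distance constraints, not the authors' method. The function name `pbd_step`, the edge list, the pinned-node convention (inverse mass of zero), and all parameter values are assumptions made for illustration.

```python
import math


def pbd_step(positions, velocities, inv_masses, edges, rest_lengths,
             dt=1.0 / 60.0, gravity=(0.0, -9.8, 0.0), iterations=10):
    """One generic position-based dynamics step over a set of 3D nodes.

    Hypothetical helper: nodes stand in for sparse proxy nodes, and edges
    are distance constraints between them. inv_masses[i] == 0 pins node i.
    """
    n = len(positions)

    # 1. Predict new positions from external forces (gravity only here).
    predicted = []
    for i in range(n):
        if inv_masses[i] == 0.0:
            predicted.append(list(positions[i]))  # pinned nodes do not move
            continue
        v = [velocities[i][k] + dt * gravity[k] for k in range(3)]
        predicted.append([positions[i][k] + dt * v[k] for k in range(3)])

    # 2. Iteratively project distance constraints |p_i - p_j| = rest.
    for _ in range(iterations):
        for (i, j), rest in zip(edges, rest_lengths):
            w_sum = inv_masses[i] + inv_masses[j]
            if w_sum == 0.0:
                continue  # both endpoints pinned
            delta = [predicted[i][k] - predicted[j][k] for k in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist < 1e-12:
                continue  # direction undefined for coincident nodes
            scale = (dist - rest) / (dist * w_sum)
            for k in range(3):
                predicted[i][k] -= inv_masses[i] * scale * delta[k]
                predicted[j][k] += inv_masses[j] * scale * delta[k]

    # 3. Derive velocities from the corrected position change.
    new_velocities = [
        [(predicted[i][k] - positions[i][k]) / dt for k in range(3)]
        for i in range(n)
    ]
    return predicted, new_velocities


# Usage: node 0 is pinned (e.g. held by the user), node 1 hangs from it.
pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
new_pos, new_vel = pbd_step(pos, vel, [0.0, 1.0], [(0, 1)], [1.0])
```

In a setup like this, dragging a pinned node and re-running the step would propagate the motion through the constraint graph, which is the kind of interactive control the TL;DR describes; how the paper actually builds its constraints is not stated in the source.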
