Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Zhenwei Wang, Tengfei Wang, Zexin He, Gerhard Hancke, Ziwei Liu, Rynson W. H. Lau

TL;DR

Phidias addresses the challenge of producing high-quality, generalizable, and controllable 3D content from text, image, or 3D references. It introduces a reference-augmented diffusion framework that conditions a multi-view diffusion model on a 3D reference via canonical coordinate maps (CCMs). The framework adds three components (meta-ControlNet, dynamic reference routing, and self-reference augmentation) and then applies sparse-view 3D reconstruction. The approach is presented as the first reference-based 3D-aware diffusion model and shows strong improvements over state-of-the-art baselines on image-to-3D tasks, with applications including text-to-3D, retrieval-augmented generation, and interactive 3D creation. The work offers a unified, controllable pipeline for 3D content generation that uses external references and self-supervised training to improve realism and generalization in practical scenarios.

Abstract

In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.

Paper Structure (22 sections, 3 equations, 19 figures, 2 tables)

Figures (19)

  • Figure 1: The proposed model, Phidias, can produce high-quality 3D assets given 3D references, which can be obtained via retrieval (top two rows) or specified by users (bottom row). It supports 3D generation from a single image, a text prompt, or an existing 3D model.
  • Figure 2: Overview of the Phidias model. It generates a 3D model in two stages: (1) reference-augmented multi-view generation and (2) sparse-view 3D reconstruction.
  • Figure 3: Architectural designs for meta-ControlNet (a) and dynamic reference routing (b).
  • Figure 4: Diverse retrieval-augmented image-to-3D results. Phidias can generate diverse 3D models with different references for a single input image.
  • Figure 5: Qualitative comparisons on image-to-3D generation.
  • ...and 14 more figures