Any-to-3D Generation via Hybrid Diffusion Supervision

Yijun Fan, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

TL;DR

This work introduces XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques, and presents a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities.

Abstract

Recent progress in 3D object generation has been fueled by the strong priors offered by diffusion models. However, existing models are tailored to specific tasks, accommodating only one modality at a time and necessitating retraining to change modalities. Given an image-to-3D model and a text prompt, a naive approach is to convert the text prompt to an image and then use the image-to-3D model for generation. This approach is both time-consuming and labor-intensive, and it incurs unavoidable information loss during modality conversion. To address this, we introduce XBind, a unified framework for any-to-3D generation using cross-modal pre-alignment techniques. XBind integrates a multimodal-aligned encoder with pre-trained diffusion models to generate 3D objects from any modality, including text, images, and audio. We subsequently present a novel loss function, termed Modality Similarity (MS) Loss, which aligns the embeddings of the modality prompts and the rendered images, facilitating improved alignment of the 3D objects with multiple modalities. Additionally, Hybrid Diffusion Supervision combined with a Three-Phase Optimization process improves the quality of the generated 3D objects. Extensive experiments showcase XBind's broad generation capabilities in any-to-3D scenarios. To our knowledge, this is the first method to generate 3D objects from prompts of any modality. Project page: https://zeroooooooow1440.github.io/.
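
The abstract describes the MS Loss only at a high level: it aligns the embedding of the modality prompt with the embeddings of rendered images. A minimal sketch of one way such an objective could look is given below. It assumes a cosine-distance form averaged over rendered views, which is an assumption and not the formula from the paper. The function names `cosine_similarity` and `modality_similarity_loss` are hypothetical, and the embeddings are plain lists of floats standing in for the outputs of a multimodal-aligned encoder.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def modality_similarity_loss(
    prompt_emb: Sequence[float],
    render_embs: Sequence[Sequence[float]],
) -> float:
    """Hypothetical MS-style loss (assumed form, not taken from the paper).

    Returns 1 minus the mean cosine similarity between the prompt
    embedding and the embedding of each rendered view, so the loss is 0
    when every view points in the same direction as the prompt.
    """
    if not render_embs:
        raise ValueError("at least one rendered-view embedding is required")
    sims = [cosine_similarity(prompt_emb, r) for r in render_embs]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings: one view aligned with the prompt, one orthogonal.
    prompt = [1.0, 0.0]
    views = [[2.0, 0.0], [0.0, 3.0]]
    print(modality_similarity_loss(prompt, views))
```

In an actual optimization loop the embeddings would come from a differentiable encoder, and the loss would be back-propagated through the renderer to the 3D representation; this sketch only illustrates the scalar objective under the stated assumption.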

Paper Structure

This paper contains 33 sections, 15 equations, 7 figures, 2 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Comparison of various methods for Any-to-3D generation: (a) Utilizing separate expert models for Any-to-3D generation. (b) Simply concatenating a multimodal-aligned encoder and a 2D diffusion model to achieve Any-to-3D generation. (c) Our proposed XBind. Since there are no existing audio-to-3D models, the results for the audio prompt in (a) are replaced with a question mark.
  • Figure 2: Overview of our method. XBind first encodes the input modality using a multimodal-aligned encoder, mapping it into a shared modality space. This aligned modality is then used as a condition for both 2D and 3D diffusion models. Hybrid diffusion supervision, combining planar and stereoscopic supervision, is applied to optimize the NeRF/Mesh.
  • Figure 3: Examples generated by XBind. The first row represents text-to-3D, the second row represents image-to-3D with the image prompt input located at the bottom left corner of each generated result, and the third row represents audio-to-3D.
  • Figure 4: Qualitative comparison with baselines. The first row represents text-to-3D, the second row represents image-to-3D, and the third row represents audio-to-3D.
  • Figure 5: Qualitative comparison with SOTA methods in the text-to-3D domain.
  • ...and 2 more figures