MTFusion: Reconstructing Any 3D Object from Single Image Using Multi-word Textual Inversion

Yu Liu, Ruowei Wang, Jiaqi Li, Zixiang Xu, Qijun Zhao

TL;DR

This work proposes MTFusion, which leverages both image data and textual descriptions for high-fidelity 3D reconstruction, and surpasses existing image-to-3D methods on a wide range of synthetic and real-world images.

Abstract

Reconstructing 3D models from single-view images is a long-standing problem in computer vision. Recent single-image 3D reconstruction methods extract a textual description from the input image and then use it to synthesize 3D models. However, existing methods focus on capturing a single key attribute of the image (e.g., object type, artistic style) and fail to consider the multi-perspective information required for accurate 3D reconstruction, such as object shape and material properties. In addition, their reliance on Neural Radiance Fields hinders their ability to reconstruct intricate surfaces and texture details. In this work, we propose MTFusion, which leverages both image data and textual descriptions for high-fidelity 3D reconstruction. Our approach consists of two stages. First, we adopt a novel multi-word textual inversion technique to extract a detailed text description capturing the image's characteristics. Then, we use this description and the image to generate a 3D model with FlexiCubes. Additionally, MTFusion enhances FlexiCubes by employing a special decoder network for Signed Distance Functions, leading to faster training and finer surface representation. Extensive evaluations demonstrate that MTFusion surpasses existing image-to-3D methods on a wide range of synthetic and real-world images. Furthermore, the ablation study supports the effectiveness of our network designs.

Paper Structure

This paper contains 26 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Given a single-view image, MTFusion generates a textured mesh using supervision from the image and a pseudo-prompt.
  • Figure 2: Overview of MTFusion. Our approach extracts a textual description from the input image and constrains the 3D model based on this description and the image.
  • Figure 3: Overview of our proposed Multi-Word Textual Inversion. The optimization of text embedding is based on a gradient-free approach, which iteratively employs an evolution strategy to explore and exploit pseudo-token embeddings.
  • Figure 4: Qualitative comparison with RealFusion, Make-It-3D and Magic123 on synthetic and real-world images. For real-world images, we first remove the background with u2net, then use the preprocessed RGB-A images for 3D modeling. Thanks to Multi-Word Textual Inversion, our method better captures global semantic information in the image, such as color consistency (second row) and spatial relationships (fourth row).
  • Figure 5: Ablation study on the enhanced FlexiCubes with hashgrid positional encoding. During the 3D mesh generation process from a given textual description ("A pineapple."), our enhanced FlexiCubes shows better training stability and robustness.
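
The Figure 3 caption describes a gradient-free optimization in which an evolution strategy explores and exploits pseudo-token embeddings. The sketch below illustrates that general idea only and is not the authors' algorithm: the function names, the hyperparameters, and the toy objective are all hypothetical. In particular, `make_toy_score` stands in for whatever image-text matching score the method actually uses, which this summary does not specify.

```python
import random


def make_toy_score(target):
    """Hypothetical black-box objective: negative squared distance to a
    hidden target. A stand-in for an image-text similarity score."""
    def score(tokens):
        return -sum(
            (x - y) ** 2
            for tok, tgt in zip(tokens, target)
            for x, y in zip(tok, tgt)
        )
    return score


def evolve_pseudo_tokens(score_fn, num_tokens=3, dim=8, population=16,
                         elite=4, sigma=0.3, steps=60, seed=0):
    """Gradient-free search over several pseudo-token embeddings.

    score_fn is only evaluated, never differentiated (higher is better).
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Search centre: one embedding vector per pseudo-token.
    mean = [[0.0] * dim for _ in range(num_tokens)]
    best, best_score = mean, score_fn(mean)
    for _ in range(steps):
        # Explore: sample candidates by perturbing the centre with noise.
        candidates = [
            [[m + rng.gauss(0.0, sigma) for m in token] for token in mean]
            for _ in range(population)
        ]
        # Exploit: keep the top-scoring candidates and average them.
        ranked = sorted(candidates, key=score_fn, reverse=True)[:elite]
        mean = [
            [sum(c[t][d] for c in ranked) / elite for d in range(dim)]
            for t in range(num_tokens)
        ]
        top_score = score_fn(ranked[0])
        if top_score > best_score:
            best, best_score = ranked[0], top_score
        sigma *= 0.97  # shrink the search radius over time
    return best, best_score


# Toy run: three pseudo-tokens of dimension 8 and a random hidden target.
_rng = random.Random(1)
target = [[_rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(3)]
score = make_toy_score(target)
start_score = score([[0.0] * 8 for _ in range(3)])
best, best_score = evolve_pseudo_tokens(score)
```

Because the objective is treated as a black box, the same loop would work with a non-differentiable score. Optimizing several tokens jointly, rather than a single one, is what lets the search represent more than one attribute of the image at a time.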