MeTTA: Single-View to 3D Textured Mesh Reconstruction with Test-Time Adaptation

Kim Yu-Ji, Hyunwoo Ha, Kim Youwang, Jaeheung Surh, Hyowon Ha, Tae-Hyun Oh

TL;DR

MeTTA tackles the challenge of reconstructing accurate 3D textured meshes from a single image, especially under out-of-distribution (OoD) conditions. It introduces test-time adaptation that leverages a pre-trained multi-view diffusion prior, a learnable virtual camera for robust 2D–3D alignment, and DMTet-based geometry with neural PBR texture optimization. The method jointly optimizes shape, texture, and camera using Score Distillation Sampling (SDS) alongside photometric, mask, and smoothness losses, producing photorealistic, view-consistent reconstructions that can be used in graphics engines. Experiments on cross-domain and real-world data show improved geometry and texture realism compared with feed-forward and prior-guided iterative approaches, at an optimization time of about 30 minutes per object.
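
Read as a formula, the joint optimization described above can be thought of as minimizing a combination of the four loss families over the shape, texture, and camera parameters. The block below is only a sketch under that assumption: the weighted-sum form, the symbols, and the weights $\lambda$ are illustrative placeholders, not the paper's notation or values.

```latex
% Assumed form: a weighted sum of the loss terms named in the summary.
% G = geometry, T = texture, c = camera parameters (placeholder symbols).
\min_{G,\,T,\,c}\;
\mathcal{L}_\text{total}
  = \lambda_\text{SDS}\,\mathcal{L}_\text{SDS}
  + \lambda_\text{photo}\,\mathcal{L}_\text{photo}
  + \lambda_\text{mask}\,\mathcal{L}_\text{mask}
  + \lambda_\text{smooth}\,\mathcal{L}_\text{smooth}
```

Under this reading, the SDS term carries the diffusion prior for unseen views, while the photometric and mask terms tie the rendering from the estimated viewpoint to the reference image.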

Abstract

3D reconstruction from a single-view image is a long-standing challenge. Learning-based methods are a popular approach to this problem, but test cases that differ from the training data (out-of-distribution; OoD) pose an additional challenge. To adapt to unseen samples at test time, we propose MeTTA, a test-time adaptation (TTA) method that exploits a generative prior. We jointly optimize 3D geometry, appearance, and pose to handle OoD cases given only a single-view image. However, the alignment between the reference image and the 3D shape via the estimated viewpoint can be erroneous, which leads to ambiguity. To address this ambiguity, we design learnable virtual cameras together with their self-calibration. In our experiments, we demonstrate that MeTTA handles OoD scenarios in which existing learning-based 3D reconstruction models fail, and that it yields a realistic appearance with physically based rendering (PBR) textures.

Paper Structure

This paper contains 47 sections, 6 equations, 23 figures, 5 tables.

Figures (23)

  • Figure 1: Distribution gap between train and test. "Train" refers to a sample on which the image-to-3D model is trained, and "Test" is an in-the-wild sample we captured.
  • Figure 2: Cross-domain evaluation of single-view-to-mesh methods. We evaluate on the unseen test dataset 3D-FRONT (Fu et al., 2021).
  • Figure 3: Overview of MeTTA. We propose a test-time adaptation pipeline to reconstruct a 3D mesh with PBR texture from a single-view image. "Ref. Image" refers to the reference input image. "Seg. Image" refers to the object-segmented image obtained from "Ref. Image".
  • Figure 4: Ablation studies. To validate our pipeline design, we perform ablation studies in which the initial mesh or the viewpoint prediction is absent. When the initial mesh is missing, we initialize the 3D space with an ellipsoid. The canonical viewpoint is the one whose azimuth and elevation angles are both 0$^\circ$.
  • Figure 5: Learnable virtual camera. The reference image is taken from the viewpoint ($\theta_\text{ref}, \phi_\text{ref}, r_\text{ref}$), which we estimate and optimize. The green dot marks the viewpoint predicted from the single-view image. The blue dot marks the canonical viewpoint, whose elevation and azimuth angles are both 0$^{\circ}$.
  • ...and 18 more figures
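
The viewpoint parameterization mentioned in the Figure 5 caption (two angles plus a radius) can be illustrated with a small sketch that converts such a triple into a camera position on a sphere around the object. The function name `camera_position`, the y-up axis convention, and the example radius of 2.0 are assumptions made for illustration; the source does not specify them, nor which of $\theta$ and $\phi$ denotes elevation.

```python
import math


def camera_position(elevation_deg, azimuth_deg, radius):
    """Return the (x, y, z) camera center on a sphere around the origin.

    Assumed convention (not from the source): y is up, and the canonical
    viewpoint (elevation = azimuth = 0) sits on the +z axis.
    """
    elev = math.radians(elevation_deg)
    azim = math.radians(azimuth_deg)
    x = radius * math.cos(elev) * math.sin(azim)
    y = radius * math.sin(elev)
    z = radius * math.cos(elev) * math.cos(azim)
    return (x, y, z)


# Canonical viewpoint: both angles are 0 degrees; the radius is a placeholder.
canonical = camera_position(0.0, 0.0, 2.0)
print(canonical)  # (0.0, 0.0, 2.0)
```

In a learnable-camera setup, the three inputs would be the optimized quantities, initialized from the predicted viewpoint instead of the canonical one.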