Twist and Compute: The Cost of Pose in 3D Generative Diffusion

Kyle Fogarty, Jack Foster, Boqiao Zhang, Jing Yang, Cengiz Öztireli

TL;DR

This work reveals a strong canonical-view bias in a state-of-the-art image-to-3D diffusion pipeline: inputs rotated away from the canonical pose degrade the generated 3D shapes. By probing with in-plane rotations and a cross-modal ULIP metric, the authors show that the bias persists across airplanes, chairs, and cars and is not remedied by simply increasing the number of diffusion steps. They propose a lightweight CNN-based orientation corrector that detects the rotation and re-canonicalizes input images before 3D generation, restoring performance without altering the generative backbone. The findings argue for incorporating symmetry-aware or modular designs in large-scale 3D generative systems to achieve robust, viewpoint-consistent outputs in practical applications.

Abstract

Despite their impressive results, large-scale image-to-3D generative models remain opaque in their inductive biases. We identify a significant limitation in image-conditioned 3D generative models: a strong canonical view bias. Through controlled experiments using simple 2D rotations, we show that the state-of-the-art Hunyuan3D 2.0 model can struggle to generalize across viewpoints, with performance degrading under rotated inputs. We show that this failure can be mitigated by a lightweight CNN that detects and corrects input orientation, restoring model performance without modifying the generative backbone. Our findings raise an important open question: Is scale enough, or should we pursue modular, symmetry-aware designs?
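The correction step described above can be illustrated with a minimal sketch. It assumes the corrector outputs one of the four in-plane angles used in the experiments (0°, 90°, 180°, 270°, counter-clockwise); the CNN itself is not shown, and the names `rotate` and `recanonicalize` are hypothetical, not taken from the paper.

```python
import numpy as np

# Assumption: the orientation predictor is a 4-way classifier over these angles.
ANGLES = (0, 90, 180, 270)

def rotate(image, angle):
    """Rotate an H x W (x C) image counter-clockwise by a multiple of 90 degrees."""
    if angle not in ANGLES:
        raise ValueError(f"angle must be one of {ANGLES}, got {angle}")
    return np.rot90(image, k=angle // 90, axes=(0, 1))

def recanonicalize(image, predicted_angle):
    """Undo a predicted rotation by applying the inverse rotation."""
    return rotate(image, (360 - predicted_angle) % 360)
```

If the predicted angle is correct, `recanonicalize(rotate(img, a), a)` returns the original image, so the generative backbone only ever sees canonically oriented inputs and needs no retraining.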

Paper Structure

This paper contains 18 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Equivariance between world and observation spaces preserves identity.
  • Figure 2: Rotating the input image breaks the image-to-3D pipeline (canonical-view bias), while a lightweight CNN predicts the rotation and applies its inverse to re-canonicalize the image before 3D generation, restoring performance. Note that the degraded 3D mesh is rotated to the canonical frame for better visualization.
  • Figure 3: ULIP similarity (higher is better) versus diffusion inference steps for Hunyuan3D 2.0 under four in-plane input rotations $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$, shown separately for (a) airplanes, (b) chairs, and (c) cars. Across all categories, the canonical $0^\circ$ view consistently achieves the highest ULIP, while non-canonical rotations suffer substantial drops.
  • Figure 4: Qualitative effect of input rotation on Hunyuan3D 2.0. For each object, the input image is rotated by $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. To ensure comparability, all generated meshes are reoriented to a common camera before being rendered from canonical front and side views. Non-canonical inputs induce systematic geometric failures (e.g., collapsed airplane wings, misaligned/duplicated chair legs), whereas the $0^\circ$ view remains stable, illustrating a strong canonical-view bias.
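The rotation probe behind Figures 3 and 4 can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' evaluation code: `generate` stands in for the image-to-3D model and `score` for a similarity metric such as ULIP, and both are hypothetical callables supplied by the caller.

```python
import numpy as np

def probe_rotations(image, generate, score, angles=(0, 90, 180, 270)):
    """Score the generated output for each in-plane rotation of the input.

    generate: hypothetical callable mapping an image to a 3D output.
    score:    hypothetical callable mapping that output to a float
              (higher is better), e.g. a wrapper around a ULIP-style metric.
    """
    results = {}
    for angle in angles:
        # Counter-clockwise rotation by a multiple of 90 degrees.
        rotated = np.rot90(image, k=angle // 90, axes=(0, 1))
        results[angle] = score(generate(rotated))
    return results
```

A canonical-view bias would show up as a clearly higher score at angle 0 than at the other three angles, which is the pattern Figure 3 reports.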