Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg

TL;DR

Cue3D provides a model-agnostic framework to dissect how monocular image cues drive single-image 3D generation. By perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity across seven diverse methods and two standard datasets, the study reveals that shape meaningfulness and shading are key drivers of generalization and geometry quality, while texture is less critical. The results show native 3D generative models consistently outperform others in 3D geometry and symmetry, with notable sensitivity to silhouettes and varying dependence on other cues across model families. These insights offer pathways toward more transparent, robust, and controllable single-image 3D generation and highlight the value of cue-aware design and evaluation. The work also provides a comprehensive benchmark and analyses that can guide future research in interpretable and reliable image-to-3D generation pipelines.

Abstract

Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

Paper Structure

This paper contains 20 sections, 1 equation, 18 figures, 10 tables.

Figures (18)

  • Figure 1: We present Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Left: Our unified evaluation of single-image 3D generation methods. Right: Performance robustness to the perturbation of each cue; lower values indicate higher importance. We show representative methods on the Toys4K dataset for clarity; additional figures are available in the Appendix.
  • Figure 2: Overview of perturbations for analyzing individual image cues in single-image 3D generation. Starting from the original image, we systematically perturb specific visual cues. These targeted perturbations reveal the extent to which each cue influences model performance.
  • Figure 3: Illustration of the three single-image 3D generation paradigms evaluated in this paper: regression-based methods (OpenLRM [openlrm], SF3D [boss2024sf3d]), multi-view approaches (CRM [wang2024crm], LGM [tang2024lgm], InstantMesh [xu2024instantmesh]), and native 3D generative models (Trellis [xiang2024structured], Hunyuan3D-2 [zhao2025hunyuan3d]).
  • Figure 3: Analysis on the correlation of different cues. We present the Spearman rank correlations ($\rho$) between per-object performance drops in CD for each cue pair. Lower off-diagonal values indicate weaker similarity in object-wise effects; the diagonal is 1 by definition.
  • Figure 4: Qualitative comparison on the Zeroverse dataset of shapes without semantic meaning. We show one representative method from each paradigm.
  • ...and 13 more figures
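
The cue-correlation caption above describes Spearman rank correlations ($\rho$) between per-object performance drops in CD for each cue pair. As a rough illustration of that computation, and not the authors' code, the sketch below ranks the per-object drops (averaging tied ranks) and takes the Pearson correlation of the ranks. The cue names, the numbers in `cd_drops`, and all function names are hypothetical placeholders; the sketch also assumes a "drop" is stored as one number per object, in the same object order for every cue.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cue_correlation_matrix(drops):
    """Map each cue pair to the Spearman rho of their per-object drops."""
    cues = list(drops)
    return {a: {b: spearman(drops[a], drops[b]) for b in cues} for a in cues}


# Hypothetical per-object performance drops (made-up numbers, five objects).
cd_drops = {
    "shading": [0.10, 0.40, 0.20, 0.80, 0.30],
    "texture": [0.02, 0.05, 0.03, 0.09, 0.04],
    "silhouette": [0.90, 0.10, 0.50, 0.20, 0.70],
}
rho = cue_correlation_matrix(cd_drops)
for cue, row in rho.items():
    print(cue, {other: round(value, 2) for other, value in row.items()})
```

In this toy data, the shading and texture perturbations order the objects identically, so their $\rho$ is 1, while shading and silhouette affect mostly different objects and get a negative $\rho$. The diagonal is 1 because each cue is compared with itself, matching the caption.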