Table of Contents
Fetching ...

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

Wufei Ma, Guanning Zeng, Guofeng Zhang, Qihao Liu, Letian Zhang, Adam Kortylewski, Yaoyao Liu, Alan Yuille

TL;DR

ImageNet3D tackles the challenge of general-purpose object-level 3D understanding by introducing a large-scale real-image dataset that extends 200 ImageNet categories with 6D pose, 3D location, 2D bounding boxes, and captions interleaved with 3D information. It features cross-category 3D alignment of canonical poses and GPT-assisted natural captions to bridge vision and language for 3D reasoning. The paper defines three tasks—object-level 3D awareness probing, open-vocabulary pose estimation, and joint image classification with category-level pose estimation—and provides baseline results across diverse architectures and modeling paradigms (classification heads, 3D meshes, and LLM-augmented models). Findings show that current models possess partial object-level 3D awareness and can generalize to some novel categories, but significant gaps remain, highlighting the need for scalable general-purpose 3D models and further research into cross-category 3D understanding and language-integrated reasoning.

Abstract

A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains, and fail to generalize. In this work, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, 3D location annotations, and image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we could (i) analyze the object-level 3D awareness of visual foundation models, and (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning.. We consider two new tasks, probing of object-level 3D awareness and open vocabulary pose estimation, besides standard classification and pose estimation. Experimental results on ImageNet3D demonstrate the potential of our dataset in building vision models with stronger general-purpose object-level 3D understanding.

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

TL;DR

ImageNet3D tackles the challenge of general-purpose object-level 3D understanding by introducing a large-scale real-image dataset that extends 200 ImageNet categories with 6D pose, 3D location, 2D bounding boxes, and captions interleaved with 3D information. It features cross-category 3D alignment of canonical poses and GPT-assisted natural captions to bridge vision and language for 3D reasoning. The paper defines three tasks—object-level 3D awareness probing, open-vocabulary pose estimation, and joint image classification with category-level pose estimation—and provides baseline results across diverse architectures and modeling paradigms (classification heads, 3D meshes, and LLM-augmented models). Findings show that current models possess partial object-level 3D awareness and can generalize to some novel categories, but significant gaps remain, highlighting the need for scalable general-purpose 3D models and further research into cross-category 3D understanding and language-integrated reasoning.

Abstract

A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains, and fail to generalize. In this work, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, 3D location annotations, and image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we could (i) analyze the object-level 3D awareness of visual foundation models, and (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning.. We consider two new tasks, probing of object-level 3D awareness and open vocabulary pose estimation, besides standard classification and pose estimation. Experimental results on ImageNet3D demonstrate the potential of our dataset in building vision models with stronger general-purpose object-level 3D understanding.
Paper Structure (34 sections, 1 equation, 8 figures, 8 tables)

This paper contains 34 sections, 1 equation, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of ImageNet3D data and annotations. ImageNet3D provides 3D location and viewpoint (i.e., 6D pose) for more than 86,000 objects. We also annotate cross-category 3D alignment for the 200 rigid categories in ImageNet3D. Lastly we generate image captions interleaved with 3D information to integrate unified 3D models with large language models.
  • Figure 2: Meta classes and cross-category 3D alignment. We align the canonical poses of all $200$ categories based on semantic parts, shapes, and common knowledge. This is crucial for models to benefit from joint learning from multiple categories and to generalize to novel categories.
  • Figure 3: Mis-aligned canonical poses in ObjectNet3D xiang2016objectnet3d.
  • Figure 4: Screenshot of our web app for data annotation.
  • Figure 5: Illustration of open vocabulary pose estimation. Open-vocabulary models may utilize large-scale 2D data, vision-language supervision, or our category descriptions to learn transferable features and generalize to novel rigid categories.
  • ...and 3 more figures