Objaverse++: Curated 3D Object Dataset with Quality Annotations

Chendi Lin, Heshan Liu, Qunshu Lin, Zachary Bright, Shitao Tang, Yihui He, Minghao Liu, Ling Zhu, Cindy Le

TL;DR

Objaverse++ addresses the quality gap in Objaverse by introducing expert-annotated quality scores and binary traits for 10,000 objects, then scaling annotations to about 500,000 models with a learned classifier. The authors show that pretraining on high-quality subsets improves image-to-3D generation performance and accelerates training convergence, surpassing what is achieved by simply reducing dataset size. They release the annotated subset and demonstrate the approach with quantitative and user-study evidence, arguing that data quality can match or exceed quantity for efficient 3D model training. The work outlines a path toward broader, quality-controlled 3D datasets and notes plans to extend annotations to the entire dataset.

Abstract

This paper presents Objaverse++, a curated subset of Objaverse enhanced with detailed attribute annotations by human experts. Recent advances in 3D content generation have been driven by large-scale datasets such as Objaverse, which contains over 800,000 3D objects collected from the Internet. Although Objaverse represents the largest available 3D asset collection, its utility is limited by the predominance of low-quality models. To address this limitation, we manually annotate 10,000 3D objects with detailed attributes, including aesthetic quality scores, texture color classifications, multi-object composition flags, and transparency characteristics. We then train a neural network to predict these tags for the rest of the Objaverse dataset. Through experiments and a user study on generation results, we demonstrate that models pre-trained on our quality-focused subset achieve better performance than those trained on the larger Objaverse dataset in image-to-3D generation tasks. In addition, by comparing multiple subsets of training data filtered by our tags, we show that the higher the data quality, the faster the training loss converges. These findings suggest that careful curation and rich annotation can compensate for raw dataset size, potentially offering a more efficient path to developing 3D generative models. We release our enhanced dataset of approximately 500,000 curated 3D models to facilitate further research on various downstream tasks in 3D computer vision. In the near future, we aim to extend our annotations to cover the entire Objaverse dataset.
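The abstract describes building training subsets by filtering on the quality score and the binary tags. A minimal sketch of such a filter follows. The record layout, the field names, the 0 to 3 score scale, and the default thresholds are all assumptions made for illustration; they are not taken from the released annotation files.

```python
from dataclasses import dataclass


# Hypothetical record layout: field names are illustrative only and do not
# mirror the released Objaverse++ annotation format.
@dataclass(frozen=True)
class Annotation:
    uid: str
    score: int                     # assumed ordinal scale: 0 = low ... 3 = superior
    is_multi_object: bool = False  # multi-object composition flag
    is_single_color: bool = False  # texture color classification
    is_transparent: bool = False   # transparency characteristic


def select_subset(annotations, min_score=2,
                  exclude=("is_multi_object", "is_single_color")):
    """Return the uids whose score is at least `min_score` and that carry
    none of the binary flags named in `exclude`."""
    return [
        a.uid
        for a in annotations
        if a.score >= min_score
        and not any(getattr(a, flag) for flag in exclude)
    ]
```

Raising `min_score` or adding flags to `exclude` yields a smaller, stricter subset, which is one way the "multiple subsets of training data" mentioned above could be produced.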

Paper Structure

This paper contains 18 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Examples of different quality scores assigned to 3D models. (a) Low Quality: Models with no clear semantic meaning due to lack of identifiable structure. (b) Medium Quality: Identifiable objects but lacking essential material texture and color details. (c) High Quality: Models with clear identity and reasonable aesthetic value, featuring basic textures and colors that convey the character of the object. (d) Superior Quality: Professionally textured models with high semantic clarity and aesthetic harmony.
  • Figure 2: Examples of different binary tags assigned to 3D models. (a) Transparency: Identifies models with see-through parts. (b) Scene: Distinguishes scene-like models from standalone objects, enabling differentiation in 3D model generation suited for environments versus single objects. (c) Single Color: Tags unintentionally monochromatic models, filtering out non-texture-rich objects in texture generation learning. (d) Not a Single Object: Identifies models with multiple separate components, focusing learning on single-object generation tasks. (e) Figure: Marks character or figure models, creating a subset for character generation that may benefit from specialized training.
  • Figure 4: Structure of the annotation network. The network takes 40 different views of 3D models as input, processed by a pre-trained ResNet-50 backbone. Features extracted by the ResNet-50 are passed through an RNN with an attention mechanism to capture spatial dependencies across views. Metadata is fed into a fully connected layer before joining the main pipeline. The combined features are then sent to classification heads for scoring and binary tag predictions.
  • Figure 5: A comparison of image-to-3D generation results from a model trained on 100,000 randomly sampled objects vs. our model.
  • Figure 6: User Study Results. Of the 10 pairs of objects, 8 showed a preference for our model over the baseline. For pairs such as "girl", more than 95% of the participants chose our result. For "mailbox", "hydrant", and "lamp", the majority of the participants chose our generation despite the presence of a no-preference option, indicating that participants judged its quality to be clearly higher than the baseline's.
  • ...and 4 more figures