3DCoMPaT$^{++}$: An improved Large-scale 3D Vision Dataset for Compositional Recognition

Habib Slim, Xiang Li, Yuchen Li, Mahmoud Ahmed, Mohamed Ayman, Ujjwal Upadhyay, Ahmed Abdelreheem, Arpit Prajapati, Suhail Pothigara, Peter Wonka, Mohamed Elhoseiny

TL;DR

3DCoMPaT++ introduces a large-scale multimodal 2D/3D dataset of 10M stylized shapes across 42 categories, with 275 fine-grained parts, 43 coarse-grained parts, and 293 materials, rendered to yield 160M views. It provides rich part- and material-level annotations in both 2D and 3D, along with texture coordinates and a comprehensive rendering pipeline, enabling the Grounded CoMPaT Recognition (GCR) task that jointly identifies shape categories and their part-material compositions. The paper validates the dataset through 2D/3D classification and segmentation experiments and presents a CVPR 2023 challenge on GCR, offering baselines (e.g., PointNet++, BPNet, SegFormer) and insights into multimodal grounding, while also releasing a toolbox for data access and visualization. This resource advances compositional 3D vision by enabling scalable, multi-view, multimodal analysis of material-part interactions and providing a benchmark for grounding complex shape formulations across modalities.

Abstract

In this work, we present 3DCoMPaT$^{++}$, a multimodal 2D/3D dataset with 160 million rendered views of more than 10 million stylized 3D shapes carefully annotated at the part-instance level, alongside matching RGB point clouds, 3D textured meshes, depth maps, and segmentation masks. 3DCoMPaT$^{++}$ covers 41 shape categories, 275 fine-grained part categories, and 293 fine-grained material classes that can be compositionally applied to parts of 3D objects. We render a subset of one million stylized shapes from four equally spaced views as well as four randomized views, leading to a total of 160 million renderings. Parts are segmented at the instance level, with coarse-grained and fine-grained semantic levels. We introduce a new task, called Grounded CoMPaT Recognition (GCR), to collectively recognize and ground compositions of materials on parts of 3D objects. Additionally, we report the outcomes of a data challenge organized at CVPR 2023, showcasing the winning method's use of a modified PointNet$^{++}$ model trained on 6D inputs, and exploring alternative techniques for GCR enhancement. We hope our work will help ease future research on compositional 3D Vision.
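
To make the GCR output described above more concrete, here is a minimal sketch of how a prediction (a shape category plus a set of part-material pairs) might be represented and compared to a ground-truth label. The `GCRLabel` class, the `shape_correct` and `pair_recall` helpers, and the part and material names are all hypothetical illustrations, not the dataset's API or the benchmark's official metrics. The sketch also ignores the grounding (segmentation) side of the task.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GCRLabel:
    """Hypothetical container for one GCR annotation or prediction."""
    shape_category: str
    # Set of (part, material) pairs composing the shape.
    part_materials: frozenset = field(default_factory=frozenset)


def shape_correct(pred, truth):
    """True when the predicted shape category matches the ground truth."""
    return pred.shape_category == truth.shape_category


def pair_recall(pred, truth):
    """Fraction of ground-truth (part, material) pairs found in the prediction."""
    if not truth.part_materials:
        return 1.0
    hits = len(pred.part_materials & truth.part_materials)
    return hits / len(truth.part_materials)


# Illustrative labels only; these names are assumptions, not dataset classes.
truth = GCRLabel("chair", frozenset({("seat", "leather"), ("leg", "wood"), ("backrest", "fabric")}))
pred = GCRLabel("chair", frozenset({("seat", "leather"), ("leg", "metal"), ("backrest", "fabric")}))

print(shape_correct(pred, truth), pair_recall(pred, truth))
```

In this toy case the shape category is right and two of the three part-material pairs match. A real evaluation would also need to score where each pair is located in the 2D or 3D input.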

Paper Structure

This paper contains 12 sections, 7 equations, 26 figures, 6 tables, 1 algorithm.

Figures (26)

  • Figure 1: Data provided for each stylized shape. For 3D: RGB point clouds, textured shapes, and point-wise/triangle-wise part labels and material labels. For 2D: RGB images, depth maps, and corresponding part masks and material masks. Part and material annotations in 2D and 3D are provided at both coarse and fine semantic levels. Figure \ref{fig:style_variants} of the appendix shows additional style variants for various shapes.
  • Figure 2: Grounded CoMPaT Recognition (GCR). Given an input shape, here: a chair, the task consists of (a) recognizing the shape category and (b) segmenting the part-material pairs composing it.
  • Figure 3: Comparison with PartNet. Left: distribution of part instances per shape, compared to PartNet \cite{mo_partnet_2018}. Right: density plots of vertex, edge, and face counts across 3D shapes from the 3DCoMPaT++ and ShapeNet datasets. 3DCoMPaT++ shapes have significantly higher numbers of vertices, edges, and faces than ShapeNet shapes. All annotated shapes in PartNet originate from ShapeNet.
  • Figure 4: Detailing the data pipeline of 3DCoMPaT++. Starting from the collection of 3D shapes, we perform a first editing step consisting of model re-scaling, UV map correction and removal of undesirable meshes. Material compatibility information is also collected for each part of each shape, alongside the shape category. Shapes are then annotated at a fine-grained part-instance level, and part names are iteratively refined and uniformized using a web-based shape visualizer. Misaligned shapes are semi-automatically realigned using part annotations as a prior. Finally, we sample a set of materials for each part of each shape, and render each stylized shape from multiple viewpoints.
  • Figure 5: Renderings of randomly sampled shapes from 3DCoMPaT++. The dataset comprises a rich collection of stylized 3D shapes annotated at the part-instance level. These renderings demonstrate the varying shapes, styles, and materials that are captured, enabling comprehensive exploration and analysis of compositional 3D vision tasks. Shapes are consistently aligned across classes, and orientations are consistent for all 3D models. In the left circle, we illustrate the untextured 3D geometries we start from as a reference. We provide additional reference shapes from all 42 shape categories in Figure \ref{fig:all_classes} of the appendix.
  • ...and 21 more figures
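
The final step in the Figure 4 caption (sampling a set of materials for each part, using the collected compatibility information) can be sketched as follows. This is a minimal illustration under stated assumptions: the `compatible` mapping, the function names, and the uniform random choice are hypothetical and are not taken from the released toolbox or the authors' pipeline.

```python
import random


def sample_style(compatible, rng):
    """Pick one material per part from that part's list of compatible materials."""
    return {part: rng.choice(materials) for part, materials in compatible.items()}


def sample_styles(compatible, n_styles, seed=0):
    """Draw n_styles independent styles; uniform sampling is an assumption."""
    rng = random.Random(seed)
    return [sample_style(compatible, rng) for _ in range(n_styles)]


# Hypothetical compatibility table for one shape (part -> allowed materials).
compatible = {
    "seat": ["leather", "fabric"],
    "leg": ["wood", "metal"],
}

styles = sample_styles(compatible, n_styles=3)
for style in styles:
    print(style)
```

Each resulting dictionary is one stylized variant of the same geometry. The draws are independent, so repeated styles are possible in this sketch.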