OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images

Ye Mao, Junpeng Jing, Krystian Mikolajczyk

TL;DR

OpenDlign is a novel open-world 3D model that uses depth-aligned images generated by a diffusion model for robust multimodal alignment. It achieves high zero-shot and few-shot performance on diverse 3D tasks despite fine-tuning only 6 million parameters on a limited ShapeNet dataset.

Abstract

Recent open-world 3D representation learning methods that use Vision-Language Models (VLMs) to align 3D point clouds with image-text information have shown superior 3D zero-shot performance. However, the CAD-rendered images used for this alignment often lack realism and texture variation, compromising alignment robustness. Moreover, the volume discrepancy between 3D and 2D pretraining datasets highlights the need for effective strategies to transfer the representational abilities of VLMs to 3D learning. In this paper, we present OpenDlign, a novel open-world 3D model using depth-aligned images generated from a diffusion model for robust multimodal alignment. These images exhibit greater texture diversity than CAD renderings due to the stochastic nature of the diffusion model. By refining the depth map projection pipeline and designing depth-specific prompts, OpenDlign leverages the rich knowledge in a pre-trained VLM for 3D representation learning with streamlined fine-tuning. Our experiments show that OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite fine-tuning only 6 million parameters on a limited ShapeNet dataset. In zero-shot classification, OpenDlign surpasses previous models by 8.0% on ModelNet40 and 16.4% on OmniObject3D. Additionally, using depth-aligned images for multimodal alignment consistently enhances the performance of other state-of-the-art models.

Paper Structure

This paper contains 24 sections, 2 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Top: Comparison of OpenDlign with traditional open-world 3D learning models. Depth-based (a) and point-based (b) methods employ additional depth or point encoders for pre-training to align with CAD-rendered images. Conversely, OpenDlign (c) fine-tunes only the image encoder, aligning with vividly colored and textured depth-aligned images for enhanced 3D representation. Both rendered and depth-aligned images are utilized solely during training. Bottom: Visual comparison between multi-view CAD-rendered and corresponding depth-aligned images in OpenDlign.
  • Figure 2: Overview of OpenDlign. In (a), OpenDlign converts point clouds into multi-view depth maps using a contour-aware projection, which then helps generate depth-aligned RGB images with diverse textures, geometrically and semantically aligned with the maps. A transformer block, residually connected to the CLIP image encoder, is fine-tuned to align depth maps with depth-aligned images for robust 3D representation. For zero-shot classification (b), OpenDlign aggregates multi-view logits from both pre-trained and fine-tuned encoders for label prediction. For few-shot classification (c), it employs a logistic regressor trained on multi-view features from the encoders.
  • Figure 3: 3D shape retrieval results. (a) Two most similar shapes for each image query. (b) Most similar shapes for each text query. (c) Two most similar shapes for combined image and text queries.
  • Figure 4: Effect of the number of views on OpenDlign's zero-shot performance.
  • Figure 5: Examples of multi-view depth maps and their corresponding depth-aligned images.
  • ...and 4 more figures
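
The zero-shot step described in the Figure 2 caption, where multi-view logits from the pre-trained and fine-tuned encoders are aggregated for label prediction, can be illustrated with a minimal sketch. The caption does not specify the aggregation rule, so the equal-weight average used here is an assumption, and the function names `aggregate_multiview_logits` and `predict_label` are hypothetical, not part of any OpenDlign code.

```python
from typing import List, Sequence


def aggregate_multiview_logits(
    pretrained_logits: Sequence[Sequence[float]],
    finetuned_logits: Sequence[Sequence[float]],
) -> List[float]:
    """Average per-class logits over all views and both encoders.

    Each argument is indexed as [view][class]. Equal weighting of views and
    encoders is an assumption made for illustration only.
    """
    if not pretrained_logits or len(pretrained_logits) != len(finetuned_logits):
        raise ValueError("both encoders must provide the same non-zero number of views")
    num_classes = len(pretrained_logits[0])
    rows = list(pretrained_logits) + list(finetuned_logits)
    if any(len(row) != num_classes for row in rows):
        raise ValueError("every view must score the same set of classes")
    # Sum each class column across all rows, then divide by the row count.
    return [sum(row[c] for row in rows) / len(rows) for c in range(num_classes)]


def predict_label(
    pretrained_logits: Sequence[Sequence[float]],
    finetuned_logits: Sequence[Sequence[float]],
    class_names: Sequence[str],
) -> str:
    """Return the class name with the highest aggregated logit."""
    scores = aggregate_multiview_logits(pretrained_logits, finetuned_logits)
    if len(scores) != len(class_names):
        raise ValueError("class_names must match the number of logits per view")
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best]


# Toy values for two views and three classes (made up, not results from the paper).
pre = [[2.0, 1.0, 0.0], [1.5, 1.5, 0.0]]
fine = [[1.0, 3.0, 0.5], [0.5, 2.5, 1.0]]
print(predict_label(pre, fine, ["table", "chair", "lamp"]))
```

In this toy input the first pre-trained view alone would favor "table", but the aggregate over both encoders and both views favors "chair", which is the kind of disagreement that combining encoders is meant to resolve.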