Occlusion-aware Text-Image-Point Cloud Pretraining for Open-World 3D Object Recognition

Khanh Nguyen, Ghulam Mubashar Hassan, Ajmal Mian

TL;DR

Occlusion-aware Text-Image-Point Cloud Pretraining (OccTIP) addresses the domain gap between synthetic full-point-cloud pretraining and real-world occluded data by generating occluded partial point clouds from synthetic models. It couples this with a two-stream, linear-time DuoMamba architecture that uses two space-filling curves and standard 1D convolutions to efficiently model 3D geometry, significantly reducing inference cost compared to Transformer-based encoders. Through cross-modal contrastive learning across text, image, and partial point clouds, OccTIP achieves state-of-the-art or competitive performance on zero-shot, few-shot, and zero-shot detection benchmarks, while reducing FLOPs and latency. The framework demonstrates strong real-world robustness, data efficiency, and practical potential for open-world 3D recognition in robotics and vision systems.
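
As a rough illustration of the cross-modal contrastive objective mentioned above, the sketch below computes a symmetric InfoNCE loss between point-cloud features and their paired image and text features. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the temperature of 0.07, and the equal weighting of the image and text terms are all assumptions, and in practice the features would come from a trainable point cloud encoder and frozen CLIP encoders.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to targets[i] within a batch."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every target, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(a)


def symmetric_contrastive(x, y, temperature=0.07):
    """Average the x-to-y and y-to-x InfoNCE terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def tri_modal_loss(point_feats, image_feats, text_feats, temperature=0.07):
    """Align point features with image and text features (equal weights assumed)."""
    return 0.5 * (
        symmetric_contrastive(point_feats, image_feats, temperature)
        + symmetric_contrastive(point_feats, text_feats, temperature)
    )
```

With correctly paired features the loss is close to zero; shuffling one modality so that the pairs no longer line up drives it up, which is the signal that pulls a partial point cloud toward its own image and caption.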

Abstract

Recent open-world representation learning approaches have leveraged CLIP to enable zero-shot 3D object recognition. However, performance on real point clouds with occlusions still falls short due to unrealistic pretraining settings. Additionally, these methods incur high inference costs because they rely on Transformer attention modules. In this paper, we make two contributions to address these limitations. First, we propose occlusion-aware text-image-point cloud pretraining to reduce the training-testing domain gap. From 52K synthetic 3D objects, our framework generates nearly 630K partial point clouds for pretraining, consistently improving the real-world recognition performance of existing popular 3D networks. Second, to reduce computational requirements, we introduce DuoMamba, a two-stream linear state space model tailored for point clouds. By integrating two space-filling curves with 1D convolutions, DuoMamba effectively models spatial dependencies between point tokens, offering a powerful alternative to Transformers. When pretrained with our framework, DuoMamba surpasses current state-of-the-art methods while reducing latency and FLOPs, highlighting the potential of our approach for real-world applications. Our code and data are available at https://ndkhanh360.github.io/project-occtip.
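
To make the role of a space-filling curve concrete, the sketch below orders an unordered 3D point set into a 1D sequence, which is the form a 1D convolution or a state space model can consume. Note the substitution: it uses a Z-order (Morton) curve rather than the Hilbert curves named for DuoMamba, purely because Morton keys fit in a few lines; Hilbert curves generally preserve spatial locality better. The function names, the grid resolution, and the use of a permuted axis order to obtain a second ordering are assumptions for illustration, not the paper's design.

```python
def _spread_bits(value, bits):
    """Move bit i of value to position 3*i, leaving room for two other axes."""
    out = 0
    for i in range(bits):
        out |= ((value >> i) & 1) << (3 * i)
    return out


def serialize(points, bits=4, axis_order=(0, 1, 2)):
    """Return point indices sorted along a Z-order curve on a (2**bits)^3 grid.

    axis_order picks which axis occupies the lowest interleaved bit, so two
    different permutations give two different 1D orderings of the same points.
    """
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    scale = (1 << bits) - 1
    keys = []
    for p in points:
        # Quantize each coordinate to an integer grid cell in [0, 2**bits - 1].
        cell = [
            int((p[a] - lo[a]) / ((hi[a] - lo[a]) or 1.0) * scale)
            for a in range(3)
        ]
        key = 0
        for slot, a in enumerate(axis_order):
            key |= _spread_bits(cell[a], bits) << slot
        keys.append(key)
    return sorted(range(len(points)), key=keys.__getitem__)


points = [(1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 0)]
order_a = serialize(points, bits=1, axis_order=(0, 1, 2))
order_b = serialize(points, bits=1, axis_order=(1, 0, 2))
```

The two orderings visit the same points in different sequences, so each stream would see a different set of 1D neighbours for any given point token.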

Paper Structure

This paper contains 28 sections, 7 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Comparison to existing methods. (a) State-of-the-art approaches pretrain 3D encoders on complete point clouds, which differ significantly from occluded ones in practical scenarios (top). This leads to a substantial gap in zero-shot performance between the ModelNet40 benchmark with full point clouds and ScanObjectNN with real-world data (bottom). (b) The proposed framework OccTIP pretrains 3D models on partial point clouds to better simulate practical conditions, leading to significant improvements on various recognition tasks, especially when combined with our DuoMamba architecture. (c) Compared to the popular PointBERT, DuoMamba has significantly lower FLOPs (top) and latency (bottom) during inference, making it better suited for real-world applications.
  • Figure 2: Overview of our OccTIP pretraining framework. (a) Given a 3D object, we generate RGB and depth images from preset camera positions, which are used to construct partial point clouds. Texts are generated from dataset metadata, image captioning models (BLIP), and retrieved descriptions of similar photos from LAION-5B. (b) During pretraining, we extract multi-modal features using a learnable point cloud network and frozen CLIP encoders, then align them through contrastive learning.
  • Figure 3: Overview of the proposed architecture and detailed design of our DuoMamba block. We integrate two Hilbert curves and standard 1D convolutions with linear-time S6 (Mamba) modules to efficiently model geometric dependencies and enrich spatial context.
  • Figure 4: t-SNE visualization of ScanObjectNN features extracted by different pretraining methods. Compared to other approaches based on complete point clouds, our method OccTIP achieves clearer class separation and significantly reduces overlap between classes.
  • Figure 5: Comparison of model size and zero-shot accuracy on ScanObjectNN. Our model is pretrained on 52K ShapeNetCore objects, whereas all other approaches are pretrained on an ensemble of 880K objects from four datasets: Objaverse, ABO, 3D-FUTURE, and ShapeNetCore. Despite being pretrained on a less diverse set of objects and having the smallest size, DuoMamba demonstrates competitive performance. Among models with fewer than 50M parameters (DuoMamba, PointBERT, SparseConv), our model outperforms all others by a significant margin of 3% in zero-shot accuracy. While Uni3D-giant achieves a slightly higher accuracy with a gap of 1.8%, it comes at the cost of a substantially larger model size, with 1016.5M parameters, 35 times the size of DuoMamba. This highlights the optimal balance between model size and performance offered by our method compared to existing approaches.
  • ...and 2 more figures
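
The Figure 2 caption describes building partial point clouds from depth images rendered at preset camera positions. A minimal sketch of that step, assuming a standard pinhole camera model, is shown below. The function name, the intrinsics (fx, fy, cx, cy), and the convention that a depth of 0 marks background pixels are assumptions for illustration; the actual rendering and reconstruction pipeline is not specified in this overview.

```python
def depth_to_partial_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-frame 3D points.

    depth is a list of rows; each entry is the distance along the optical
    axis, with 0 (or a negative value) marking pixels that hit no surface.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # background pixel: nothing was visible here
            # Invert the pinhole projection u = fx * x / z + cx (same for v).
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A toy 2x2 depth map with two visible pixels.
cloud = depth_to_partial_cloud(
    [[0.0, 2.0], [1.0, 0.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5
)
```

Because a single depth image only records the surfaces visible from one viewpoint, the resulting cloud is partial by construction: anything hidden behind the visible surface is missing, which mimics the self-occlusion seen in real scans.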