Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding

Guofeng Mei, Bin Ren, Juan Liu, Luigi Riz, Xiaoshui Huang, Xu Zheng, Yongshun Gong, Ming-Hsuan Yang, Nicu Sebe, Fabio Poiesi

TL;DR

The paper introduces S4Token, a scale-invariant, superpoint-guided 3D tokenizer that connects point clouds to frozen 2D foundation models such as CLIP. It combines structure-aware oversegmentation, relative-position normalization, and self-supervised teacher-student pretraining with cross-modal distillation to produce transferable 3D tokens for ViTs, and adds a superpoint-aware feature propagation module for dense prediction. The reported results show strong annotation-free performance in open-vocabulary part and semantic segmentation and in zero-shot classification, along with cross-domain generalization from ShapeNet to ScanNet and S3DIS, all with the CLIP backbone kept frozen. The work identifies the tokenizer as a key bottleneck and presents a modular, scalable interface for label-efficient 3D learning that needs neither annotations nor backbone fine-tuning, with potential applications in robotics and AR/VR.

Abstract

Vision-language models like CLIP can offer a promising foundation for 3D scene understanding when extended with 3D tokenizers. However, standard approaches, such as k-nearest neighbor or radius-based tokenization, struggle with cross-domain generalization because they are sensitive to dataset-specific spatial scales. We present a universal 3D tokenizer designed for scale-invariant representation learning with a frozen CLIP backbone. Through extensive experimental analysis, we show that combining superpoint-based grouping with coordinate scale normalization consistently outperforms conventional methods. Specifically, we introduce S4Token, a tokenization pipeline that produces semantically informed tokens regardless of scene scale. Our tokenizer is trained without annotations using masked point modeling and clustering-based objectives, along with cross-modal distillation to align 3D tokens with 2D multi-view image features. For dense prediction tasks, we propose a superpoint-level feature propagation module to recover point-level detail from sparse tokens.
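
To illustrate why coordinate scale normalization helps with dataset-specific spatial scales, the sketch below centers a patch on its centroid and rescales it to unit radius. This is only one common normalization and is an assumption here: the abstract does not state the exact formula S4Token uses, and the function name `normalize_patch` is hypothetical.

```python
import math

def normalize_patch(points):
    """Center a patch on its centroid and rescale it to unit radius.

    Assumed normalization for illustration; not taken from the paper.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    radius = max(math.sqrt(sum(c * c for c in q)) for q in centered)
    if radius == 0.0:  # degenerate patch: all points coincide
        return centered
    return [[c / radius for c in q] for q in centered]

# The same local shape observed at two different scene scales.
patch = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
scaled = [[10.0 * c for c in p] for p in patch]

a = normalize_patch(patch)
b = normalize_patch(scaled)

# Largest coordinate difference between the two normalized patches.
max_diff = max(abs(x - y) for p, q in zip(a, b) for x, y in zip(p, q))
```

After normalization the two patches coincide up to floating-point error, so a tokenizer that consumes the normalized coordinates sees the same input whether the scene is measured in meters or in arbitrary object units.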

Paper Structure

This paper contains 35 sections, 15 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Architecture of the proposed S4Token. The teacher generates pseudo assignments via clustering over encoder features, while the student reconstructs masked features using a query decoder and predicts assignment distributions. An assignment loss $\mathcal{L}_{\text{assign}}$ aligns the student’s predictions with the teacher’s pseudo assignments. Additionally, a distillation loss $\mathcal{L}_{\text{distill}}$ aligns 3D patch features with their CLIP counterparts extracted from multi-view images. Icons in the figure mark each module as trained, frozen, or updated with Exponential Moving Average (EMA) after each iteration. Colors: blue for the teacher, orange for the student.
  • Figure 2: Part segmentation results on ShapeNet [chang2015shapenet] comparing our S4Token (bottom row) using the ViT encoder with PointCLIP V2 [zhu2023pointclip] and ground-truth annotations (top row).
  • Figure A: Visualization of patch generalization results on ScanNet [dai2017scannet] using different grouping strategies ($k$NN vs. S4Token). Top row: instance segmentation (ground truth). Middle row: patch grouping using $k$NN, following [PointMAE, PointBERT]. Bottom row: our S4Token, guided by the superpoint structure. Compared to $k$NN, S4Token produces more compact and semantically consistent patches that better align with object boundaries and scene structure.
  • Figure B: Effect of the weighting exponent $\gamma$ on WFPS, with $\gamma$ varying from $0$ to $1$. The top row shows the anchor points selected by WFPS for different values of $\gamma$, while the bottom row visualizes the corresponding patch groupings formed around those anchors. When $\gamma = 0$, WFPS reduces to classical FPS, producing a nearly uniform, purely position-driven subset that tends to overlook small superpoints. As $\gamma$ increases, the sampling becomes progressively biased toward smaller segments. In the extreme case where $\gamma \rightarrow 1$, this bias may cause large regions to be underrepresented, although basic geometric coverage is still maintained. WFPS thus provides a tunable trade-off between uniform spatial coverage and instance-aware sampling. Empirically, moderate values (e.g., $0.2 \leq \gamma \leq 0.6$) tend to achieve the best balance between geometric regularity and semantic relevance.
  • Figure C: Part segmentation results on ShapeNet [chang2015shapenet] comparing our S4Token (bottom row) using the ViT encoder with PointCLIP V2 [zhu2023pointclip] and ground-truth annotations (top row).
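
The behavior of WFPS described in the Figure B caption can be sketched as follows. The weighting rule is an assumption, since the caption does not give the formula: each point's distance to the already selected set is multiplied by the size of its superpoint raised to the power $-\gamma$. This choice reproduces the two properties the caption states ($\gamma = 0$ gives classical FPS, and larger $\gamma$ biases sampling toward smaller segments) but may differ from the paper's definition. The names `wfps`, `sp_ids`, and `start` are hypothetical.

```python
import math

def wfps(points, sp_ids, num_samples, gamma=0.0, start=0):
    """Weighted farthest point sampling (illustrative variant, not the paper's code).

    Each point's distance to the selected set is scaled by
    (size of its superpoint) ** -gamma, so gamma = 0 is classical FPS
    and larger gamma favors points that lie in small superpoints.
    """
    # Count how many points belong to each superpoint.
    sizes = {}
    for s in sp_ids:
        sizes[s] = sizes.get(s, 0) + 1
    weights = [sizes[s] ** (-gamma) for s in sp_ids]

    selected = [start]
    # Distance from every point to its nearest selected anchor.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < num_samples:
        scores = [d * w for d, w in zip(min_dist, weights)]
        nxt = max(range(len(points)), key=scores.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected

# Six points of a large superpoint (id 0) on a line, plus one point
# forming a small superpoint (id 1) off to the side.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0), (5, 0, 0),
          (2.5, 1, 0)]
sp_ids = [0, 0, 0, 0, 0, 0, 1]

fps_anchors = wfps(points, sp_ids, num_samples=2, gamma=0.0)    # -> [0, 5]
wfps_anchors = wfps(points, sp_ids, num_samples=2, gamma=1.0)   # -> [0, 6]
```

With $\gamma = 0$ the second anchor is the geometrically farthest point, and the small superpoint is skipped. With $\gamma = 1$ the small superpoint's point wins despite being closer, which matches the trade-off between uniform coverage and instance-aware sampling that the caption describes.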