Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models

Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi

TL;DR

This work tackles the robustness of 3D Vision-Language Foundation Models under distribution shifts by introducing Uni-Adapter, a training-free online test-time adaptation framework. It builds a dynamic, class-specific cache of multiple prototypes to capture intra-class variability, refines pseudo-labels with graph-based label smoothing, and fuses cache predictions with the base model's predictions using entropy-weighted aggregation. The approach achieves state-of-the-art results on corrupted and real-world 3D benchmarks while keeping inference efficient: the cache is lightweight, and the graph smoothing, although iterative, scales through a conjugate gradient solver. The results show strong generalization across small-scale and large-scale 3D datasets and across model families, which is relevant for real-time 3D vision-language systems. Future work may address cache initialization stability with lightweight self-supervised objectives to further improve early adaptation without retraining.
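
To make the "graph smoothing via a conjugate gradient solver" step concrete, here is a minimal sketch of one common formulation, not necessarily the one used in the paper. It assumes the smoothed labels Z solve (I + lam * L) Z = Y, where L is the Laplacian of a prototype affinity matrix W and Y holds the initial pseudo-label scores. The names `laplacian`, `conjugate_gradient`, `smooth_labels`, and `lam` are hypothetical.

```python
def laplacian(W):
    """Unnormalized graph Laplacian L = D - W of a symmetric affinity matrix."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # first search direction
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

def smooth_labels(W, Y, lam=1.0):
    """Assumed objective: solve (I + lam * L) Z = Y, one class column at a time."""
    n = len(W)
    L = laplacian(W)
    A = [[(1.0 if i == j else 0.0) + lam * L[i][j] for j in range(n)]
         for i in range(n)]
    cols = [conjugate_gradient(A, [Y[i][c] for i in range(n)])
            for c in range(len(Y[0]))]
    return [[cols[c][i] for c in range(len(cols))] for i in range(n)]

# Prototypes 0 and 1 are connected; prototype 2 is isolated.
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
Z = smooth_labels(W, Y, lam=1.0)
```

In this toy graph the two connected prototypes pull each other's label scores together, while the isolated prototype keeps its original label. Conjugate gradient only needs matrix-vector products, which is why it suits this kind of sparse, symmetric system.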

Abstract

3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs. Project page: https://mehran-tam.github.io/Uni-Adapter
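
The entropy-weighted aggregation described above can be illustrated with a small sketch. The abstract does not give the weighting function, so the choice of exp(-entropy) as a confidence weight is an assumption made here for illustration, and the names `softmax`, `entropy`, and `fuse` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def fuse(s_main, s_cache):
    """Blend two logit vectors, trusting the lower-entropy source more.

    Assumption: weight = exp(-H), normalized over the two sources.
    """
    w_main = math.exp(-entropy(softmax(s_main)))
    w_cache = math.exp(-entropy(softmax(s_cache)))
    z = w_main + w_cache
    return [(w_main * a + w_cache * b) / z for a, b in zip(s_main, s_cache)]

# The base model is undecided (uniform logits); the cache is confident in class 0.
s_final = fuse([0.0, 0.0, 0.0], [5.0, 0.0, 0.0])
predicted = max(range(len(s_final)), key=lambda k: s_final[k])
```

Here the confident cache prediction dominates the fused logits because its entropy is much lower than that of the uniform base prediction; when the cache is uncertain, the same rule shifts weight back to the base model.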

Paper Structure

This paper contains 28 sections, 13 equations, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) t-SNE of Uni3D embeddings for the airplane class in ModelNet-40C shows clear intra-class clustering patterns. Confidence-based prototypes (triangles) cache only high-confidence samples, while cluster-based prototypes (circles) represent distribution modes via online clustering. (b) In the toy example, confidence-based caching leads to incorrect boundaries due to poor mode coverage, whereas cluster-based caching captures diverse patterns and enables correct predictions.
  • Figure 2: Method Overview. Given a test point cloud $\mathbf{X}_t \in \mathbb{R}^{L \times 3}$, our method extracts a point cloud feature $\mathbf{f}_t$ via a point cloud encoder. The 3D cache is updated via Online Prototyping, where cluster centers serve as 3D prototypes. The Prototype Reassignment module refines these prototypes, and their affinity with $\mathbf{f}_t$ is computed to obtain $\mathbf{s}^{\text{cache}}$. Finally, the prediction logit $\mathbf{s}^{\text{final}}$ is obtained by fusing $\mathbf{s}^{\text{cache}}$ and the model’s base output $\mathbf{s}^{\text{main}}$ using entropy-driven confidence weighting.
  • Figure 3: Cluster- vs. confidence-based caches in Uni-Adapter on ShapeNet-C. Cluster-based caching gives higher accuracy by capturing diverse modes, while confidence-based caching misses much of the class distribution.
  • Figure 4: Ablation on the number of cluster centers (N) and label smoothing ($\lambda_{reg}$) for ModelNet-40C.
  • Figure E1: Performance with respect to the confidence decay hyperparameter ($\beta$) on the combined ModelNet-40C and ShapeNet-C datasets.
  • ...and 1 more figure
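
The cluster-based cache contrasted with confidence-based caching in Figures 1 and 3, and the affinity step in Figure 2, can be sketched roughly as follows. This is an illustrative online clustering scheme under stated assumptions, not the authors' algorithm: each class keeps at most `max_per_class` centers, a new feature either opens a center or updates the nearest one by a running mean, and the cache logit for a class is the highest cosine similarity to any of its centers. The class `PrototypeCache` and all of its members are hypothetical names.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0   # guard against zero vectors
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

class PrototypeCache:
    """Per-class cluster centers updated online (illustrative sketch only)."""

    def __init__(self, max_per_class=2):
        self.max_per_class = max_per_class
        self.centers = {}              # class id -> list of [center, count]

    def update(self, feature, pseudo_label):
        slots = self.centers.setdefault(pseudo_label, [])
        if len(slots) < self.max_per_class:
            slots.append([list(feature), 1])      # open a new cluster
            return
        # Otherwise move the most similar center toward the new feature.
        best = max(slots, key=lambda s: cosine(s[0], feature))
        best[1] += 1
        best[0] = [c + (x - c) / best[1] for c, x in zip(best[0], feature)]

    def logits(self, feature, num_classes):
        # Score each class by its best-matching prototype; empty classes get 0.
        return [max((cosine(c, feature) for c, _ in self.centers.get(k, [])),
                    default=0.0)
                for k in range(num_classes)]

cache = PrototypeCache(max_per_class=2)
cache.update([1.0, 0.0], 0)      # first mode of class 0
cache.update([0.0, 1.0], 0)      # second mode of class 0
cache.update([1.0, 0.2], 0)      # merged into the first mode
cache.update([-1.0, 0.0], 1)     # single mode of class 1
s_cache = cache.logits([1.0, 0.1], num_classes=2)
```

Keeping several centers per class is what lets the cache cover more than one mode of a class, which is the failure case of confidence-only caching shown in the toy example of Figure 1(b).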