Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models
Mehran Tamjidi, Hamidreza Dastmalchi, Mohammadreza Alimoradijazi, Ali Cheraghian, Aijun An, Morteza Saberi
TL;DR
This work addresses the robustness of 3D Vision-Language Foundation Models (VLFMs) under distribution shift by introducing Uni-Adapter, a training-free online test-time adaptation framework. Uni-Adapter builds a dynamic, class-specific cache that holds multiple prototypes per class to capture intra-class variability, refines pseudo-labels with graph-based label smoothing, and fuses the cache predictions with those of the base model through entropy-weighted aggregation. It reports state-of-the-art results on corrupted and real-world 3D benchmarks while keeping inference efficient: the cache is lightweight, and the graph smoothing step, although iterative, scales through a conjugate gradient solver. The results indicate strong generalization across small- and large-scale 3D datasets and across model families, which matters for real-time 3D vision-language systems. Future work may stabilize cache initialization with lightweight self-supervised objectives to improve early adaptation without retraining.
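To make the graph smoothing step concrete, the sketch below solves a classical label-propagation system, (I - alpha * S) Z = Y, with a hand-written conjugate gradient loop. This formulation is an assumption chosen for illustration, not necessarily the exact objective used by Uni-Adapter. The function names, the clipped cosine affinity, and the value of alpha are likewise hypothetical.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=200):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x, with x = 0
    p = r.copy()          # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def smooth_labels(prototypes, soft_labels, alpha=0.5):
    """Smooth prototype pseudo-labels over a similarity graph (assumed form)."""
    feats = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    W = np.clip(feats @ feats.T, 0.0, None)   # non-negative cosine affinity
    np.fill_diagonal(W, 0.0)                  # no self-loops
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # symmetric normalization
    A = np.eye(len(W)) - alpha * S            # SPD for 0 < alpha < 1
    cols = [conjugate_gradient(A, soft_labels[:, c])
            for c in range(soft_labels.shape[1])]
    return np.column_stack(cols)

# Two clusters of two prototypes each; prototype 1 carries a noisy label
# that leans toward class 1 although its only neighbor is class 0.
protos = np.array([[1.0, 0.2, 0.0, 0.0],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.2],
                   [0.0, 0.0, 1.0, 0.0]])
labels = np.array([[1.0, 0.0], [0.4, 0.6], [0.0, 1.0], [0.0, 1.0]])
smoothed = smooth_labels(protos, labels)
print(smoothed.argmax(axis=1))  # prototype 1 is pulled back to class 0
```

In this toy graph the noisy prototype inherits the label of its confident neighbor, which is the kind of consistency among similar prototypes that the smoothing module is meant to enforce. A conjugate gradient solve avoids forming an explicit matrix inverse, which is presumably why it scales as the cache grows.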
Abstract
3D Vision-Language Foundation Models (VLFMs) have shown strong generalization and zero-shot recognition capabilities in open-world point cloud processing tasks. However, these models often underperform in practical scenarios where data are noisy, incomplete, or drawn from a different distribution than the training data. To address this, we propose Uni-Adapter, a novel training-free online test-time adaptation (TTA) strategy for 3D VLFMs based on dynamic prototype learning. We define a 3D cache to store class-specific cluster centers as prototypes, which are continuously updated to capture intra-class variability in heterogeneous data distributions. These dynamic prototypes serve as anchors for cache-based logit computation via similarity scoring. Simultaneously, a graph-based label smoothing module captures inter-prototype similarities to enforce label consistency among similar prototypes. Finally, we unify predictions from the original 3D VLFM and the refined 3D cache using entropy-weighted aggregation for reliable adaptation. Without retraining, Uni-Adapter effectively mitigates distribution shifts, achieving state-of-the-art performance on diverse 3D benchmarks over different 3D VLFMs, improving ModelNet-40C by 10.55%, ScanObjectNN-C by 8.26%, and ShapeNet-C by 4.49% over the source 3D VLFMs. Project page: https://mehran-tam.github.io/Uni-Adapter
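The cache-based logits and the entropy-weighted aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-class logit is taken as the highest cosine similarity to any cached prototype of that class, each prediction is weighted by exp(-entropy), and the fusion is applied to logits. The function names and the `scale` parameter are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def cache_logits(feature, prototypes, proto_labels, num_classes, scale=10.0):
    """Per-class logit: best cosine similarity to a cached prototype (assumed)."""
    f = feature / np.linalg.norm(feature)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = P @ f
    logits = np.full(num_classes, -1.0)   # lowest cosine for classes with no prototype
    for sim, c in zip(sims, proto_labels):
        logits[c] = max(logits[c], sim)
    return scale * logits

def entropy_weighted_fusion(base_logits, cache_logits_):
    """Weight each source by exp(-entropy) so the more confident one dominates."""
    w_base = np.exp(-entropy(softmax(base_logits)))
    w_cache = np.exp(-entropy(softmax(cache_logits_)))
    return (w_base * base_logits + w_cache * cache_logits_) / (w_base + w_cache)

# The base model is nearly undecided and leans toward class 0;
# the cache holds a class-1 prototype close to the test feature.
feature = np.array([0.0, 1.0])
prototypes = np.array([[1.0, 0.0], [0.1, 1.0]])
proto_labels = [0, 1]
base = np.array([0.2, 0.0])
cached = cache_logits(feature, prototypes, proto_labels, num_classes=2)
fused = entropy_weighted_fusion(base, cached)
print(int(np.argmax(base)), int(np.argmax(fused)))  # base picks 0, fused picks 1
```

Because the cache prediction has much lower entropy than the base prediction in this example, it receives the larger weight and the fused decision follows the cache. When the cache is uncertain, for instance early in the stream, the same rule shifts weight back toward the base model.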
