TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models
Lisa Weijler, Muhammad Jehanzeb Mirza, Leon Sick, Can Ekkazan, Pedro Hermosilla
TL;DR
TTT-KD addresses the generalization of 3D semantic segmentation under distribution shift. It is the first test-time training approach for 3D segmentation, and it uses knowledge distillation from a fixed 2D foundation model as its self-supervised objective. During training, a 3D backbone is optimized jointly with two projectors that align its features with the segmentation targets and with the 2D foundation-model features, respectively. At test time, each scene is adapted individually by taking gradient steps on the KD loss. The method improves both in-distribution and, especially, out-of-distribution performance on indoor and outdoor benchmarks, is agnostic to the choice of backbone and foundation model, and shows notable improvements over strong TTA baselines. Because it draws on the world knowledge embedded in foundation models and needs only lightweight per-scene adaptation, without target-domain labels or extensive retraining, TTT-KD offers a practical path toward robust 3D perception for robotics and autonomous systems operating in varied, unseen domains.
Abstract
Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g., DINOv2) as a self-supervised objective for adaptation to distribution shifts at test-time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone jointly for the main task of semantic segmentation on the pointclouds and for the task of 2D $\to$ 3D KD, using an off-the-shelf pre-trained 2D foundation model. At test-time, TTT-KD updates the 3D segmentation backbone for each test sample using the self-supervised KD task before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (OOD) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar and up to 45% (20% on average) when adapting to OOD test samples.
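The per-sample adaptation described above can be illustrated with a small toy sketch. It is not the authors' implementation. It assumes a linear "backbone" `W`, a linear KD projector `P`, a linear segmentation head `S`, a mean-squared-error KD loss, and plain gradient descent. All of these names, shapes, and hyperparameters are hypothetical stand-ins for a real 3D network and real 2D foundation-model features.

```python
import numpy as np

def kd_loss_and_grad(X, W, P, F2d):
    """Toy KD objective (assumed MSE): match projected 3D features to 2D teacher features.

    X:   (n_points, d_in)   per-point input features of one scene
    W:   (d_in, d_feat)     linear stand-in for the 3D backbone
    P:   (d_feat, d_2d)     KD projector into the teacher feature space
    F2d: (n_points, d_2d)   teacher features from the fixed 2D model
    """
    R = X @ W @ P - F2d                        # residual to the teacher
    loss = float(np.mean(R ** 2))
    grad_W = (2.0 / R.size) * (X.T @ R @ P.T)  # d(loss)/dW
    return loss, grad_W

def ttt_kd_predict(X, F2d, W, P, S, steps=100, lr=0.01):
    """Adapt a per-scene copy of the backbone on the KD loss, then segment."""
    W_scene = W.copy()                         # trained weights stay untouched
    for _ in range(steps):
        _, grad = kd_loss_and_grad(X, W_scene, P, F2d)
        W_scene -= lr * grad                   # self-supervised update, no labels used
    logits = X @ W_scene @ S                   # final prediction after adaptation
    return logits.argmax(axis=1), W_scene

# Synthetic scene: 64 points, 6 input dims, 8 backbone dims, 4 teacher dims, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))
F2d = rng.normal(size=(64, 4))
W = 0.1 * rng.normal(size=(6, 8))
P = rng.normal(size=(8, 4))
S = rng.normal(size=(8, 3))

labels, W_scene = ttt_kd_predict(X, F2d, W, P, S)
```

In this sketch only the backbone copy is updated while the projector and segmentation head are left unchanged, and the copy is discarded after the scene is processed. Both are simplifying assumptions made for illustration; the abstract specifies only that the backbone is updated for each test sample before the final prediction.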
