TTT-KD: Test-Time Training for 3D Semantic Segmentation through Knowledge Distillation from Foundation Models

Lisa Weijler, Muhammad Jehanzeb Mirza, Leon Sick, Can Ekkazan, Pedro Hermosilla

TL;DR

The paper tackles the problem of generalizing 3D semantic segmentation under distribution shifts by introducing TTT-KD, the first test-time training approach for 3D segmentation that uses knowledge distillation from a fixed 2D foundation model as a self-supervised objective. It jointly trains a 3D backbone with two projectors to align 3D features with both segmentation targets and 2D foundation-model features, enabling per-scene adaptation via gradient steps on the KD loss at test time. The method demonstrates strong gains on indoor and outdoor benchmarks, improving both in-distribution and especially out-of-distribution performance, and shows backbone- and foundation-model-agnostic compatibility with notable improvements over strong TTA baselines. By leveraging world-knowledge embedded in foundation models and performing lightweight, per-scene adaptation, TTT-KD offers a practical path toward robust 3D perception in diverse and evolving environments. The work highlights significant practical impact for robotics and autonomous systems operating across varied, unseen domains without requiring target-domain labels or extensive retraining.

Abstract

Test-Time Training (TTT) proposes to adapt a pre-trained network to changing data distributions on-the-fly. In this work, we propose the first TTT method for 3D semantic segmentation, TTT-KD, which models Knowledge Distillation (KD) from foundation models (e.g., DINOv2) as a self-supervised objective for adaptation to distribution shifts at test time. Given access to paired image-pointcloud (2D-3D) data, we first optimize a 3D segmentation backbone for the main task of semantic segmentation using the pointclouds and for the task of 2D $\to$ 3D KD using an off-the-shelf 2D pre-trained foundation model. At test time, our TTT-KD updates the 3D segmentation backbone for each test sample, using the self-supervised task of knowledge distillation, before performing the final prediction. Extensive evaluations on multiple indoor and outdoor 3D segmentation benchmarks show the utility of TTT-KD, as it improves performance for both in-distribution (ID) and out-of-distribution (OOD) test datasets. We achieve a gain of up to 13% mIoU (7% on average) when the train and test distributions are similar and up to 45% (20% on average) when adapting to OOD test samples.
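
The test-time procedure described in the abstract can be illustrated with a toy NumPy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the backbone and both projectors are single linear maps (`W`, `C`, `P`), the distillation loss is a mean squared error, the teacher features `T` are random stand-ins for per-point foundation-model features, the weights are reset to the source weights for every test sample, and all function names are hypothetical. The sketch only shows the control flow: take a few gradient steps on the KD loss for one test sample, updating the backbone alone, then predict with the adapted weights.

```python
import numpy as np


def kd_loss_and_grad(W, P, X, T):
    """MSE distillation loss and its gradient w.r.t. the backbone weights W.

    X: (N, d_in) point features, W: (d_in, d_h) toy backbone,
    P: (d_h, d_t) toy distillation projector, T: (N, d_t) teacher features.
    """
    R = X @ W @ P - T                              # student minus teacher features
    loss = float(np.mean(np.sum(R ** 2, axis=1)))  # mean squared error per point
    grad_W = (2.0 / len(X)) * X.T @ R @ P.T        # analytic gradient of the loss
    return loss, grad_W


def ttt_kd_predict(W_src, P, C, X, T, steps=10, lr=0.01):
    """Adapt a copy of the backbone on one test sample, then predict labels.

    Only the backbone copy is updated; the projectors P and C stay fixed.
    """
    W = W_src.copy()                               # assumption: start from source weights
    for _ in range(steps):
        _, grad_W = kd_loss_and_grad(W, P, X, T)
        W -= lr * grad_W                           # plain gradient descent step
    logits = X @ W @ C                             # segmentation head on adapted features
    return logits.argmax(axis=1), W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                   # toy "point cloud" features
    T = rng.normal(size=(50, 5))                   # stand-in teacher features
    W_src = rng.normal(size=(3, 4))
    P = 0.5 * rng.normal(size=(4, 5))
    C = rng.normal(size=(4, 6))                    # 6 hypothetical classes

    before, _ = kd_loss_and_grad(W_src, P, X, T)
    labels, W_adapted = ttt_kd_predict(W_src, P, C, X, T, steps=20)
    after, _ = kd_loss_and_grad(W_adapted, P, X, T)
    print("KD loss before:", before, "after:", after)
    print("predicted labels for first points:", labels[:5])
```

Because the source weights are copied rather than modified in place, each test sample in this sketch is adapted independently of the previous one.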

Paper Structure (71 sections, 1 equation, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: TTT-KD. We propose the first test-time training method for 3D semantic segmentation, which adapts to distribution shifts at test time. As shown in the illustration above, our method is able to adapt to Out-of-Distribution (OOD) scenes (ScanNet) that differ from the data the model was trained on (S3DIS). Iteratively, via knowledge distillation from 2D foundation models, our algorithm adjusts the weights of the network, progressively improving the prediction with each step (top). Moreover, our approach is able to significantly improve the predictions with even a single step, without degrading their quality over multiple steps (bottom).
  • Figure 2: Given paired image-pointcloud data of a 3D scene, TTT-KD, during joint training, optimizes the parameters of a point- or voxel-based 3D backbone, $\psi_{3D}$, followed by two projectors, $\rho_{\mathcal{Y}}$ and $\rho_{2D}$. While $\rho_{\mathcal{Y}}$ predicts the semantic label of each point, $\rho_{2D}$ is used for knowledge distillation from a frozen 2D foundation model, $\phi_{2D}$. During test-time training, for each test scene, we perform several optimization steps on the self-supervised task of knowledge distillation to fine-tune the parameters of the 3D backbone. Lastly, during inference, we freeze all parameters of the model to perform the final prediction. By improving on the knowledge distillation task during TTT, the model adapts to out-of-distribution 3D scenes that differ from the source data the model was initially trained on.
  • Figure 3: Qualitative results for two different OOD tasks. The top row presents results for Matterport3D$\to$ScanNet, while the bottom row presents results for ScanNet$\to$Matterport3D. Although other methods are able to adapt slightly to the domain shifts, our TTT-KD algorithm provides more accurate predictions.
  • Figure 4: Ablation studies. (a) Improvement w.r.t. the number of TTT steps used per scene. (b) mIoU of the model w.r.t. the number of images used in our TTT-KD. (c) Comparison of DINOv2 [oquab2023dinov2] to CLIP [radford2021learning] and SAM [kirillov2023segany] as our foundation model.
  • Figure 5: Results obtained when updating the model's parameters using our TTT-KD-O algorithm with stride $s$, i.e., performing a TTT step only every $s$ predictions. Results show that even for large stride values our algorithm produces significant improvements while maintaining low computational cost.
  • ...and 1 more figure
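
The stride schedule mentioned in the Figure 5 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' TTT-KD-O code: `run_online_with_stride`, `ttt_step`, and `predict` are hypothetical names, the model state is assumed to live inside the two callables and to carry over from one sample to the next, and the choice to update on the first sample of each block of `stride` samples (rather than the last) is arbitrary.

```python
from typing import Callable, Iterable, List, TypeVar

Scene = TypeVar("Scene")
Pred = TypeVar("Pred")


def run_online_with_stride(
    scenes: Iterable[Scene],
    ttt_step: Callable[[Scene], None],
    predict: Callable[[Scene], Pred],
    stride: int,
) -> List[Pred]:
    """Predict every sample, but run an adaptation step only every `stride` samples."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    predictions: List[Pred] = []
    for index, scene in enumerate(scenes):
        if index % stride == 0:
            ttt_step(scene)                  # hypothetical self-supervised update
        predictions.append(predict(scene))   # prediction with the current weights
    return predictions


if __name__ == "__main__":
    updated_on = []
    preds = run_online_with_stride(
        scenes=range(7),
        ttt_step=updated_on.append,          # record which samples trigger an update
        predict=lambda scene: f"pred-{scene}",
        stride=3,
    )
    print("updates on samples:", updated_on)
    print("number of predictions:", len(preds))
```

With `stride=1` the loop adapts on every sample; larger values trade adaptation frequency for lower compute, which is the trade-off the caption refers to.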