Interactive Test-Time Adaptation with Reliable Spatial-Temporal Voxels for Multi-Modal Segmentation

Haozhi Cao; Yuecong Xu; Pengyu Yin; Xingyu Ji; Shenghai Yuan; Jianfei Yang; Lihua Xie

TL;DR

The paper tackles the problem of unstable and biased predictions in multi-modal test-time adaptation for 3D semantic segmentation. It introduces Latte++ to robustly estimate spatial-temporal voxel reliability via multi-window aggregation and ITTA to incorporate minimal human feedback through a promptable branch and momentum gradient, enabling effective online correction. Across five benchmarks, Latte++ delivers strong cross-modal gains and ITTA yields consistent improvements for challenging, imbalanced classes, with complementary benefits observed when combined. The work demonstrates the practicality of online, interactive, cross-modal adaptation for robust perception in changing environments, potentially influencing real-time autonomous systems. All techniques are validated with extensive ablations and qualitative results, underscoring the value of temporal consistency and user-guided refinement in MM-TTA.

Abstract

Multi-modal test-time adaptation (MM-TTA) adapts models to an unlabeled target domain by leveraging the complementary multi-modal inputs in an online manner. While previous MM-TTA methods for 3D segmentation offer a promising solution by leveraging self-refinement per frame, they suffer from two major limitations: 1) unstable frame-wise predictions caused by temporal inconsistency, and 2) consistently incorrect predictions that violate the assumption of reliable modality guidance. To address these limitations, this work introduces a comprehensive two-fold framework. Firstly, building upon our previous work ReLiable Spatial-temporal Voxels (Latte), we propose Latte++ that better suppresses the unstable frame-wise predictions with more informative geometric correspondences. Instead of utilizing a universal sliding window, Latte++ employs multi-window aggregation to capture more reliable correspondences to better evaluate the local prediction consistency of different semantic categories. Secondly, to tackle the consistently incorrect predictions, we propose Interactive Test-Time Adaptation (ITTA), a flexible add-on to empower effortless human feedback with existing MM-TTA methods. ITTA introduces a novel human-in-the-loop approach that efficiently integrates minimal human feedback through interactive segmentation, requiring only simple point clicks and bounding box annotations. Instead of using independent interactive networks, ITTA employs a lightweight promptable branch with a momentum gradient module to capture and reuse knowledge from scarce human feedback during online inference. Extensive experiments across five MM-TTA benchmarks demonstrate that ITTA achieves consistent and notable improvements with robust performance gains for target classes of interest in challenging imbalanced scenarios, while Latte++ provides complementary benefits for temporal stability.

Paper Structure (30 sections, 19 equations, 12 figures, 10 tables, 1 algorithm)

Introduction
Related Works
Methodology
Latte and Latte++
Frame-wise Predictions from Students and Teachers
Sliding-Window Aggregation and Voxelization
Spatial-Temporal Voxels and Entropy
Latte++: Multi-Window Aggregation and Intra-Modal Attending
ST Voxel Aided Cross-Modal Learning
Online Predictions and Optimization
Interactive Test-Time Adaptation
Prior to Interactive Test-Time Adaptation
Online Optimization Objectives and Momentum Grad
Experimental Results
Benchmarks and settings
...and 15 more sections

Figures (12)

Figure 1: Illustration of two types of challenging noisy predictions and our proposed methods. (a) shows the unstable frame-wise predictions between consecutive frames from the state-of-the-art MM-TTA method [shin2022mm], highlighted in white boxes. To suppress this frame-wise instability, our Latte and Latte++ achieve reliable cross-modal attending by estimating Spatial-Temporal (ST) entropy (i.e., $E^{\mathrm{2D}}$ and $E^{\mathrm{3D}}$) within each ST voxel. For (b) consistently incorrect predictions (e.g., the consistently misclassified pedestrian in white boxes), we propose Interactive Test-Time Adaptation (ITTA), which leverages effortless human feedback for instant refinement and continual regularization.
Figure 2: Overall structure of Latte++. Taking a student prediction frame of one modality as the query input, our sliding-window aggregation searches for its spatial-temporal correspondences through voxelization within a time window to establish the temporally local prediction consistency. ST voxels are then generated: those with high ST entropy (larger than the $\alpha$-quantile) are discarded as unreliable correspondences, while the others are leveraged for adaptive cross-modal learning by attending to the modality with lower ST entropy in a voxel-wise manner.
Figure 3: Overall pipeline of ITTA. Before the online adaptation (left), a promptable branch is created by inserting a lightweight bottleneck after the encoder (ENC) of the 2D network; only the bottleneck is then warmed up by distilling knowledge from interactive visual foundation models (e.g., SAM [kirillov2023segment]). Meanwhile, class-wise feature centroids are generated by feature inversion using the pre-trained classifier. During the online adaptation process (right), when human prompts are received, the pre-trained promptable branch decodes a corresponding instance mask. We capture the knowledge in this mask by recording its loss and gradient, and reuse it to update the network in the following iterations.
Figure 4: Comparison of single-modal predictions (mIoU) on K-to-S and K-to-W. We report the performance of Latte++ for each sequence (highlighted in red if there is any improvement), followed in parentheses by the absolute gap to the previous second-highest performance on that sequence.
Figure 5: Ablation studies of different frame aggregation mechanisms. Besides (a) Latte's sliding-window aggregation, we test two different aggregation methods, including (b) replacing single-frame student predictions with multiple frames in Equation (\ref{Eq: ST Voxel Reference}) and (c) adding non-overlapping aggregation to (b). Results show that our sliding-window aggregation can better evaluate the local consistency, which leads to more consistent improvement.
...and 7 more figures
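
The Figure 2 caption describes three steps: voxelizing aggregated predictions, scoring each ST voxel by entropy, and attending to the lower-entropy modality after discarding voxels above the $\alpha$-quantile. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes the points from the time window are already pose-aligned, that ST entropy is the Shannon entropy of the mean class distribution inside a voxel, and that the quantile filter is applied per modality; the function names `st_voxel_entropy` and `cross_modal_attend` and the parameter `voxel_size` are hypothetical.

```python
import numpy as np


def st_voxel_entropy(points, probs, voxel_size):
    """Group points into voxels and score each voxel's prediction consistency.

    points: (N, 3) aggregated, pose-aligned coordinates (assumed input).
    probs:  (N, C) per-point softmax predictions of one modality.
    Returns (voxel index per point, entropy per voxel).
    """
    # Integer voxel coordinates; points sharing a key fall in the same voxel.
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions
    n_vox = int(inverse.max()) + 1

    # Mean class distribution of all predictions inside each voxel.
    sums = np.zeros((n_vox, probs.shape[1]))
    np.add.at(sums, inverse, probs)
    counts = np.bincount(inverse, minlength=n_vox)[:, None]
    mean = sums / counts

    # Assumption: ST entropy is the Shannon entropy of that mean distribution.
    entropy = -(mean * np.log(mean + 1e-12)).sum(axis=1)
    return inverse, entropy


def cross_modal_attend(e2d, e3d, alpha):
    """Filter unreliable voxels and pick the modality to attend to.

    Assumption: each modality is thresholded at its own alpha-quantile.
    """
    keep_2d = e2d <= np.quantile(e2d, alpha)
    keep_3d = e3d <= np.quantile(e3d, alpha)
    attend_2d = e2d < e3d  # True where the 2D branch is more consistent
    return keep_2d, keep_3d, attend_2d


# Toy example: two voxels, two points each, two classes.
pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2],
                [1.5, 0.1, 0.1], [1.6, 0.1, 0.1]])
p2d = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p3d = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

vox_idx, e2d = st_voxel_entropy(pts, p2d, voxel_size=1.0)
_, e3d = st_voxel_entropy(pts, p3d, voxel_size=1.0)
keep_2d, keep_3d, attend_2d = cross_modal_attend(e2d, e3d, alpha=0.5)
```

In the toy example the 2D predictions agree inside the first voxel and disagree inside the second, while the 3D predictions do the opposite, so the sketch attends to 2D in the first voxel and to 3D in the second.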
