BoTTA: Benchmarking on-device Test Time Adaptation

Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh

TL;DR

BoTTA presents a practical benchmark for evaluating Test Time Adaptation (TTA) under real-world on-device constraints, addressing gaps where prior benchmarks overlook limited data, partial category exposure, diverse and overlapping shifts, and hardware resource limits. It defines five edge-focused scenarios, leverages CIFAR-10C and PACS with ResNet-26/ResNet-50 and ViT architectures, and evaluates methods like TENT, SAR, SHOT, NOTE, and T3A on Raspberry Pi 4B and Jetson Orin Nano with periodic adaptation versus continuous inference-time adaptation. The study finds SHOT often delivers the strongest gains in limited-data settings but struggles with diverse or multi-domain shifts and incurs higher memory usage, while T3A often offers memory savings at the cost of limited accuracy gains. Profiling on-device resources reveals substantial memory and compute demands for many TTA approaches, underscoring the need for edge-tailored algorithms. BoTTA thus provides actionable guidance and a standardized evaluation suite to advance robust, resource-efficient on-device adaptation for real-world deployments.

Abstract

The performance of deep learning models depends heavily on test samples at runtime, and shifts from the training data distribution can significantly reduce accuracy. Test-time adaptation (TTA) addresses this by adapting models during inference without requiring labeled test data or access to the original training set. While research has explored TTA from various perspectives, such as algorithmic complexity, data and class distribution shifts, model architectures, and offline versus continuous learning, constraints specific to mobile and edge devices remain underexplored. We propose BoTTA, a benchmark designed to evaluate TTA methods under practical constraints on mobile and edge devices. Our evaluation targets four key challenges caused by limited resources and usage conditions: (i) limited test samples, (ii) limited exposure to categories, (iii) diverse distribution shifts, and (iv) overlapping shifts within a sample. We assess state-of-the-art TTA methods under these scenarios using benchmark datasets and report system-level metrics on a real testbed. Furthermore, unlike prior work, we align with on-device requirements by advocating periodic adaptation instead of continuous inference-time adaptation. Experiments reveal key insights: many recent TTA algorithms struggle with small datasets, fail to generalize to unseen categories, and depend on the diversity and complexity of distribution shifts. BoTTA also reports device-specific resource use. For example, while SHOT improves accuracy by $2.25\times$ with $512$ adaptation samples, it uses $1.08\times$ the peak memory of the base model on Raspberry Pi. BoTTA offers actionable guidance for TTA in real-world, resource-constrained deployments.
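
To make the distinction between periodic and continuous adaptation concrete, the sketch below buffers unlabeled test inputs, adapts once when the buffer is full, and keeps the model frozen for inference in between. It is a toy illustration only: `ToyModel`, `PeriodicAdapter`, `buffer_size`, and the mean re-estimation step are hypothetical stand-ins chosen for brevity, not the benchmark's code or any of the evaluated methods (TENT, SAR, SHOT, NOTE, T3A).

```python
from statistics import fmean


class ToyModel:
    """Hypothetical 1-D classifier: predicts class 1 when the centered input is positive."""

    def __init__(self, mean=0.0):
        self.mean = mean  # normalization statistic estimated on (clean) training data

    def predict(self, x):
        return 1 if x - self.mean > 0 else 0

    def adapt(self, unlabeled):
        # Stand-in adaptation step (an assumption, not a method from the paper):
        # re-estimate the input mean from unlabeled target samples.
        self.mean = fmean(unlabeled)


class PeriodicAdapter:
    """Buffer unlabeled inputs; adapt once per full buffer, keep the model frozen in between."""

    def __init__(self, model, buffer_size):
        self.model = model
        self.buffer_size = buffer_size
        self.buffer = []
        self.num_adaptations = 0

    def infer(self, x):
        pred = self.model.predict(x)  # inference always uses the currently frozen model
        self.buffer.append(x)
        if len(self.buffer) == self.buffer_size:
            self.model.adapt(self.buffer)  # one adaptation pass over the collected samples
            self.buffer.clear()
            self.num_adaptations += 1
        return pred


# Clean inputs -1 and 1 arrive shifted by +5, so the unadapted model labels everything as 1.
adapter = PeriodicAdapter(ToyModel(mean=0.0), buffer_size=4)
stream = [4.0, 6.0, 4.0, 6.0, 4.0, 6.0]
print([adapter.infer(x) for x in stream])  # [1, 1, 1, 1, 0, 1]
```

In this sketch the first four predictions are made by the unadapted model; after the single adaptation pass the two classes are separated again. A continuous scheme would instead update the model on every incoming batch, which is the cost that periodic adaptation avoids on constrained devices.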

Paper Structure

This paper contains 25 sections, 28 figures, 2 tables.

Figures (28)

  • Figure 1: On-device domain model adaptation settings with an example bird classifier. The pre-trained model (on clean images) is deployed on a phone. The pictures of the birds that the user takes are noisy, e.g., blurred. The model is adapted using the target data and remains frozen for inference thereafter.