Test-time Assessment of a Model's Performance on Unseen Domains via Optimal Transport

Akshay Mehra; Yunbei Zhang; Jihun Hamm

TL;DR

This work addresses test-time estimation of a model's transferability to unseen domains by introducing TETOT, an Optimal Transport (OT)-based metric that uses both source-domain data (or its statistics) and unlabeled target-domain samples. TETOT defines a base distance that combines a feature-space discrepancy, $c_{features} = \|g(x_S) - g(x_T)\|_2$, with a label discrepancy, $c_{labels} = \|y_S - h(g(x_T))\|_2$, weighted by a parameter $\lambda$, and then computes the OT distance between the source and target distributions under this base distance. Across PACS and VLCS and their corrupted versions, TETOT achieves a stronger negative correlation with actual transferability than a widely used entropy-based metric, which makes it practical for architecture selection, source-domain selection, and predicting performance on unseen domains at test time. The method remains feasible with limited data and supports a source-data-free variant that uses only source statistics, which extends its applicability to privacy-conscious or storage-constrained settings.
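Written out, the two ingredients described above can be sketched as follows. The additive combination of the two discrepancies, the coupling set $\Pi(P_S, P_T)$, and the empirical sum over sample pairs $(i, j)$ are assumptions made for illustration; the summary only states that the label term is weighted by $\lambda$.

$$c\big((x_S, y_S), x_T\big) = \|g(x_S) - g(x_T)\|_2 + \lambda \, \|y_S - h(g(x_T))\|_2$$

$$\mathrm{TETOT}(P_S, P_T) = \min_{\pi \in \Pi(P_S, P_T)} \sum_{i,j} \pi_{ij} \, c\big((x_S^{i}, y_S^{i}), x_T^{j}\big)$$

Here $\pi$ is a transport plan whose marginals match the empirical source and target distributions, so a larger value indicates that the target samples are farther from the source in the joint feature and label space.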

Abstract

Gauging the performance of ML models on data from unseen domains at test time is an essential yet challenging problem because labels are unavailable in this setting. Moreover, the performance of these models on in-distribution data is a poor indicator of their performance on data from unseen domains. Thus, it is essential to develop metrics that provide insight into a model's performance at test time and that can be computed using only the information available at test time (such as the model parameters, the training data or its statistics, and the unlabeled test data). To this end, we propose a metric based on Optimal Transport that is highly correlated with the model's performance on unseen domains and is efficiently computable using only information available at test time. Concretely, our metric characterizes the model's performance on unseen domains using only a small amount of unlabeled data from these domains and data or statistics from the training (source) domain(s). Through extensive empirical evaluation on standard benchmark datasets and their corruptions, we demonstrate the utility of our metric for estimating the model's performance in various practical applications. These include selecting the source data and architecture that lead to the best performance on data from an unseen domain, and predicting a deployed model's performance at test time on unseen domains. Our empirical results show that our metric, which uses information from both the source and the unseen domain, is highly correlated with the model's performance, achieving a significantly better correlation than the popular prediction entropy-based metric, which is computed solely from the unseen-domain data.
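To make the comparison in the abstract concrete, the following is a minimal sketch of how an OT-based score and the prediction-entropy baseline could be computed from precomputed features and predictions. It is not the authors' implementation: the function names, the additive cost with weight `lam`, the uniform sample weights, and the use of an entropy-regularized (Sinkhorn) solver in place of an exact OT solver are all assumptions made for illustration.

```python
import numpy as np


def base_cost(src_feats, src_onehot, tgt_feats, tgt_probs, lam=1.0):
    """Pairwise cost matrix: feature distance plus lam times label distance.

    src_feats: (n, d) source features g(x_S)
    src_onehot: (n, k) one-hot source labels y_S
    tgt_feats: (m, d) target features g(x_T)
    tgt_probs: (m, k) model predictions h(g(x_T)) on target samples
    """
    feat = np.linalg.norm(src_feats[:, None, :] - tgt_feats[None, :, :], axis=-1)
    lab = np.linalg.norm(src_onehot[:, None, :] - tgt_probs[None, :, :], axis=-1)
    return feat + lam * lab  # additive combination is an assumption


def sinkhorn_ot(cost, reg=0.05, n_iters=500):
    """Approximate OT cost between uniform empirical distributions.

    Uses plain Sinkhorn iterations; reg is relative to the largest cost
    entry so that the kernel does not underflow.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform source weights (assumption)
    b = np.full(m, 1.0 / m)  # uniform target weights (assumption)
    scale = cost.max() if cost.max() > 0 else 1.0
    kernel = np.exp(-cost / (reg * scale))
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))


def mean_prediction_entropy(tgt_probs, eps=1e-12):
    """Baseline: average Shannon entropy of the target predictions."""
    per_sample = -np.sum(tgt_probs * np.log(tgt_probs + eps), axis=1)
    return float(np.mean(per_sample))
```

A score for one target domain would then be `sinkhorn_ot(base_cost(src_feats, src_onehot, tgt_feats, tgt_probs, lam))`. Note that the entropy baseline takes only `tgt_probs`, whereas the OT-based score also depends on the source features and labels, which is the distinction the abstract draws between the two metrics.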

Paper Structure (15 sections, 6 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Test-time assessment of transferability
Notation and problem setting
Background on Optimal Transport (OT)
Estimating transferability via TETOT
Experiments
Architecture selection for a target domain
Source dataset selection for a target domain
Assessing transferability to unseen domains
Effect of sample size on TETOT
Estimating transferability without source data
Effect of different $\lambda$ in Eq. \ref{eq:final_distance} for TETOT
Conclusion
Acknowledgment

Figures (4)

Figure 1: (Best viewed in color.) Overview of the practical applications for which TETOT can be utilized. The first application (left) is to identify the model architecture that will yield the highest transferability for a particular target domain. The second application (center) is to identify the best source domain data that will produce a model with the highest transferability to the target domain. The third application (right) is to assess the performance of a given model on unseen target domains given only unlabeled data from those domains.
Figure 2: (Best viewed in color.) The superiority of TETOT (top row) in achieving a high (negative) correlation ($\rho$ in the plot titles) with transferability compared to prediction entropy (bottom row) for selecting the best model architecture for making predictions on a given target domain at test time. Models are trained using Cartoon (C) as the source domain (in the single domain setting) and Art (A), Photos (P), and Sketch (S) (in a multi-domain setting) from PACS and evaluated on various target domains.
Figure 3: (Best viewed in color.) The superiority of TETOT (top row) compared to prediction entropy (bottom row) in achieving a high (negative) correlation ($\rho$ in the plot titles) with transferability on unseen domains encountered at test time. Models are trained using multiple source domains and evaluated on an unseen target domain from the PACS dataset. (The black triangle denotes the original data of the target domain, whereas the colored triangles denote the target-domain data corrupted by different corruption types and severity levels.)
Figure 4: The Pearson correlation coefficient between transferability and TETOT remains strongly negative, and better than that of entropy, across different sample sizes of the source and target domains (a stronger negative correlation, i.e., a smaller value, is better).

Theorems & Definitions (1)

Definition 1
