Adaptive Learn-then-Test: Statistically Valid and Efficient Hyperparameter Selection

Matteo Zecchin; Sangwoo Park; Osvaldo Simeone

TL;DR

Adaptive Learn-Then-Test (aLTT) tackles hyperparameter selection by combining sequential, data-dependent multiple hypothesis testing (MHT) with e-processes, which enables anytime-valid testing and early stopping. Unlike the original learn-then-test (LTT) procedure, aLTT supports adaptive data acquisition and reduces the number of testing rounds while preserving finite-sample guarantees under $(\alpha,\delta)$-FWER or $(\alpha,\delta)$-FDR control. The approach is demonstrated on online policy selection for offline reinforcement learning and on reliable automated prompt engineering, where it matches the performance of LTT with only a fraction of the testing rounds and yields shorter, more efficient prompts. The framework targets safer, more data-efficient calibration of AI applications, with potential extensions to distribution shifts and simulation-aided calibration.

Abstract

We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and prompt engineering, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds.

Paper Structure (39 sections, 21 equations, 12 figures)

Introduction
Context and Motivation
Related Work
Main Contributions
Problem Definition
Setting
Performance Criteria
Sequential and Adaptive Hyperparameter Selection
(Non-Adaptive) Learn-then-Test
Adaptive Learn-Then-Test
Hypothesis Testing via E-Processes
Adaptive Acquisition Policy
Hyperparameter Subset Selection
Applications
Online Policy Selection for Offline Reinforcement Learning
...and 24 more sections

Figures (12)

Figure 1: An example application of aLTT to reliable prompt optimization [zhou2023large, quach2023conformal, schneider2024hyperband]. A set $\Lambda$ of candidate prompts for a movie recommender is generated using an LLM and/or prior experience. Prompts serve as an example of a discrete set of hyperparameters to be optimized using aLTT. The goal is to select a subset $\hat{\Lambda}^{\rm rel} \subseteq \Lambda$ of prompts that guarantee a sufficiently high recommendation accuracy. To this end, aLTT applies a sequence of data-dependent testing rounds with adaptive termination. Specifically, at each testing round $t$, aLTT estimates the performance of a subset of hyperparameters $\mathcal{I}^t \subseteq \Lambda$ using held-out data or real-world testing. The subset to be tested is selected based on prior testing outcomes, and the process stops as soon as a sufficiently large reliable subset $\hat{\Lambda}^{\rm rel}$ is identified. An additional post-calibration selection step can be applied to choose a single hyperparameter $\hat{\lambda}$ from the selected subset $\hat{\Lambda}^{\rm rel}$ based on user preferences.
Figure 2: True positive rate of LTT and aLTT with $\epsilon$-greedy acquisition policy for $\epsilon\in\{0.25,0.5,0.75,0.95\}$ and with non-adaptive acquisition. In the left panel, the prediction sets satisfy FWER control, while in the right panel they satisfy FDR control. In both cases, the tolerance level is $\delta=0.1$.
Figure 3: Comparison of the FWER and FDR levels obtained by aLTT under FDR-control (solid lines) and FWER-control (dashed lines) for different maximum tolerated error (FWER or FDR) levels $\delta$.
Figure 4: True positive rate of LTT and aLTT with $\epsilon$-greedy acquisition policy for $\epsilon\in\{0.25,0.5,0.75,0.95\}$ and with non-adaptive acquisition. In the left panel, the prediction sets satisfy FWER control, while in the right panel they satisfy FDR control. In both cases, the tolerance level is $\delta=0.1$. Results are averaged over the tasks in [honovich2022instruction] that yield a non-empty set of reliable prompts.
Figure 5: Length of the shortest instruction in the predicted set of reliable hyperparameters $\hat{\Lambda}^{\rm rel}$ returned by LTT and aLTT. Instructions are tested under different accuracy requirements and with a fixed testing budget of $T=2000$ rounds.
...and 7 more figures
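
The testing loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes losses bounded in $[0,1]$, a fixed-bet e-process of the form $\prod_s \bigl(1 + \mu(\alpha - \ell_s)\bigr)$ with $0 < \mu \le 1/(1-\alpha)$, a Bonferroni-style rejection threshold $|\Lambda|/\delta$ for FWER control, one candidate tested per round, and an $\epsilon$-greedy rule in which `eps` is the probability of uniform exploration (the paper's convention for $\epsilon$ may differ). The names `altt_fwer_sketch` and `sample_loss` are hypothetical.

```python
import random


def altt_fwer_sketch(sample_loss, n_candidates, alpha, delta,
                     max_rounds, target_size, eps=0.25, bet=None, seed=0):
    """Illustrative adaptive testing loop with e-processes (assumed design).

    sample_loss(i, rng) must return a loss in [0, 1] for candidate i.
    Candidate i is declared reliable (risk <= alpha) once its e-process
    reaches n_candidates / delta, a Bonferroni-style FWER threshold.
    """
    rng = random.Random(seed)
    if bet is None:
        # Assumed fixed bet; any value in (0, 1 / (1 - alpha)] keeps
        # every factor 1 + bet * (alpha - loss) non-negative.
        bet = 0.5 / (1.0 - alpha)
    threshold = n_candidates / delta
    e_values = [1.0] * n_candidates   # one e-process per candidate
    reliable = set()
    rounds_used = 0

    for _ in range(max_rounds):
        # Early termination: stop once enough candidates are certified.
        if len(reliable) >= target_size:
            break
        pending = [i for i in range(n_candidates) if i not in reliable]
        if not pending:
            break
        # Adaptive acquisition: the choice depends only on past outcomes.
        if rng.random() < eps:
            i = rng.choice(pending)                        # explore
        else:
            i = max(pending, key=lambda j: e_values[j])    # exploit
        loss = sample_loss(i, rng)
        e_values[i] *= 1.0 + bet * (alpha - loss)
        rounds_used += 1
        if e_values[i] >= threshold:
            reliable.add(i)

    return reliable, rounds_used


if __name__ == "__main__":
    # Hypothetical example: Bernoulli losses with made-up true risks.
    true_risk = [0.05, 0.10, 0.40, 0.60]
    draw = lambda i, rng: 1.0 if rng.random() < true_risk[i] else 0.0
    print(altt_fwer_sketch(draw, len(true_risk), alpha=0.2, delta=0.1,
                           max_rounds=2000, target_size=2))
```

Because each e-process is updated only with data collected after the decision to test that candidate, the data-dependent choice of which candidate to test does not invalidate the anytime-valid guarantee in this sketch. A fixed bet is the simplest choice here; adaptive betting strategies generally grow the e-process faster.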

Theorems & Definitions (2)

Definition 2.1: $(\alpha,\delta)$-FWER-controlling set
Definition 2.2: $(\alpha,\delta)$-FDR-controlling set
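
Only the names of these definitions appear in this listing. As a reading aid, the two criteria are commonly formalized as follows, assuming $R(\lambda)$ denotes the population risk of hyperparameter $\lambda$, $\alpha$ the maximum tolerated risk, and $\delta$ the tolerated error level; the exact statements in the paper may differ in notation.

$$
\Pr\!\left[\exists\, \lambda \in \hat{\Lambda}^{\rm rel} : R(\lambda) > \alpha\right] \le \delta
\qquad \text{(FWER control)}
$$

$$
\mathbb{E}\!\left[\frac{\bigl|\{\lambda \in \hat{\Lambda}^{\rm rel} : R(\lambda) > \alpha\}\bigr|}{\max\bigl(|\hat{\Lambda}^{\rm rel}|,\, 1\bigr)}\right] \le \delta
\qquad \text{(FDR control)}
$$

FWER control bounds the probability of including any unreliable hyperparameter, whereas FDR control bounds the expected fraction of unreliable hyperparameters in the selected set, which is a weaker requirement that typically allows larger selected sets.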
