Goal-Oriented Influence-Maximizing Data Acquisition for Learning and Optimization

Weichi Yao; Bianca Dumitrascu; Bryan R. Goldsmith; Yixin Wang

TL;DR

Goal-Oriented Influence-Maximizing Data Acquisition (GOIMDA) is a posterior-free, uncertainty-aware framework for active data acquisition that directly targets a user-defined goal functional $\mathcal{G}$ via a first-order influence score. By combining an inverse-Hessian curvature preconditioner, the gradient of the goal, and candidate sensitivity, GOIMDA balances exploration and exploitation without maintaining a Bayesian posterior, and aligns acquisition with the specific scientific objective. Theoretical results under exponential-family models reveal a link to predictive-entropy minimization, modulated by goal alignment and prediction bias, while practical implementations use jackknife deep ensembles and implicit inverse-Hessian-vector products for scalability. Empirically, GOIMDA consistently outperforms uncertainty-based active learning (AL) and Gaussian-process-based Bayesian optimization (BO) across noisy optimization, hyperparameter tuning under distribution shift, and predictive learning tasks (MNIST, EMNIST, Rotten Tomatoes), often with substantially fewer labeled points or function evaluations. The approach offers a versatile, scalable alternative to Bayesian uncertainty-based acquisition with broad applicability to learning and optimization problems.

Abstract

Active data acquisition is central to many learning and optimization tasks in deep neural networks, yet remains challenging because most approaches rely on predictive uncertainty estimates that are difficult to obtain reliably. To this end, we propose Goal-Oriented Influence-Maximizing Data Acquisition (GOIMDA), an active acquisition algorithm that avoids explicit posterior inference while remaining uncertainty-aware through inverse curvature. GOIMDA selects inputs by maximizing their expected influence on a user-specified goal functional, such as test loss, predictive entropy, or the value of an optimizer-recommended design. Leveraging first-order influence functions, we derive a tractable acquisition rule that combines the goal gradient, training-loss curvature, and candidate sensitivity to model parameters. We show theoretically that, for generalized linear models, GOIMDA approximates predictive-entropy minimization up to a correction term accounting for goal alignment and prediction bias, thereby yielding uncertainty-aware behavior without maintaining a Bayesian posterior. Empirically, across learning tasks (including image and text classification) and optimization tasks (including noisy global optimization benchmarks and neural-network hyperparameter tuning), GOIMDA consistently reaches target performance with substantially fewer labeled samples or function evaluations than uncertainty-based active learning and Gaussian-process Bayesian optimization baselines.

Paper Structure (34 sections, 2 theorems, 54 equations, 4 figures, 1 table, 1 algorithm)

This paper contains 34 sections, 2 theorems, 54 equations, 4 figures, 1 table, 1 algorithm.

Introduction
Goal-Oriented Influence-Maximizing Data Acquisition
Goal Objective Function
Influence Function
Goal-based informativeness
Influence function approximation
Closed-form influence score
Practical and Scalable Implementation
Theoretical Properties of GOIMDA under Exponential Family Models
Example Applications of GOIMDA
Example Application I: Iterative Global Optimization with Noisy Observations
Example Application II: Hyperparameter Optimization
Example Application III: Deep Active Learning
Empirical Studies
On the importance of the parameter-bias term in (eq:goi_exp_breakdown_param_bias)
...and 19 more sections

Key Result

Proposition 2

The influence score of a candidate point $(x_{\mathrm{c}}, y_{\mathrm{c}})$ on the goal minimization objective $\mathcal{G}$, defined in (eq:influence_function_on_goal_minimization), approximates the reduction in $\mathcal{G}$ after adding $x_{\mathrm c}$ to $\mathcal{D}$.
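
To make the statement concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes a ridge-regression model, a goal $\mathcal{G}$ equal to the mean squared test loss, and known candidate labels $y_{\mathrm{c}}$, and it uses the standard first-order influence form $\nabla\mathcal{G}(\hat\theta)^\top H^{-1} \nabla\ell(x_{\mathrm{c}}, y_{\mathrm{c}}; \hat\theta)$ as the predicted reduction in $\mathcal{G}$; the paper's exact score may differ. All names (`fit`, `goal`, `lam`, and so on) are hypothetical, and the ridge penalty is deliberately large so that the fitted model is biased and the goal gradient is not negligible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, n_cand = 5, 100, 200, 50
lam, noise_sd = 50.0, 0.3                  # assumed values, chosen for illustration
theta_true = np.array([1.5, -2.0, 0.5, 1.0, -1.0])

def sample(n):
    """Draw n (x, y) pairs from a linear model with Gaussian noise."""
    X = rng.normal(size=(n, d))
    y = X @ theta_true + noise_sd * rng.normal(size=n)
    return X, y

X, y = sample(n_train)                     # current dataset D
X_test, y_test = sample(n_test)            # data defining the goal G
X_cand, y_cand = sample(n_cand)            # candidate pool (labels known here)

def fit(X, y):
    """Ridge fit; returns the minimizer and the training-loss Hessian."""
    H = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(H, X.T @ y), H

def goal(theta):
    """Goal functional G: half the mean squared error on the test set."""
    r = X_test @ theta - y_test
    return 0.5 * np.mean(r ** 2)

theta_hat, H = fit(X, y)

# Gradient of the goal at the current fit.
goal_grad = X_test.T @ (X_test @ theta_hat - y_test) / n_test

# Gradient of each candidate's loss 0.5 * (x^T theta - y)^2 at theta_hat.
resid = X_cand @ theta_hat - y_cand
cand_grads = X_cand * resid[:, None]

# First-order predicted reduction in G for every candidate:
#   grad G^T H^{-1} grad loss_c   (H is symmetric, so one solve suffices).
predicted = cand_grads @ np.linalg.solve(H, goal_grad)

# Ground truth: refit with each candidate added and measure the change in G.
actual = np.empty(n_cand)
for i in range(n_cand):
    X_aug = np.vstack([X, X_cand[i]])
    y_aug = np.append(y, y_cand[i])
    theta_new, _ = fit(X_aug, y_aug)
    actual[i] = goal(theta_hat) - goal(theta_new)

corr = np.corrcoef(predicted, actual)[0, 1]
best = int(np.argmax(predicted))
print(f"correlation between predicted and actual reduction: {corr:.3f}")
print(f"top-scored candidate {best}: predicted {predicted[best]:.5f}, "
      f"actual {actual[best]:.5f}")
```

In this linear setting the exact parameter update differs from the first-order one only by the factor $1/(1 + x_{\mathrm{c}}^\top H^{-1} x_{\mathrm{c}})$, so the two quantities track each other closely when the candidate's leverage is small. In an acquisition loop the label $y_{\mathrm{c}}$ is unknown, which is why the abstract speaks of expected influence.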

Figures (4)

Figure 1: GOIMDA reaches lower immediate regret with fewer acquisitions than Bayesian optimization baselines on noisy objective functions. The Bayesian optimization baselines are Gaussian-process-based, with the following acquisition functions: upper confidence bound (GP+UCB), expected improvement (GP+EI), probability of improvement (GP+PI), max-value entropy search (GP+MES), and knowledge gradient (GP+KG). Immediate regret is reported at each acquisition step for the Branin, Ackley, and Dropwave benchmarks under two noise levels ($\sigma^2=0.01,0.04$). Solid/dashed curves show the mean performance across runs, and shaded regions denote bootstrapped 95% confidence intervals of the mean (computed across runs). Across all tasks, GOIMDA consistently achieves lower regret earlier in the acquisition process, with the advantage generally becoming more pronounced at higher noise levels.
Figure 2: The parameter-bias term is crucial for effective data acquisition, and a good approximation significantly improves the acquisition performance. Using the true bias term (blue) reaches high accuracy with fewer acquisitions, and a principled approximation (orange) outperforms ignoring the bias term (dark gray dashed). The inset zoom highlights the persistent gap. Shaded regions indicate variability across runs. Test accuracy is shown versus the number of acquired labels (averaged over 200 repetitions). Higher is better.
Figure 3: GOIMDA selects the hyperparameters that provide the best prediction performance on the target test set compared to Bayesian optimization methods. The performance is evaluated in terms of the $\Delta$ test regret relative to the initial configuration (step 0) across acquisition steps. Solid line shows GOIMDA; dashed lines show Bayesian optimization baselines. Shaded regions indicate 95% bootstrap confidence intervals of the mean across 3 trials. Negative values correspond to improved target-set performance.
Figure 4: GOIMDA outperforms both random acquisition and BALD on classification tasks on images of digits from MNIST (Left) and EMNIST (Middle), and on the sentiment of movie reviews from the Rotten Tomatoes dataset (Right). The test accuracy is evaluated at each acquisition step. Left: GOIMDA outperforms both random acquisition and BALD on MNIST. Middle: GOIMDA outperforms both random acquisition and BALD on EMNIST, whereas BALD performs only slightly better than random acquisition. Right: Both GOIMDA and BALD outperform random sampling in terms of accuracy at a given acquired dataset size. While GOIMDA and BALD perform similarly, GOIMDA gives slightly higher accuracy once the total acquisition size reaches $1,000$.

Theorems & Definitions (5)

Definition 1: Informativeness
Proposition 2
proof
Proposition 3
proof
