POP: Prior-fitted Optimizer Policies

Jan Kobiolka; Christian Frey; Gresa Shala; Arlind Kadra; Erind Bedalli; Josif Grabocka

TL;DR

POP (Prior-fitted Optimizer Policies), a meta-learned optimizer that predicts coordinate-wise step sizes conditioned on the contextual information provided in the optimization trajectory, demonstrates strong generalization capabilities without task-specific tuning.

Abstract

Optimization refers to the task of finding extrema of an objective function. Classical gradient-based optimizers are highly sensitive to hyperparameter choices. In highly non-convex settings their performance relies on carefully tuned learning rates, momentum, and gradient accumulation. To address these limitations, we introduce POP (Prior-fitted Optimizer Policies), a meta-learned optimizer that predicts coordinate-wise step sizes conditioned on the contextual information provided in the optimization trajectory. Our model is learned on millions of synthetic optimization problems sampled from a novel prior spanning both convex and non-convex objectives. We evaluate POP on an established benchmark including 47 optimization functions of various complexity, where it consistently outperforms first-order gradient-based methods, non-convex optimization approaches (e.g., evolutionary strategies), Bayesian optimization, and a recent meta-learned competitor under matched budget constraints. Our evaluation demonstrates strong generalization capabilities without task-specific tuning.
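To make "coordinate-wise step sizes conditioned on the optimization trajectory" concrete, the following is a minimal sketch of such an optimization loop. It is not the authors' implementation: `placeholder_policy` is a hypothetical hand-written sign-agreement heuristic (similar in spirit to Rprop) standing in for the learned POP model, and the toy `quadratic` objective, the trajectory layout, and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np

def quadratic(x):
    """Toy objective (assumption): curvature differs strongly per coordinate."""
    scales = np.array([1.0, 25.0])
    return 0.5 * np.sum(scales * x**2), scales * x

def placeholder_policy(trajectory, init_step=0.01, grow=1.2, shrink=0.5):
    """Hypothetical stand-in for a learned policy.

    Reads the trajectory so far and returns one step size per coordinate:
    grow the previous step where the last two gradients agree in sign,
    shrink it where the sign flipped.
    """
    last = trajectory[-1]
    if len(trajectory) < 2:
        return np.full_like(last["grad"], init_step)
    prev = trajectory[-2]
    agree = np.sign(prev["grad"]) == np.sign(last["grad"])
    return np.where(agree, prev["step"] * grow, prev["step"] * shrink)

def run(objective, x0, policy, n_steps=100):
    """Optimization loop: the policy sees the whole trajectory at every step."""
    x = np.asarray(x0, dtype=float)
    trajectory = []
    for _ in range(n_steps):
        value, grad = objective(x)
        trajectory.append({"x": x.copy(), "value": value, "grad": grad, "step": None})
        step = policy(trajectory)      # vector with one entry per coordinate
        trajectory[-1]["step"] = step
        x = x - step * grad            # element-wise (coordinate-wise) update
    return x, trajectory

x_final, trajectory = run(quadratic, [3.0, -2.0], placeholder_policy)
initial_value = trajectory[0]["value"]
final_value = quadratic(x_final)[0]
```

In this sketch the low-curvature coordinate ends up with a much larger step size than the high-curvature one, without any per-problem learning-rate tuning. A learned policy would replace the heuristic with a model that maps the trajectory context to the step vector.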

Paper Structure (30 sections, 29 equations, 17 figures, 6 tables)

Introduction
POP: Prior-fitted Optimizer
MDP for Optimization
Optimization Step
Reward Signal
Meta-Learning Objective
A Prior for Optimization Problems
In-Distribution Generalization
Experimental Protocol
Architecture and Training
Benchmarks
Baselines
Evaluation
Hypothesis 1: Learned optimizers trained on synthetic priors can generalize to unseen problems from the same distribution.
Hypothesis 2: Learned coordinate optimizers trained on a low-dimensional (2D) prior for a fixed number of iterations generalize to longer optimization horizons and higher-dimensional problems.
...and 15 more sections

Figures (17)

Figure 1: POP (blue) adapts its learning rate based on the optimization landscape, enabling rapid convergence, escape from local minima, and improved global optimization compared to Adam (red). The yellow diamond marks the global minimum, the white cross marks the start state, and the square marks the end state.
Figure 2: Meta-learning reward and evaluation validation performance of our POP agent.
Figure 3: In-distribution test set performance vs. baselines. Mean normalized improvement over steps; shading indicates 95% CIs. Dashed line marks the context/optimization boundary.
Figure 4: Method rankings on the in-distribution test set at 100% budget. Lower ranks correspond to better performance, while horizontal bars indicate differences that are not statistically significant.
Figure 5: In-distribution test set performance vs. baselines at twice the training budget. Mean normalized improvement over steps; shading indicates 95% CIs. Dashed line marks the context/optimization boundary.
...and 12 more figures

Theorems & Definitions (1)

Definition 2.1: Optimization trajectory
