Optimal control of the future via prospective learning with control

Yuxin Bai, Aranyak Acharyya, Ashwin De Silva, Zeyu Shen, James Hassett, Joshua T. Vogelstein

TL;DR

This work introduces Prospective Learning with Control (PL+C), a supervised-learning–based framework for learning in non-stationary, reset-free environments where actions influence future dynamics. It proves that ERM asymptotically achieves the Bayes-optimal policy under mild assumptions and provides ProForg, an online algorithm with warm-start, separate estimators for instantaneous and cumulative losses, and lookahead-based inference. Through Prospective Foraging, the paper demonstrates that ProForg learns far more efficiently than standard RL baselines (e.g., FQI, SAC) in non-stationary settings and can operate online with strong performance guarantees. Theoretical results establish convergence to Bayes optimality in expectation, and empirical findings suggest PL+C offers a viable path toward robust, future-oriented control in natural and artificial agents.

Abstract

Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the main workhorse for the recent achievements in AI. Moreover, RL typically operates in a stationary environment with episodic resets, limiting its utility. Here, we extend supervised learning to address learning to control in non-stationary, reset-free environments. Using this framework, called "Prospective Learning with Control" (PL+C), we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective learning with control, foraging, which is a canonical task for any mobile agent, be it natural or artificial. We illustrate that modern RL algorithms fail to learn in these non-stationary reset-free environments, and even with modifications, they are orders of magnitude less efficient than our prospective foraging agents.

Paper Structure

This paper contains 23 sections, 4 theorems, 40 equations, 4 figures, 3 algorithms.

Key Result

Lemma 5.1

Suppose $\lbrace Z_t \rbrace_{t=1}^{\infty}$ is a stochastic process and let $\lbrace \mathcal{H}_t \rbrace_{t=1}^{\infty}$ be an increasing hypothesis class, such that there exists $h^{(t)} \in \mathcal{H}_t$ which satisfies $\lim_{t \to \infty} \mathbb{E}[R_t(h^{(t)})-R_t^*]=0.$ Additionally, supp

Figures (4)

  • Figure 1: ProForg efficiently achieves Bayes optimal regret. Normalized prospective regret of ProForg (red), time-aware Fitted Q-Iteration (FQI with time, blue-purple; our modification to improve FQI), time-agnostic Fitted Q-Iteration (FQI w/o time, light blue; ernst2005tree), time-aware Soft Actor-Critic (SAC with time, purple-red), and time-agnostic Soft Actor-Critic (SAC w/o time, lavender; haarnoja2018soft). ProForg, time-aware FQI, and time-aware SAC all converge to zero regret, but ProForg is orders of magnitude more efficient than either baseline. The time-agnostic variants converge to a suboptimal regret regardless of how much time is spent interacting.
  • Figure 2: ProForg online is severalfold more efficient than offline. Normalized prospective regret for ProForg online (red) and offline (pink). After warm-starting with 200 time steps, the online variant converges in 20 time steps, whereas the offline variant requires about 4$\times$ more data to converge.
  • Figure 3: Normalized prospective regret for ProForg (red), ProForg-I (orange), and ProForg-C (yellow). Removing either component reduces performance relative to ProForg.
  • Figure 4: ProForg with decision forests is 4$\times$ more efficient than with neural networks. Normalized prospective regret for ProForg with Gradient-Boosted Trees (red) and MLP Regressor (blue, ProForg-NN). Although the tree-based ProForg is 4$\times$ more efficient, ProForg-NN also converges.
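
The gap between time-aware and time-agnostic learners reported in Figure 1 can be illustrated with a toy sketch. This is not the paper's ProForg algorithm or its foraging environment. It is a minimal example under stated assumptions: a hypothetical two-patch world in which the rewarding patch alternates every `PERIOD` steps, the period is known to the time-aware learner, and actions do not affect future dynamics. All names (`PERIOD`, `rewarded_patch`, `fit_time_agnostic`, `fit_time_aware`) are invented for illustration.

```python
from collections import defaultdict

PERIOD = 5  # assumption: the rewarding patch switches every PERIOD steps


def rewarded_patch(t):
    """Toy non-stationary environment: the rewarding patch alternates over time."""
    return (t // PERIOD) % 2


def loss(action, t):
    """Zero loss when the agent visits the rewarding patch at time t, else one."""
    return 0 if action == rewarded_patch(t) else 1


def fit_time_agnostic(history_times):
    """ERM over constant policies: a single action used at every time step."""
    totals = {a: sum(loss(a, t) for t in history_times) for a in (0, 1)}
    best = min(totals, key=totals.get)
    return lambda t: best


def fit_time_aware(history_times):
    """ERM over policies that may depend on the phase t mod (2 * PERIOD)."""
    totals = defaultdict(lambda: [0, 0])
    for t in history_times:
        phase = t % (2 * PERIOD)
        for a in (0, 1):
            totals[phase][a] += loss(a, t)
    # For each phase, keep the action with the lowest empirical loss.
    table = {phase: min((0, 1), key=lambda a: tot[a]) for phase, tot in totals.items()}
    return lambda t: table.get(t % (2 * PERIOD), 0)


def future_loss(policy, start, horizon):
    """Average loss on time steps after the training data ends."""
    return sum(loss(policy(t), t) for t in range(start, start + horizon)) / horizon


history = range(0, 40)                      # past time steps used for fitting
agnostic = fit_time_agnostic(history)
aware = fit_time_aware(history)
agnostic_loss = future_loss(agnostic, 40, 20)
aware_loss = future_loss(aware, 40, 20)
print(agnostic_loss, aware_loss)
```

In this toy setting the constant policy is wrong half the time on future steps no matter how much history it sees, while the phase-dependent policy makes no mistakes. That mirrors, in a much simpler form, the qualitative pattern the caption describes for time-agnostic versus time-aware baselines.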

Theorems & Definitions (4)

  • Lemma 5.1
  • Theorem 5.1
  • Lemma B.1
  • Theorem B.1