DPO: A Differential and Pointwise Control Approach to Reinforcement Learning

Minh Nguyen; Chandrajit Bajaj

TL;DR

The paper addresses poor sample efficiency and the lack of physical consistency in reinforcement learning for scientific computing by reframing RL as a continuous-time control problem via a differential dual formulation with a Hamiltonian structure. It introduces Differential Policy Optimization (DPO), a stage-wise, pointwise update rule that learns a local trajectory operator, so that learned trajectories stay consistent with the system dynamics. The authors establish pointwise convergence guarantees and a regret bound of $O(K^{5/6})$, and demonstrate empirically that DPO outperforms standard baselines on surface modeling, grid-based modeling, and molecular dynamics in low-data settings. The differential dual embeds physics priors into RL, which makes learning more data-efficient in physics-constrained environments. The work also provides a foundation for future extensions to adaptive discretization and to domains beyond physics-informed control.

Abstract

Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective via a differential dual formulation. This induces a Hamiltonian structure that embeds physics priors and ensures consistent trajectories without requiring explicit constraints. To implement Differential RL, we develop Differential Policy Optimization (DPO), a pointwise, stage-wise algorithm that refines local movement operators along the trajectory for improved sample efficiency and dynamic alignment. We establish pointwise convergence guarantees, a property not available in standard RL, and derive a competitive theoretical regret bound of $O(K^{5/6})$. Empirically, DPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.
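As a short worked step on what the $O(K^{5/6})$ regret bound implies: the bound is sublinear, so the average regret vanishes as $K$ grows. The sketch below assumes that $K$ counts episodes and that $C$ is a hypothetical constant hidden by the $O(\cdot)$ notation; neither is specified in this overview.

```latex
\frac{\mathrm{Regret}(K)}{K} \;\le\; \frac{C\,K^{5/6}}{K} \;=\; C\,K^{-1/6} \;\xrightarrow[K \to \infty]{}\; 0
```

For instance, ignoring the constant, $K = 10^{6}$ gives $K^{-1/6} = 10^{-1}$, which shows that the decay of the average regret is slow.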

Paper Structure (19 sections, 12 theorems, 58 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Differential reinforcement learning
Problem formulation
Differential policy optimization (DPO) algorithm
Application to scientific computing
Theoretical analysis
Pointwise convergence and sample complexity
Regret bound analysis
Experiments
Evaluation tasks
Experimental results
Conclusion
Basic algorithmic learning theory
Proofs of theorems and corollaries in Section 3
Computational details
...and 4 more sections

Key Result

Theorem 3.2

Suppose that we are given a threshold error $\epsilon$, a probability threshold $\delta$, and a number of steps per episode $H$. Assume that $\left\{ N_k \right\}_{k = 1}^{H-1}$ is the sequence of numbers of samples used at each stage of the DPO algorithm so that: [...] Here $\delta_k = \delta/3^{H-k} = 3 \delta_{k-1}$. We further assume that there exists a Lipschitz constant $L > 0$ such that both the tru...
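The per-stage probability thresholds $\delta_k = \delta/3^{H-k}$ can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `stage_failure_budgets` and the values of `delta` and `H` are hypothetical. It confirms the recurrence $\delta_k = 3\delta_{k-1}$ and evaluates the geometric sum of the thresholds. Reading that sum as a union bound over stages is an assumption on our part, since the excerpt does not say how the $\delta_k$ are combined.

```python
from fractions import Fraction


def stage_failure_budgets(delta, H):
    """Return [delta_1, ..., delta_{H-1}] with delta_k = delta / 3**(H - k).

    Hypothetical helper for illustration only; exact rationals avoid
    floating-point noise in the checks below.
    """
    return [delta / Fraction(3) ** (H - k) for k in range(1, H)]


delta = Fraction(1, 10)  # illustrative probability threshold (assumed value)
H = 5                    # illustrative number of steps per episode (assumed value)
budgets = stage_failure_budgets(delta, H)

# The recurrence from the theorem statement: delta_k = 3 * delta_{k-1}.
assert all(budgets[i] == 3 * budgets[i - 1] for i in range(1, len(budgets)))

# Geometric sum over stages: delta * (1 - 3**-(H-1)) / 2, which is below delta / 2.
total = sum(budgets)
assert total == delta * (1 - Fraction(1, 3 ** (H - 1))) / 2
assert total < delta / 2
```

Under that union-bound reading, the stage-wise thresholds together use less than half of the overall threshold $\delta$, whatever the horizon $H$.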

Figures (4)

Figure : (a) Surface modeling
Figure : (b) Grid-based modeling
Figure : (c) Molecular dynamics

Theorems & Definitions (25)

Definition 2.1
Definition 3.1
Theorem 3.2
proof
Corollary 3.3
Corollary 3.4
Corollary 3.5
proof
Corollary 3.6
proof
...and 15 more
