MePoly: Max Entropy Polynomial Policy Optimization

Hang Liu; Sangli Teng; Maani Ghaffari

TL;DR

MePoly is a novel policy parameterization based on polynomial energy-based models. It provides an explicit, tractable probability density, enabling exact entropy maximization in Stochastic Optimal Control.

Abstract

Stochastic Optimal Control provides a unified mathematical framework for solving complex decision-making problems, encompassing paradigms such as maximum entropy reinforcement learning (RL) and imitation learning (IL). However, conventional parametric policies often struggle to represent the multi-modality of the solutions. Although diffusion-based policies aim to recover this multi-modality, they lack an explicit probability density, which complicates policy-gradient optimization. To bridge this gap, we propose MePoly, a novel policy parameterization based on polynomial energy-based models. MePoly provides an explicit, tractable probability density, enabling exact entropy maximization. Theoretically, we ground our method in the classical moment problem, leveraging its universal approximation capabilities for arbitrary distributions. Empirically, we demonstrate that MePoly effectively captures complex non-convex manifolds and outperforms baselines across diverse benchmarks.
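As a rough illustration of what an explicit, tractable density makes possible, the sketch below builds a one-dimensional polynomial energy-based density on $[-1, 1]$ and evaluates its log-probability and entropy with Gauss-Legendre quadrature. The 1-D action space, the sign convention $p(a) \propto \exp(-E(a))$, the coefficient values, and the names `poly_density` and `entropy` are all assumptions made for illustration; the paper's actual parameterization may differ, and the entropy here is exact only up to quadrature error.

```python
import numpy as np
from numpy.polynomial import legendre


def poly_density(theta, n_quad=128):
    """Quadrature nodes, weights, and log-density of p(a) ∝ exp(-E(a)) on [-1, 1].

    E(a) = sum_k theta[k] * P_k(a), with P_k the Legendre polynomials.
    This form is an assumption for illustration, not the paper's definition.
    """
    nodes, weights = legendre.leggauss(n_quad)      # Gauss-Legendre rule on [-1, 1]
    log_unnorm = -legendre.legval(nodes, theta)     # -E(a) at the nodes
    shift = log_unnorm.max()                        # stabilise the log-sum-exp
    log_z = shift + np.log(np.sum(weights * np.exp(log_unnorm - shift)))
    return nodes, weights, log_unnorm - log_z       # log p(a) at the nodes


def entropy(theta, n_quad=128):
    """Differential entropy H[p] = -∫ p(a) log p(a) da, by quadrature."""
    _, weights, log_p = poly_density(theta, n_quad)
    return -np.sum(weights * np.exp(log_p) * log_p)


# Zero energy gives the uniform density on [-1, 1], whose entropy is log 2.
theta_uniform = np.zeros(5)
# Hypothetical coefficients: a double-well energy, hence a bimodal density.
theta_bimodal = np.array([0.0, 0.0, -4.0, 0.0, 6.0])

print("uniform entropy:", entropy(theta_uniform))
print("bimodal entropy:", entropy(theta_bimodal))
```

Because the log-density is available in closed form up to the normalizer, both the policy-gradient term and the entropy bonus can be computed directly from the coefficients, without the iterative sampling a diffusion policy would need.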

Paper Structure (40 sections, 2 theorems, 19 equations, 4 figures, 1 table)

Introduction
Max Entropy Policy Optimization
Multimodal Parameterization
Preliminaries
Problem Formulation: MDPs
Maximum Entropy Framework for RL and IL
MaxEnt Reinforcement Learning.
MaxEnt Imitation Learning.
Unified View.
Polynomial Distribution Policy
Polynomial Distribution
Monomial Basis Definition.
Polynomial Energy and Joint Distribution.
Legendre Polynomials.
Energy-Based Parameterization.
...and 25 more sections

Key Result

Theorem 3.4

Consider the Max-Entropy distribution $p_K^*(a)$ constrained by moments up to order $K$. As the polynomial order $K \to \infty$, there exists a unique limit $p_{\infty}$ to which $p_K^*(a)$ converges in the $L_1$ norm: $\lim_{K \to \infty} \| p_K^* - p_{\infty} \|_{L_1} = 0$.
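For context, the link between moment constraints and a polynomial energy can be written out in a simplified setting. The block below assumes a one-dimensional action on a compact set $\mathcal{A}$ and target moments $m_k$ (with $m_0 = 1$); these symbols, and the sign convention for the multipliers $\lambda_k$, are illustrative assumptions rather than the paper's notation. When a solution exists, stationarity of the Lagrangian gives a density whose log is a polynomial of order $K$.

```latex
% Assumed 1-D setting on a compact action set \mathcal{A}; m_k are target moments, m_0 = 1.
\begin{aligned}
p_K^* &= \arg\max_{p}\; -\int_{\mathcal{A}} p(a)\,\log p(a)\,da
\quad \text{s.t.} \quad \int_{\mathcal{A}} a^k\, p(a)\,da = m_k,\quad k = 0, \dots, K, \\
p_K^*(a) &= \frac{1}{Z(\lambda)} \exp\!\Big(-\sum_{k=1}^{K} \lambda_k a^k\Big),
\qquad Z(\lambda) = \int_{\mathcal{A}} \exp\!\Big(-\sum_{k=1}^{K} \lambda_k a^k\Big)\,da .
\end{aligned}
```

Read this way, raising $K$ adds moment constraints and enlarges the family of polynomial energies, which is the sequence whose limit the theorem describes.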

Figures (4)

Figure 1: Conceptual comparison of policy parameterizations on a non-convex action manifold. (A) A unimodal Gaussian concentrates mass around a single mean, leading to limited expressivity and mode collapse. (B) Diffusion can represent complex supports via iterative sampling but lacks a tractable likelihood and requires multi-step generation. (C) MePoly yields an expressive, explicit density that can conform to the manifold while retaining tractable log-probabilities and entropies for learning.
Figure 2: Trajectory samples on multi-modal navigation tasks: Each column is a different environment. Each row corresponds to a different method: MePoly (ours), PPO (Gaussian), PPO (Gaussian Mixture Model), and FPO (Flow-Matching). In every panel, we visualize multiple rollouts from the same start state (white circle) under identical environment layouts: walls/obstacles are shown in gray, goal regions in green, and death/unsafe regions in red (see legend). Trajectories are color-coded by time (purple → yellow), and black markers indicate terminal positions. MePoly consistently produces diverse, distinct feasible routes that cover multiple classes (e.g., different goals and passages through slits/around obstacles), whereas baselines often collapse to a single mode or fail to represent alternative valid solutions in highly non-convex environments.
Figure 3: Myth of MePoly: The left-hand side shows that MePoly asymptotically approaches the sample distribution as the order increases, which validates Theorem 3.4 and Corollary 3.5. On the right-hand side, Legendre polynomials are replaced with standard polynomials; the samples become noticeably blurrier and less faithful, indicating that orthogonal bases are critical for stable and accurate manifold approximation.
Figure 4: Sample quality comparison: MePoly (ours) generates samples that closely follow the underlying curved support and preserve the global topology. In contrast, the baselines fail to capture the non-convex and multi-modal structures, resulting in significant geometric mismatch and a collapse to unimodal or fragmented approximations.
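One plausible numerical reason for the basis sensitivity reported in Figure 3 is conditioning: on $[-1, 1]$, high-order monomials become nearly collinear, while Legendre polynomials stay close to orthogonal. The sketch below compares the condition numbers of the two design matrices on a uniform grid. It assumes that the "standard polynomials" in the caption are plain monomials, and the grid, the orders, and the name `basis_condition_numbers` are hypothetical choices; it illustrates the general phenomenon and is not a reproduction of the paper's experiment.

```python
import numpy as np
from numpy.polynomial import legendre, polynomial


def basis_condition_numbers(order, n_points=200):
    """Condition numbers of monomial and Legendre design matrices on [-1, 1]."""
    x = np.linspace(-1.0, 1.0, n_points)
    mono = polynomial.polyvander(x, order)   # columns: 1, x, x^2, ..., x^order
    leg = legendre.legvander(x, order)       # columns: P_0(x), ..., P_order(x)
    return np.linalg.cond(mono), np.linalg.cond(leg)


for order in (4, 8, 16):
    c_mono, c_leg = basis_condition_numbers(order)
    print(f"order={order:2d}  monomial cond={c_mono:.2e}  Legendre cond={c_leg:.2e}")
```

A poorly conditioned basis means that small changes in the coefficients, for example from gradient noise, can produce large changes in the energy, which is consistent with the less faithful samples described in the caption.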

Theorems & Definitions (4)

Definition 3.1: Moment-Determinate Distribution [schmudgen2017moment]
Remark 3.3
Theorem 3.4: Asymptotic Approximation [borwein1991convergence, teng2025max]
Corollary 3.5: Universal Representational Capacity
