From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
Abdelhakim Benechehab, Gabriel Singer, Corentin Léger, Youssef Attia El Hili, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
TL;DR
This paper tackles the mismatch between maximum likelihood estimation and the quality of generative models by proposing a bilevel framework that learns an implicit reward from data to guide policy-gradient (PG) training. The inner level optimizes a policy gradient objective under a learned reward, while the outer level maximizes the data-driven log-likelihood, effectively aligning PG with observed data. The authors provide a tractable Gaussian analysis showing that the optimal reward is $\mathrm{U}^* = \frac{\lambda}{2} \Sigma^{-1}$, and connect PG under this reward to reverse KL minimization. They also propose two practical solvers (one heuristic, one based on implicit differentiation) and validate them empirically on tabular classification and model-based RL. They demonstrate that PG with an optimally learned reward can match or surpass NLL while improving moment matching, offering a general, scalable approach to applying PG to a wider range of MLE tasks. The work also contributes open-source code to facilitate reproducibility and extension to broader domains such as LLM fine-tuning and structured prediction.
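The Gaussian result quoted above can be checked numerically in a small sketch. The setup here is an assumption, not taken from the paper: a zero-mean Gaussian policy $\mathcal{N}(0, S)$, a quadratic reward $r_U(x) = -x^\top U x$, and an entropy regularizer with weight $\lambda$. Under these assumptions the inner objective is $-\mathrm{tr}(US) + \frac{\lambda}{2}\log\det S$ up to a constant, its maximizer is $S = \frac{\lambda}{2} U^{-1}$, and choosing $U = \frac{\lambda}{2}\Sigma^{-1}$ recovers the data covariance $\Sigma$. The function names are hypothetical and the paper's exact formulation may differ.

```python
import numpy as np

def inner_objective(S, U, lam):
    """Entropy-regularized expected reward for a policy N(0, S).

    Assumed reward: r_U(x) = -x^T U x, so E[r_U] = -tr(U S).
    The Gaussian entropy equals 0.5 * logdet(S) up to an additive constant.
    """
    _, logdet = np.linalg.slogdet(S)
    return -np.trace(U @ S) + 0.5 * lam * logdet

def inner_solution(U, lam):
    """Closed-form maximizer: -U + (lam/2) S^{-1} = 0 gives S = (lam/2) U^{-1}."""
    return 0.5 * lam * np.linalg.inv(U)

def reverse_kl(S, Sigma):
    """KL( N(0, S) || N(0, Sigma) ), the policy-to-data (reverse) direction."""
    d = S.shape[0]
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    return 0.5 * (np.trace(np.linalg.solve(Sigma, S)) - d + logdet_Sigma - logdet_S)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)   # data covariance (symmetric positive definite)
lam = 0.7                           # assumed entropy weight

U_star = 0.5 * lam * np.linalg.inv(Sigma)   # reward from the quoted result
S_star = inner_solution(U_star, lam)

# The policy that is optimal under U_star matches the data covariance.
assert np.allclose(S_star, Sigma)

# Under U_star, objective + lam * reverse KL is the same constant for any S,
# so maximizing the objective is equivalent to minimizing the reverse KL.
B = rng.normal(size=(3, 3))
S_other = B @ B.T + np.eye(3)
c1 = inner_objective(S_star, U_star, lam) + lam * reverse_kl(S_star, Sigma)
c2 = inner_objective(S_other, U_star, lam) + lam * reverse_kl(S_other, Sigma)
assert np.isclose(c1, c2)
```

In this toy setting the factor $\frac{\lambda}{2}$ comes directly from the gradient of the entropy term, which is one way to read why the optimal reward scales with the regularization weight.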
Abstract
Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work has highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting, compared to Reinforcement Learning techniques such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .
