GEM: Guided Expectation-Maximization for Behavior-Normalized Candidate Action Selection in Offline RL

Haoyu Wang, Jingcheng Wang, Shunyu Wu, Xinwei Xiao

Abstract

Offline reinforcement learning (RL) can fit strong value functions from fixed datasets, yet reliable deployment still hinges on the action selection interface used to query them. When the dataset induces a branched or multimodal action landscape, unimodal policy extraction can blur competing hypotheses and yield "in-between" actions that are weakly supported by data, making decisions brittle even with a strong critic. We introduce GEM (Guided Expectation-Maximization), an analytical framework that makes action selection both multimodal and explicitly controllable. GEM trains a Gaussian Mixture Model (GMM) actor via critic-guided, advantage-weighted EM-style updates that preserve distinct components while shifting probability mass toward high-value regions, and learns a tractable GMM behavior model to quantify support. During inference, GEM performs candidate-based selection: it generates a parallel candidate set and reranks actions using a conservative ensemble lower-confidence bound together with behavior-normalized support, where the behavior log-likelihood is standardized within each state's candidate set to yield stable, comparable control across states and candidate budgets. Empirically, GEM is competitive across D4RL benchmarks, and offers a simple inference-time budget knob (candidate count) that trades compute for decision quality without retraining.
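
The abstract does not spell out the actor update, so the following is only a minimal sketch of what one advantage-weighted EM-style step could look like for a state-independent, one-dimensional Gaussian mixture. The function name `advantage_weighted_em_step`, the exponential weighting `exp(advantage / temperature)`, the `temperature` parameter, and the variance floor `min_var` are all assumptions made for illustration; GEM's actual actor is conditioned on the state and may use a different weighting.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def advantage_weighted_em_step(actions, advantages, weights, means, variances,
                               temperature=1.0, min_var=1e-4):
    """One EM-style update of a 1-D GMM with advantage-based sample weights.

    Illustrative assumption: each sample is weighted by exp(advantage / temperature).
    """
    # Sample weights, shifted by the max advantage for numerical stability.
    a_max = max(advantages)
    w = [math.exp((adv - a_max) / temperature) for adv in advantages]
    k = len(weights)

    # E-step: responsibilities of each component for each action.
    resp = []
    for x in actions:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        # Fall back to uniform responsibilities if every density underflows.
        resp.append([p / total if total > 0.0 else 1.0 / k for p in joint])

    # M-step: weighted sufficient statistics, one component at a time.
    total_w = sum(w)
    new_weights, new_means, new_vars = [], [], []
    for j in range(k):
        mass = sum(wi * r[j] for wi, r in zip(w, resp))
        new_weights.append(mass / total_w)
        if mass < 1e-12:
            # Component received no weighted data: keep its old parameters.
            new_means.append(means[j])
            new_vars.append(variances[j])
            continue
        mean = sum(wi * r[j] * x for wi, r, x in zip(w, resp, actions)) / mass
        var = sum(wi * r[j] * (x - mean) ** 2 for wi, r, x in zip(w, resp, actions)) / mass
        new_means.append(mean)
        new_vars.append(max(var, min_var))
    return new_weights, new_means, new_vars
```

In this toy setting, components keep their own means (the modes are not averaged together), while the mixing weights move toward the component whose samples carry higher advantages. This mirrors, in a simplified form, the abstract's description of preserving distinct components while shifting probability mass toward high-value regions.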

Paper Structure (134 sections, 88 equations, 19 figures, 13 tables, 2 algorithms)


Figures (19)

  • Figure 1: GEM mechanism (schematic at a fixed state in a 2-D action space). Training (top row): GEM learns three ingredients that will be combined only at test time: (i) a multimodal actor $\pi_\theta(a\mid s)$ (left; shown as $\log\pi_\theta(a\mid s)$) that preserves multiple plausible action hypotheses, (ii) an independent behavior density $\mu_\varphi(a\mid s)$ (middle; shown as $\log\mu_\varphi(a\mid s)$) used to quantify dataset support, and (iii) a conservative value statistic $\mathrm{LCB}_\lambda(s,a)$ from a critic ensemble (right). Inference (bottom row): given a queried state $s$, GEM samples $N$ candidates in parallel and adds a deterministic anchor (triangle), scores each candidate using Eq. \ref{eq:score} (conservative value plus behavior-normalized support), and executes the top-ranked action $a^\star$ (star). The green cluster highlights the small subset of candidates that remain competitive after scoring, illustrating how the interface filters many proposals down to a support-aligned, high-value choice under maximization.
  • Figure 2: Suite-level scaling with candidate budget $N$ (Score on top). We report suite means for normalized score and the two deployment audits as functions of $N$. Environment-level breakdowns are in Appendix \ref{app:env_breakdowns}.
  • Figure 3: Deployment compute profile under the shared measurement harness (latency and peak memory) across all logged methods and sweep settings.
  • Figure 4: NLL-gap diagnostic summary. Larger positive gaps indicate stronger mixture benefits over a top-1 unimodal proxy.
  • Figure 5: Locomotion suite: per-environment candidate-budget scaling for Score.
  • ...and 14 more figures
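
The inference procedure described in the abstract and in the Figure 1 caption (score each candidate by a conservative value plus behavior-normalized support, then execute the top-ranked action) can be sketched as follows. This is a hedged illustration rather than the paper's Eq. (score), which is not reproduced in this excerpt. The additive combination with a weight `beta`, the lower-confidence bound taken as mean minus `lam` times the ensemble standard deviation, and the names `lcb`, `standardize`, and `select_action` are assumptions. The critic outputs and behavior log-likelihoods are passed in as plain lists, so no particular model is implied.

```python
import statistics


def lcb(q_values, lam):
    """Assumed lower-confidence bound: ensemble mean minus lam times ensemble std."""
    return statistics.fmean(q_values) - lam * statistics.pstdev(q_values)


def standardize(values, eps=1e-8):
    """Z-score the values within one state's candidate set."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def select_action(candidates, ensemble_q, behavior_logp, lam=1.0, beta=0.5):
    """Rerank candidates and return (index of the best candidate, all scores).

    candidates    : list of candidate actions (sampled proposals plus any anchor)
    ensemble_q    : ensemble_q[i] is the list of critic estimates for candidate i
    behavior_logp : behavior_logp[i] is log mu(a_i | s) under the behavior model
    lam, beta     : hypothetical trade-off weights, not taken from the paper
    """
    values = [lcb(q, lam) for q in ensemble_q]
    support = standardize(behavior_logp)
    scores = [v + beta * z for v, z in zip(values, support)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores
```

Because the behavior log-likelihood is standardized within the candidate set, the support term has zero mean and roughly unit scale for every state. This is one way to read the abstract's claim that the trade-off stays comparable across states and candidate budgets. Increasing the number of candidates only lengthens the input lists and requires no retraining.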