RAMAC: Multimodal Risk-Aware Offline Reinforcement Learning and the Role of Behavior Regularization

Kai Fukazawa, Kunal Mundada, Iman Soltani

TL;DR

RAMAC tackles risk-aware offline reinforcement learning with expressive multimodal policies by pairing a diffusion/flow-based actor with a distributional critic and a BC+CVaR objective. The CVaR term steers the policy explicitly away from tail risk, while strong behavior regularization keeps it on the data support, yielding improved CVaR with competitive mean returns and reduced OOD action rates. The authors back their approach with a geometric analysis showing forward-KL bounds on OOD and demonstrate empirical gains on Stochastic-D4RL tasks, including a diffusion-based variant (RADAC) and a flow-based variant (RAFMAC). This work advances safe, expressive offline control and highlights the importance of propagating tail risk directly through differentiable policy trajectories.

Abstract

In safety-critical domains where online data collection is infeasible, offline reinforcement learning (RL) offers an attractive alternative, but only if policies deliver high returns without incurring catastrophic lower-tail risk. Prior work on risk-averse offline RL achieves safety at the cost of value- or model-based pessimism and restricted policy classes that limit expressiveness, whereas diffusion/flow-based expressive generative policies trained with a behavioral-cloning (BC) objective have been used only in risk-neutral settings. Here, we address this gap by introducing the \textbf{Risk-Aware Multimodal Actor-Critic (RAMAC)}, which couples an expressive generative actor with a distributional critic and, to our knowledge, is the first model-free approach that learns \emph{risk-aware expressive generative policies}. RAMAC differentiates a composite objective that adds a Conditional Value-at-Risk (CVaR) term to a BC loss, achieving risk-sensitive learning in complex multimodal scenarios. Since out-of-distribution (OOD) actions are a major driver of catastrophic failures in offline RL, we further analyze OOD behavior under prior-anchored perturbation schemes from recent BC-regularized risk-averse offline RL. This clarifies why a behavior-regularized objective that directly constrains the expressive generative policy to the dataset support provides an effective, risk-agnostic mechanism for suppressing OOD actions in modern expressive policies. We instantiate RAMAC with a diffusion-based actor, using it both to illustrate the analysis in a 2-D risky bandit and to deploy OOD-action detectors on Stochastic-D4RL benchmarks, empirically validating our insights. Across these tasks, we observe consistent gains in $\mathrm{CVaR}_{0.1}$ while maintaining strong returns. Our implementation is available at GitHub: https://github.com/KaiFukazawa/RAMAC.git
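
To make the $\mathrm{CVaR}_{0.1}$ metric concrete, here is a minimal sketch of an empirical lower-tail CVaR estimator. It is an illustration only, not the authors' evaluation code: the function name `empirical_cvar` and the toy returns are assumptions, and the estimator simply averages the worst `ceil(alpha * n)` episode returns.

```python
import math


def empirical_cvar(returns, alpha=0.1):
    """Lower-tail CVaR: mean of the worst ceil(alpha * n) returns.

    Hypothetical helper for illustration; not taken from the RAMAC code base.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(returns)  # ascending, so the worst outcomes come first
    # The small epsilon guards against floating-point overshoot in alpha * n.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-12))
    tail = ordered[:k]
    return sum(tail) / k


# Toy data (assumed): nine good episodes and one catastrophic episode.
toy_returns = [10.0] * 9 + [-100.0]
mean_return = sum(toy_returns) / len(toy_returns)
cvar_01 = empirical_cvar(toy_returns, alpha=0.1)
print(f"mean = {mean_return}, CVaR_0.1 = {cvar_01}")
```

In this toy data the mean hides the catastrophic episode while the CVaR exposes it, which is the gap between risk-neutral and risk-aware evaluation that the abstract describes.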

Paper Structure

This paper contains 67 sections, 2 theorems, 41 equations, 14 figures, 6 tables, 1 algorithm.

Key Result

Lemma 4.1

Fix $s$ and write $I_s = \mathcal{S}_G(s)$ and $O_s = \mathbb{R}^d \setminus I_s$. Suppose there exist an anchor $b^\star \in I_s$ and a radius $\Phi > 0$ such that $\lambda\!(B_\Phi(b^\star)\cap O_s) > 0$, and the policy $\pi_{\text{anch}}(\cdot\mid s)$ induced by Eq. \ref{eq:perturbation} admits a density … Then its per-state OOD probability satisfies … In particular, as long as the density on $B_\Phi(b^\star)$ …

Figures (14)

  • Figure 1: RAMAC pipeline. From the offline buffer $\mathcal{D}$ (gray), the distributional critic $Z_\phi$ (green) fits the return law with a quantile loss and aggregates its lower tail into a CVaR signal. That signal is differentiated through the generative path of the actor $\pi_\theta$ (blue; diffusion or flow), which is trained with the composite objective $\mathcal{L}_{\pi}=\mathcal{L}_{\mathrm{BC}}+\eta\,\mathcal{L}_{\mathrm{Risk}}$ to shift mass away from low-quantile regions while staying on-manifold.
  • Figure 2: RAMAC learning dynamics (conceptual). Top: policy density $\pi_\theta(a\!\mid\!s)$ induced by the reparameterized actor $a=\psi_\theta(s,z)$ (Eq. \ref{eq:reparam_action}) over training. Bottom: critic return distribution $Z_\phi(s,a,\tau)$ with low quantiles highlighted (red); the actor is updated by the CVaR objective (Eqs. \ref{eq:cvar_def_keep}--\ref{eq:policy_loss_full}) while the critic is trained via the IQN loss (Eq. \ref{eq:critic_compact}). CVaR updates steer mass away from low-quantile regions while preserving multimodal high-reward modes.
  • Figure 3: Toy Risky Bandit results. Top: Ground truth consists of a safe center mode (yellow-green) and a risky ring where high-reward samples (yellow) are interspersed with catastrophic penalties (purple). Risk-neutral generative baselines concentrate on the risky ring or collapse topology. Bottom: Prior-anchored perturbation methods produce samples in the low-density inter-mode region, exhibiting OOD leakage. RADAC concentrates near the safe center without losing multimodality. See App. \ref{app:additional_toy} for more results.
  • Figure 4: Policy distributions for RADAC, ORAAC, and DiffusionQL; shaded bands indicate safe operational ranges (HalfCheetah: $v\!\le\!10$ for m-e, $v\!\le\!5$ for m-r; Hopper: $|\theta|\!\le\!0.1$; Walker2d: $|\theta|\!\le\!0.5$). RADAC reduces mass beyond thresholds.
  • Figure 5: Behavior cloning on the Risky Bandit dataset. Each panel shows i.i.d. samples from the BC policy. CVAE-BC mixes modes and places points in the low-density gap; Diffusion-BC reproduces both the outer ring and the central cluster; Flow-Matching BC yields a crisp ring but assigns less mass to the center.
  • ...and 9 more figures
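
The pipeline described in the Figure 1 caption can be sketched numerically as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the plain pinball loss stands in for the IQN loss the paper uses for the critic, the CVaR is approximated by a midpoint rule over quantile fractions in $(0, \alpha]$, and the risk term is assumed to be the negative CVaR so that minimizing $\mathcal{L}_{\pi}=\mathcal{L}_{\mathrm{BC}}+\eta\,\mathcal{L}_{\mathrm{Risk}}$ raises the lower tail.

```python
def pinball_loss(predicted_quantile, target_return, tau):
    """Standard quantile (pinball) loss at fraction tau (simplified critic loss)."""
    u = target_return - predicted_quantile
    return max(tau * u, (tau - 1.0) * u)


def cvar_from_quantiles(quantile_fn, alpha=0.1, n_taus=32):
    """Approximate CVaR_alpha = (1/alpha) * integral_0^alpha q(tau) dtau.

    Uses a midpoint rule over n_taus fractions in (0, alpha]; quantile_fn is a
    stand-in for a learned critic's quantile output at a fixed (s, a).
    """
    taus = [alpha * (i + 0.5) / n_taus for i in range(n_taus)]
    return sum(quantile_fn(t) for t in taus) / n_taus


def policy_loss(bc_loss, cvar_value, eta=1.0):
    """Composite objective L_pi = L_BC + eta * L_Risk, assuming L_Risk = -CVaR."""
    risk_loss = -cvar_value
    return bc_loss + eta * risk_loss


# Toy critic (assumed): returns uniform on [-50, 50], so q(tau) = 100 * tau - 50.
def toy_quantile(tau):
    return 100.0 * tau - 50.0


cvar = cvar_from_quantiles(toy_quantile, alpha=0.1)
loss = policy_loss(bc_loss=0.2, cvar_value=cvar, eta=0.1)
print(f"CVaR_0.1 ~ {cvar:.3f}, composite loss ~ {loss:.3f}")
```

In an actual actor-critic setup the CVaR value would depend on the action produced by the generative actor, and its gradient would flow back through the diffusion or flow sampling path; the scalar arithmetic here only shows how the two loss terms combine.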

Theorems & Definitions (4)

  • Lemma 4.1
  • Proposition 1
  • proof
  • proof