Learning to Generate All Feasible Actions

Mirco Theile; Daniele Bernardini; Raphael Trumpp; Cristina Piazza; Marco Caccamo; Alberto L. Sangiovanni-Vincentelli

TL;DR

This work tackles safe reinforcement learning under hard constraints by introducing action mapping, a two-stage framework that first learns a feasibility policy to generate all feasible actions and then learns an objective policy to select among them. Training the feasibility policy is cast as a distribution-matching problem: the policy should induce a uniform distribution over the feasible action set $\mathcal{A}_s^+$. It is optimized with gradient estimators for $f$-divergences that rely on KDE-based density estimates and importance sampling. The authors demonstrate the approach on a simple 2D illustrative example, a spline-based path-planning scenario, and a robotic grasping setup, showing that the feasibility policy can cover disconnected feasible regions and produce multiple action modes. The results indicate that focusing learning on the feasibility step can improve safety and data efficiency while letting later objective optimization operate within the feasible region, which has practical implications for deploying RL in safety-critical robotic and cyber-physical systems. Future work includes scaling to higher-dimensional action spaces, adaptive density estimation, and integrating the feasibility policy with end-to-end RL algorithms for real-time control.
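
To make the distribution-matching idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a 2D action space $[-1, 1]^2$, a hypothetical feasibility model (a single disc), and a Gaussian KDE with a fixed bandwidth. It only evaluates a forward KL divergence between the uniform target over the feasible set and the KDE of the policy's samples, using self-normalized importance sampling with a uniform proposal; the gradient estimators mentioned above are not reproduced here. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

AREA_ACTION_SPACE = 4.0  # area of the assumed action space [-1, 1]^2


def feasibility_model(actions, radius=0.5):
    """Hypothetical feasibility model g(a): True inside a disc, False outside."""
    return np.linalg.norm(actions, axis=-1) <= radius


def kde_density(queries, samples, bandwidth):
    """Gaussian KDE of the policy density, evaluated at the query points."""
    dim = samples.shape[1]
    sq_dist = np.sum((queries[:, None, :] - samples[None, :, :]) ** 2, axis=-1)
    norm = (2.0 * np.pi * bandwidth**2) ** (dim / 2.0)
    return np.exp(-0.5 * sq_dist / bandwidth**2).mean(axis=1) / norm


def forward_kl(policy_samples, rng, n_proposals=2000, bandwidth=0.1):
    """Estimate KL(q || pi), where q is uniform over the feasible set.

    Proposals are drawn uniformly over the action space. With self-normalized
    importance weights proportional to g(a), the expectation under q reduces
    to an average over the feasible proposals.
    """
    proposals = rng.uniform(-1.0, 1.0, size=(n_proposals, 2))
    g = feasibility_model(proposals)
    feasible_area = AREA_ACTION_SPACE * g.mean()  # Monte Carlo area estimate
    log_q = -np.log(feasible_area)                # log density of uniform target
    log_pi = np.log(kde_density(proposals[g], policy_samples, bandwidth) + 1e-12)
    return float(np.mean(log_q - log_pi))


rng = np.random.default_rng(0)

# Stand-in "policy" that covers the feasible disc uniformly.
radius = 0.5 * np.sqrt(rng.uniform(size=500))
angle = rng.uniform(0.0, 2.0 * np.pi, size=500)
covering = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=-1)

# Stand-in "policy" that collapsed onto a small part of the feasible disc.
collapsed = rng.normal(loc=[0.3, 0.0], scale=0.02, size=(500, 2))

kl_covering = forward_kl(covering, rng)
kl_collapsed = forward_kl(collapsed, rng)
print(f"KL covering: {kl_covering:.3f}, KL collapsed: {kl_collapsed:.3f}")
```

In this sketch the collapsed sampler is penalized heavily because the forward KL averages over the whole feasible set, which illustrates why a divergence to the uniform target rewards coverage of all feasible actions.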

Abstract

Modern cyber-physical systems are becoming increasingly complex to model, thus motivating data-driven techniques such as reinforcement learning (RL) to find appropriate control agents. However, most systems are subject to hard constraints such as safety or operational bounds. Typically, to learn to satisfy these constraints, the agent must violate them systematically, which is computationally prohibitive in most systems. Recent efforts use feasibility models, which assess whether a proposed action is feasible, to avoid applying the agent's infeasible action proposals to the system. However, these efforts focus on guaranteeing constraint satisfaction rather than on the agent's learning efficiency. To improve the learning process, we introduce action mapping, a novel approach that divides the learning process into two steps: first learning feasibility and subsequently learning the objective by mapping actions into the sets of feasible actions. This paper focuses on the feasibility part by learning to generate all feasible actions through self-supervised querying of the feasibility model. We train the agent by formulating the problem as a distribution matching problem and deriving gradient estimators for different divergences. Through an illustrative example, a robotic path planning scenario, and a robotic grasping simulation, we demonstrate the agent's proficiency in generating actions across disconnected feasible action sets. By addressing the feasibility step, this paper makes it possible to focus future work on the objective part of action mapping, paving the way for an RL framework that is both safe and efficient.
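
The two-step structure of action mapping can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the feasible set is a hypothetical annulus, the "feasibility policy" is a hand-written map from a latent value to a feasible action (standing in for a learned network), and the "objective policy" is a plain random search over latent values (standing in for an RL agent). All names and constants are invented for the example.

```python
import numpy as np

R_IN, R_OUT = 0.5, 1.0  # assumed feasible set: an annulus around the origin


def feasibility_model(actions, tol=1e-9):
    """Hypothetical feasibility check; tol absorbs floating-point rounding."""
    r = np.linalg.norm(actions, axis=-1)
    return (r >= R_IN - tol) & (r <= R_OUT + tol)


def feasibility_policy(z):
    """Hand-written stand-in for a learned feasibility policy.

    Maps latent values z in [0, 1]^2 to actions inside the annulus. The
    square-root radius makes the mapping uniform over the annulus area.
    """
    z = np.atleast_2d(z)
    r = np.sqrt(R_IN**2 + z[:, 0] * (R_OUT**2 - R_IN**2))
    theta = 2.0 * np.pi * z[:, 1]
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)


def objective_policy(reward_fn, rng, n_candidates=512):
    """Stand-in objective step: search the latent space, never the raw actions."""
    z = rng.uniform(0.0, 1.0, size=(n_candidates, 2))
    actions = feasibility_policy(z)
    best = int(np.argmax(reward_fn(actions)))
    return z[best], actions[best], actions


rng = np.random.default_rng(0)
target = np.array([0.0, 0.0])  # the unconstrained optimum is infeasible here


def reward(actions):
    """Toy objective: negative squared distance to the target."""
    return -np.sum((actions - target) ** 2, axis=-1)


z_star, a_star, candidates = objective_policy(reward, rng)
print("selected latent:", z_star)
print("selected action:", a_star, "feasible:", bool(feasibility_model(a_star)))
```

Because the objective step only chooses latent values, every action it can propose is already feasible, so the search settles near the inner boundary of the annulus instead of the infeasible target.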

Paper Structure (24 sections, 19 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Optimization Problem
Action Mapping
Feasibility Policy
Methodology
f-Divergence
Gradient Estimation
Training Process
Actor-Critic
Illustrative Example
Problem
Results
Feasible Trajectory Segments Example
Problem
...and 9 more sections

Figures (9)

Figure 1: Illustrative example showing two feasibility models, which specify feasible regions as the union of three random circles (b) or annuli (c). Three states (1)-(3) are shown for each example, solved with the JS, FKL, and RKL divergence, with feasible and infeasible action space in white and black, respectively. The colored points are actions generated by the feasibility policy when using the corresponding latent space value $(z_x, z_y) \in \mathcal{Z}$ in (a).
Figure 2: Quadratic spline action space application showing three different maps: a randomly generated map in (a) and (b) and two handcrafted maps in (c) and (d). In (a), example splines are shown with green indicating a feasible spline and red indicating an infeasible one. An example for each constraint violation is given. In (b)-(d), the agent generates 256 actions that are displayed with the color depending on the feasibility of each proposed action.
Figure 3: Feasible gripper positions (red) for different variations of the shapes (H-shape (a+b), 8-shape (c+d), Spoon (e+f), and T-shape (g+h)) used in training, with a detailed view of the area between the gripper to the right of each figure.
Figure 4: Different distortions are applied, showing a colored chess board for illustration and an example shape under all distortions.
Figure 5: Before processing, the image is embedded (in gray) and augmented with positional encoding, resulting in 32 total channels. After positional encoding, a convolutional layer with stride 3, followed by 7 residual blocks (in yellow) with a bottleneck, preprocesses the state. The output is processed by 3 layers of "pixel-wise" shared MLPs (in brown), with the features being concatenated with a latent input (in purple) of length $d$. The latent input is a random sample from $\mathcal{Z}$ for the actor and the action to be evaluated for the critic. Four (for the actor) or three (for the critic) fully connected layers (in blue) output the action and the feasibility estimate, respectively.
...and 4 more figures
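
The feasibility models described in the Figure 1 caption (unions of three random circles or annuli) can be sketched as below. This is a minimal illustration: the sampling ranges for centers and radii, the inner-to-outer radius ratio, and the action space $[-1, 1]^2$ are assumptions, not values taken from the paper.

```python
import numpy as np


def make_circles(rng, n_shapes=3):
    """Draw random circle centers and radii (ranges are assumed, not sourced)."""
    centers = rng.uniform(-0.6, 0.6, size=(n_shapes, 2))
    outer = rng.uniform(0.15, 0.4, size=n_shapes)
    inner = np.zeros(n_shapes)  # a circle is an annulus with inner radius 0
    return centers, inner, outer


def is_feasible(actions, shapes):
    """An action is feasible if it lies inside at least one shape."""
    centers, inner, outer = shapes
    dist = np.linalg.norm(actions[:, None, :] - centers[None, :, :], axis=-1)
    return np.any((dist >= inner) & (dist <= outer), axis=1)


rng = np.random.default_rng(0)
circles = make_circles(rng)
# Annuli variant: same centers and outer radii, with an assumed hole of half the radius.
annuli = (circles[0], 0.5 * circles[2], circles[2])

# Monte Carlo estimate of the feasible fraction of the action space.
actions = rng.uniform(-1.0, 1.0, size=(10000, 2))
feasible_circles = is_feasible(actions, circles)
feasible_annuli = is_feasible(actions, annuli)
print(f"feasible fraction (circles): {feasible_circles.mean():.3f}")
print(f"feasible fraction (annuli):  {feasible_annuli.mean():.3f}")
```

Depending on the draw, the shapes may overlap or stay apart, which is how such a model can produce the disconnected feasible regions the feasibility policy has to cover.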
