
ROBOGATE: Adaptive Failure Discovery for Safe Robot Policy Deployment via Two-Stage Boundary-Focused Sampling

Byungjin Kim

Abstract

Deploying learned robot manipulation policies in industrial settings requires rigorous pre-deployment validation, yet exhaustive testing across high-dimensional parameter spaces is intractable. We present ROBOGATE, a deployment risk management framework that combines physics-based simulation with a two-stage adaptive sampling strategy to efficiently discover failure boundaries in the operational parameter space. Stage 1 employs Latin Hypercube Sampling (LHS) across an 8-dimensional parameter space to establish a coarse failure landscape from 20,000 uniformly distributed experiments. Stage 2 applies boundary-focused sampling that concentrates 10,000 additional experiments in the transition zone where the success rate lies between 30% and 70%, enabling precise mapping of the failure boundary. Using NVIDIA Isaac Sim with Newton physics, we evaluate a scripted pick-and-place controller on two robot embodiments, Franka Panda (7-DOF) and UR5e (6-DOF), across 30,000 total experiments. Our logistic regression risk model achieves an AUC of 0.780 on the combined dataset (vs. 0.754 for Stage 1 alone), yields a closed-form failure boundary equation, and reveals four universal danger zones affecting both robot platforms. We further demonstrate the framework on VLA (Vision-Language-Action) model evaluation, where Octo-Small achieves a 0.0% success rate on 68 adversarial scenarios versus 100% for the scripted baseline; this 100-percentage-point gap underscores the challenge of deploying foundation models in industrial settings. ROBOGATE is open-source and runs on a single GPU workstation.
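The two-stage sampling idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not ROBOGATE's implementation: all 8 parameters are normalized to [0, 1], the Stage 2 transition zone is located with a k-nearest-neighbor estimate of the local success rate (an assumed criterion), and new points are drawn as Gaussian perturbations around transition-zone seeds. The helper names (latin_hypercube, local_success_rate, boundary_focused) and the toy_trial function, a logistic in two of the eight dimensions standing in for an Isaac Sim rollout, are hypothetical.

```python
import heapq
import math
import random


def latin_hypercube(n, dims, rng):
    """Stage 1: n points in [0, 1)^dims with exactly one point per stratum in each dimension."""
    columns = []
    for _ in range(dims):
        strata = list(range(n))
        rng.shuffle(strata)  # independent random stratum order per dimension
        columns.append([(s + rng.random()) / n for s in strata])
    return [[columns[d][i] for d in range(dims)] for i in range(n)]


def local_success_rate(points, outcomes, query, k):
    """k-nearest-neighbor success-rate estimate around `query` (includes the query itself if present)."""
    nearest = heapq.nsmallest(k, ((math.dist(p, query), o) for p, o in zip(points, outcomes)))
    return sum(o for _, o in nearest) / len(nearest)


def boundary_focused(points, outcomes, n_new, rng, k=10, lo=0.3, hi=0.7, sigma=0.05):
    """Stage 2: draw n_new points around Stage 1 points whose estimated success rate is in [lo, hi]."""
    seeds = [p for p in points if lo <= local_success_rate(points, outcomes, p, k) <= hi]
    if not seeds:
        raise ValueError("no points found in the transition zone")
    new_points = []
    for _ in range(n_new):
        seed = rng.choice(seeds)
        # Gaussian jitter around a transition-zone seed, clipped back into the unit cube.
        new_points.append([min(1.0, max(0.0, x + rng.gauss(0.0, sigma))) for x in seed])
    return new_points


def toy_trial(x, rng):
    """Hypothetical stand-in for one simulated trial: x[0] ~ friction, x[1] ~ mass (normalized)."""
    p_success = 1.0 / (1.0 + math.exp(-(8.0 * x[0] - 6.0 * x[1] - 1.0)))
    return 1 if rng.random() < p_success else 0


rng = random.Random(0)
stage1 = latin_hypercube(800, 8, rng)                  # coarse, space-filling sweep
outcomes = [toy_trial(x, rng) for x in stage1]         # 1 = success, 0 = failure
stage2 = boundary_focused(stage1, outcomes, 400, rng)  # densify the 30-70% success-rate zone
```

The sizes are scaled down from the paper's 20,000 + 10,000 so the sketch runs quickly. Nearest-neighbor estimates become noisy in 8 dimensions when only a few parameters matter; a fitted surrogate (for example a logistic risk model) is a natural replacement for local_success_rate.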

Paper Structure (51 sections, 8 equations, 5 figures, 13 tables, 2 algorithms)

Figures (5)

  • Figure 1: RoboGate two-stage adaptive sampling pipeline. Stage 1 performs uniform Latin Hypercube Sampling across the 8D parameter space (20K experiments: Franka 10K + UR5e 10K). Stage 2 concentrates 10K boundary-focused experiments in the 30-70% success rate transition zone identified from Stage 1 results.
  • Figure 2: Cross-robot comparison between Franka Panda and UR5e. (a) Failure mode distribution: UR5e exhibits only grasp_miss failures, owing to its suction gripper design. (b) Per-parameter success rate comparison on shared dimensions, showing that UR5e consistently outperforms Franka.
  • Figure 3: Success rate heatmap across the friction $\times$ mass parameter plane (Franka, 20K experiments). The dashed white curve shows the logistic regression decision boundary (SR = 50%). Low friction and high mass regions (lower-left) exhibit near-zero success rates.
  • Figure 4: Failure boundary in friction-mass space (Franka 20K). Green dots: success, red dots: failure (3K subsample shown for clarity). The solid curve shows the logistic decision boundary $\mu^*(m)$. The region left of the boundary (low friction) is the danger zone.
  • Figure 5: ROC curve for the logistic regression failure prediction model. The combined 20K model (AUC = 0.780) outperforms the Stage 1-only model (AUC = 0.754), demonstrating the value of boundary-focused sampling for risk model training.
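Figures 3 to 5 rest on a logistic regression risk model whose 50% contour gives the boundary $\mu^*(m)$. The sketch below shows the generic mechanics under stated assumptions: the model uses only friction and mass (if the full model uses more of the 8 parameters, holding those at fixed values folds their terms into the intercept), it is fit by plain gradient ascent, and the data are synthetic. Setting the logit $w_0 + w_\mu \mu + w_m m$ to zero gives $\mu^*(m) = -(w_0 + w_m m)/w_\mu$. The function names and data are hypothetical. The sketch models P(success); P(failure) = 1 - P(success) has the same 50% boundary and the same AUC when failure is treated as the positive class. It does not reproduce the reported AUC values.

```python
import math
import random


def fit_logistic(xs, ys, lr=2.0, epochs=400):
    """Fit P(success | x) = sigmoid(w0 + w1*x1 + ...) by full-batch gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(xs[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
            err = y - 1.0 / (1.0 + math.exp(-z))  # residual y - p
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(xs) for wj, g in zip(w, grad)]
    return w


def predict(w, x):
    """Predicted success probability under weights w."""
    return 1.0 / (1.0 + math.exp(-(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))))


def auc(scores, labels):
    """ROC AUC: probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def friction_boundary(w, mass):
    """mu*(m): friction at which predicted success is 50%, i.e. w0 + w_mu*mu + w_m*m = 0."""
    w0, w_mu, w_m = w
    return -(w0 + w_m * mass) / w_mu


# Hypothetical toy data: (friction, mass) in [0, 1]^2 with true logit 8*mu - 6*m - 1.
rng = random.Random(0)
xs = [(rng.random(), rng.random()) for _ in range(1000)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(8 * mu - 6 * m - 1))) else 0 for mu, m in xs]

w = fit_logistic(xs, ys)
model_auc = auc([predict(w, x) for x in xs], ys)  # in-sample, for illustration only
boundary = [friction_boundary(w, m) for m in (0.2, 0.5, 0.8)]
```

With a positive friction weight and a negative mass weight, $\mu^*(m)$ increases with mass: heavier objects need more friction to stay above 50% predicted success, which matches the low-friction danger zone described for Figure 4.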