CAPER: Constrained and Procedural Reasoning for Robotic Scientific Experiments
Jinghan Yang, Jingyi Hou, Xinbo Yu, Wei He, Yifan Wu
TL;DR
CAPER addresses the challenge of reliable, long-horizon robotic experiments in scientific labs by enforcing procedural correctness through a constrained, responsibility-separated architecture that decouples task-level symbolic planning, mid-level multimodal grounding, and low-level control. It formalizes this separation with a symbolic plan $\mathcal{S}^\star$, a diffusion-based visual predictor, and a vision-language model (VLM) that maps subtasks to action primitives executed by a reinforcement-learning controller. The task-level planner uses Chain-of-Thought reasoning with a verification–correction loop to produce a logically consistent symbolic plan, while the mid- and low-level modules provide robust grounding and execution under perceptual and physical uncertainty. Across a scientific workflow benchmark and a public long-horizon dataset, CAPER achieves higher success and procedural correctness, particularly in low-data regimes, and demonstrates practical sim-to-real transfer with modular components. This modular, constrained approach offers a practical alternative to end-to-end vision-language-action models for robot-assisted scientific experiments with improved controllability and data efficiency.
Abstract
Robotic assistance in scientific laboratories requires procedurally correct long-horizon manipulation, reliable execution under limited supervision, and robustness in low-demonstration regimes. Such conditions greatly challenge end-to-end vision-language-action (VLA) models, whose assumptions of recoverable errors and data-driven policy learning often break down in protocol-sensitive experiments. We propose CAPER, a framework for Constrained And ProcEdural Reasoning for robotic scientific experiments, which explicitly restricts where learning and reasoning occur in the planning and control pipeline. Rather than strengthening end-to-end policies, CAPER enforces a responsibility-separated structure: task-level reasoning generates procedurally valid action sequences under explicit constraints, mid-level multimodal grounding realizes subtasks without delegating spatial decision-making to large language models, and low-level control adapts to physical uncertainty via reinforcement learning with minimal demonstrations. By encoding procedural commitments through interpretable intermediate representations, CAPER prevents execution-time violations of experimental logic, improving controllability, robustness, and data efficiency. Experiments on a scientific workflow benchmark and a public long-horizon manipulation dataset demonstrate consistent improvements in success rate and procedural correctness, particularly in low-data and long-horizon settings.
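To make the idea of a verification–correction loop over a symbolic plan more concrete, the following is a minimal illustrative sketch, not CAPER's actual planner. It assumes a plan is a list of uniquely named steps and that procedural constraints are simple precedence pairs; the names `verify`, `correct`, `verify_and_correct`, and the example step labels are all hypothetical. In the described system the draft plan would come from Chain-of-Thought reasoning, whereas here a hard-coded draft stands in for it.

```python
from typing import List, Tuple

# Hypothetical precedence constraint: `before` must occur earlier than `after`.
Constraint = Tuple[str, str]


def verify(plan: List[str], constraints: List[Constraint]) -> List[Constraint]:
    """Return every precedence constraint that the plan violates."""
    violations = []
    for before, after in constraints:
        if after not in plan:
            continue  # nothing to protect if the dependent step is absent
        if before not in plan or plan.index(before) > plan.index(after):
            violations.append((before, after))
    return violations


def correct(plan: List[str], violation: Constraint) -> List[str]:
    """Repair one violation by placing `before` directly ahead of `after`."""
    before, after = violation
    fixed = [step for step in plan if step != before]
    fixed.insert(fixed.index(after), before)
    return fixed


def verify_and_correct(
    plan: List[str], constraints: List[Constraint], max_rounds: int = 10
) -> Tuple[List[str], bool]:
    """Alternate verification and correction until the plan is consistent.

    Returns the final plan and a flag saying whether it satisfies all
    constraints (False if the round budget ran out, e.g. for cyclic constraints).
    """
    for _ in range(max_rounds):
        violations = verify(plan, constraints)
        if not violations:
            return plan, True
        plan = correct(plan, violations[0])
    return plan, not verify(plan, constraints)


# Assumed toy protocol: the bottle must be opened before pouring, and
# the reagent must be poured before stirring.
draft = ["pour_reagent", "open_bottle", "stir", "close_bottle"]
rules = [("open_bottle", "pour_reagent"), ("pour_reagent", "stir")]
final_plan, ok = verify_and_correct(draft, rules)
print(final_plan, ok)
```

The point of the sketch is the separation of roles: the verifier only reports violations of explicit constraints, the corrector only repairs them, and the loop terminates either with a consistent plan or with an explicit failure flag rather than silently executing an invalid sequence.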
