CAPER: Constrained and Procedural Reasoning for Robotic Scientific Experiments

Jinghan Yang, Jingyi Hou, Xinbo Yu, Wei He, Yifan Wu

TL;DR

CAPER addresses the challenge of reliable, long-horizon robotic experiments in scientific labs by enforcing procedural correctness through a constrained, responsibility-separated architecture that decouples task-level symbolic planning, mid-level multimodal grounding, and low-level control. It formalizes this separation with a symbolic plan $\mathcal{S}^\star$, a diffusion-based visual predictor, and a VLM that maps subtasks to action primitives executed by a reinforcement-learning controller. The task-level planner uses Chain-of-Thought reasoning with a verification–correction loop to produce a logically consistent symbolic plan, while the mid- and low-level modules provide robust grounding and execution under perceptual and physical uncertainty. Across a scientific workflow benchmark and a public long-horizon dataset, CAPER achieves higher success and procedural correctness, particularly in low-data regimes, and demonstrates practical sim-to-real transfer with modular components. This modular, constrained approach offers a practical alternative to end-to-end vision-language-action models for robot-assisted scientific experiments with improved controllability and data efficiency.
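The verification–correction loop mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names (`plan_with_verification`, `toy_propose`, `toy_verify`), the precedence-style constraint, and the retry budget are all hypothetical, and the proposer is a stub standing in for an LLM call.

```python
from typing import Callable, List, Optional

Plan = List[str]

def plan_with_verification(
    goal: str,
    propose: Callable[[str, Optional[Plan], List[str]], Plan],
    verify: Callable[[Plan], List[str]],
    max_rounds: int = 3,
) -> Plan:
    """Propose a plan, check it, and re-propose with the violations as feedback."""
    plan: Optional[Plan] = None
    violations: List[str] = []
    for _ in range(max_rounds):
        plan = propose(goal, plan, violations)
        violations = verify(plan)
        if not violations:
            return plan  # plan satisfies every checked constraint
    raise ValueError(f"no valid plan after {max_rounds} rounds: {violations}")

# Hypothetical constraint set: each pair (a, b) means "a must come before b".
PRECEDENCE = [("open_lid", "pour_reagent")]

def toy_verify(plan: Plan) -> List[str]:
    """Return a human-readable message for every violated precedence pair."""
    issues = []
    for first, second in PRECEDENCE:
        if first in plan and second in plan and plan.index(first) > plan.index(second):
            issues.append(f"{first} must precede {second}")
    return issues

def toy_propose(goal: str, previous: Optional[Plan], violations: List[str]) -> Plan:
    """Stub proposer: starts with a wrong order, then applies a naive fix."""
    if previous is None:
        return ["pour_reagent", "open_lid", "close_lid"]
    return ["open_lid"] + [step for step in previous if step != "open_lid"]

final_plan = plan_with_verification("add reagent to vial", toy_propose, toy_verify)
```

In this toy run the first proposal violates the precedence constraint, the violation is fed back to the proposer, and the second proposal passes verification.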

Abstract

Robotic assistance in scientific laboratories requires procedurally correct long-horizon manipulation, reliable execution under limited supervision, and robustness in low-demonstration regimes. Such conditions greatly challenge end-to-end vision-language-action (VLA) models, whose assumptions of recoverable errors and data-driven policy learning often break down in protocol-sensitive experiments. We propose CAPER, a framework for Constrained And ProcEdural Reasoning for robotic scientific experiments, which explicitly restricts where learning and reasoning occur in the planning and control pipeline. Rather than strengthening end-to-end policies, CAPER enforces a responsibility-separated structure: task-level reasoning generates procedurally valid action sequences under explicit constraints, mid-level multimodal grounding realizes subtasks without delegating spatial decision-making to large language models, and low-level control adapts to physical uncertainty via reinforcement learning with minimal demonstrations. By encoding procedural commitments through interpretable intermediate representations, CAPER prevents execution-time violations of experimental logic, improving controllability, robustness, and data efficiency. Experiments on a scientific workflow benchmark and a public long-horizon manipulation dataset demonstrate consistent improvements in success rate and procedural correctness, particularly in low-data and long-horizon settings.
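To make the responsibility-separated structure concrete, here is a minimal sketch of how the three levels might be wired together. The class and field names (`LayeredPipeline`, `plan_task`, `ground_subtask`, `execute_primitive`) are hypothetical, the three callables are stubs standing in for the LLM planner, the multimodal grounding module, and the learned controller, and aborting on the first failed primitive is an assumption of this sketch, not a behavior stated in the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class LayeredPipeline:
    # Task level: goal text -> symbolic subtasks. Receives no observations.
    plan_task: Callable[[str], List[str]]
    # Mid level: (subtask, current observation) -> action primitives. Never sees the goal.
    ground_subtask: Callable[[str, Any], List[str]]
    # Low level: executes one primitive and reports whether it succeeded.
    execute_primitive: Callable[[str], bool]

    def run(self, goal: str, observe: Callable[[], Any]) -> Tuple[bool, List[Tuple[str, str, bool]]]:
        plan = self.plan_task(goal)  # the plan is fixed before execution starts
        log: List[Tuple[str, str, bool]] = []
        for subtask in plan:
            for primitive in self.ground_subtask(subtask, observe()):
                ok = self.execute_primitive(primitive)
                log.append((subtask, primitive, ok))
                if not ok:
                    return False, log  # assumption: stop at the first failure
        return True, log

# Stub components, for illustration only.
pipeline = LayeredPipeline(
    plan_task=lambda goal: ["open_lid", "pour_reagent"],
    ground_subtask=lambda subtask, obs: [f"move_to:{subtask}", f"actuate:{subtask}"],
    execute_primitive=lambda primitive: True,
)
success, log = pipeline.run("prepare the sample", observe=lambda: None)
```

The point of the design is visible in the signatures: only `plan_task` receives the goal, so the lower levels can be influenced by it only through the subtasks in the plan.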

Paper Structure (38 sections, 3 theorems, 12 equations, 22 figures, 9 tables)

Key Result

Lemma 1

If the execution policy $\pi$ depends only on observations $o_{1:T}$ given $\mathcal{S}$, then the goal $\mathcal{G}$ influences execution only through the symbolic plan $\mathcal{S}$.
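One plausible way to write the lemma's condition and conclusion is given below. This is a sketch under assumptions, not the paper's own equation: the action sequence $a_{1:T}$ is a symbol introduced here for illustration, and the condition is read as saying that the policy's action distribution is unchanged when $\mathcal{G}$ is added to the conditioning set.

$$
\pi(a_{1:T} \mid o_{1:T}, \mathcal{S}, \mathcal{G}) = \pi(a_{1:T} \mid o_{1:T}, \mathcal{S})
\quad\Longrightarrow\quad
a_{1:T} \perp \mathcal{G} \mid (o_{1:T}, \mathcal{S})
$$

Under this reading, once the plan $\mathcal{S}$ and the observations are fixed, knowing the goal adds no further information about the executed actions.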

Figures (22)

  • Figure 1: Overview of the CAPER framework. CAPER decomposes long-horizon tasks into modular stages. First, an LLM generates a procedurally valid sequence of high-level subtasks in symbolic space, without access to visual or control signals. A multimodal planner, composed of a multimodal predictor and a VLM, then grounds each subtask by conditioning on the current and predicted goal images, producing sequences of action primitives. Finally, a low-level controller executes these primitives in the environment, enabling safe and consistent task completion.
  • Figure 2: An example of the task-level planner using CoT reasoning. The LLM is guided through a 4-step process: (1) clarify the goal, (2) identify required objects and conditions, (3) decompose into basic operations, and (4) validate task completion.
  • Figure 3: An example prompt used for the VLM.
  • Figure 4: Examples of predicted future visual frames generated by the multimodal predictor under different task conditions.
  • Figure 5: Visual clarity and semantic relevance of predicted images.
  • ...and 17 more figures

Theorems & Definitions (3)

  • Lemma 1: Task-symbolic sufficiency
  • Lemma 2: Conditional independence
  • Proposition 1: Factorization of success probability