Keyframe-Guided Structured Rewards for Reinforcement Learning in Long-Horizon Laboratory Robotics

Yibo Qiu, Shu'ang Sun, Haoliang Ye, Ronald X Xu, Mingzhai Sun

TL;DR

We propose a Keyframe-Guided Reward Generation Framework that automatically extracts kinematics-aware keyframes from demonstrations, generates stage-wise targets in latent space with a diffusion-based predictor, and constructs a geometric, progress-based reward to guide online reinforcement learning.

Abstract

Long-horizon precision manipulation in laboratory automation, such as pipette tip attachment and liquid transfer, requires policies that respect strict procedural logic while operating in continuous, high-dimensional state spaces. However, existing approaches struggle with reward sparsity, multi-stage structural constraints, and noisy or imperfect demonstrations, leading to inefficient exploration and unstable convergence. We propose a Keyframe-Guided Reward Generation Framework that automatically extracts kinematics-aware keyframes from demonstrations, generates stage-wise targets via a diffusion-based predictor in latent space, and constructs a geometric progress-based reward to guide online reinforcement learning. The framework integrates multi-view visual encoding, latent similarity-based progress tracking, and human-in-the-loop reinforcement fine-tuning on a Vision-Language-Action backbone to align policy optimization with the intrinsic stepwise logic of biological protocols. Across four real-world laboratory tasks, including high-precision pipette attachment and dynamic liquid transfer, our method achieves an average success rate of 82% after 40 to 60 minutes of online fine-tuning, compared with 42% for HG-DAgger and 47% for HIL-ConRFT. These results indicate that structured keyframe-guided rewards help overcome exploration bottlenecks and offer a scalable solution for high-precision, long-horizon robotic laboratory automation.
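The stage-wise, similarity-based progress reward described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' formulation: the use of cosine similarity, the `reach_threshold` parameter, the normalization `(stage + progress) / num_stages`, and the names `cosine_similarity` and `StagewiseReward` are all assumptions made here for illustration.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two latent vectors (assumed similarity measure)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


class StagewiseReward:
    """Hypothetical dense reward that tracks progress through ordered keyframes."""

    def __init__(self, keyframe_latents, reach_threshold=0.95):
        # keyframe_latents: ordered latent targets, one per task stage.
        self.keyframes = [np.asarray(k, dtype=float) for k in keyframe_latents]
        self.reach_threshold = reach_threshold  # assumed stage-completion criterion
        self.stage = 0

    def __call__(self, latent):
        num_stages = len(self.keyframes)
        if self.stage >= num_stages:
            return 1.0  # every keyframe has been reached
        sim = cosine_similarity(latent, self.keyframes[self.stage])
        progress = max(sim, 0.0)  # ignore negative similarity
        reward = (self.stage + progress) / num_stages
        if sim >= self.reach_threshold:
            self.stage += 1  # advance only in order, preserving step logic
        return reward
```

Because the stage index only advances when the current keyframe is reached, a state that resembles a later keyframe earns no extra credit until the earlier stages are complete, which is one simple way to encode the procedural ordering the abstract emphasizes.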

Paper Structure

This paper contains 23 sections, 16 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Illustration of our Keyframe-guided Online RL framework. Given an initial observation, the framework generates a sequence of multi-view keyframes (e.g., Position $\rightarrow$ Aspirate $\rightarrow$ Lift) to represent the target task. These keyframes act as intermediate goals to provide stage-wise rewards via state-similarity calculation. Compared to state-of-the-art baselines, our approach converges significantly faster and achieves an average success rate of 82%, compared with 47% for the HIL-ConRFT baseline.
  • Figure 2: Overview of our Keyframe-guided RL framework. First, the system extracts keyframes by filtering feature dynamics from demonstrations. A diffusion-based generator is then trained to predict these keyframe sequences from initial observations. During online RL, the reward module calculates latent similarity to provide stage-wise guidance. These rewards drive the HIL-ConRFT update for policy learning.
  • Figure 3: Biological laboratory task illustrations and common failure modes. (A) Petri Dish De-lidding. This task involves using a gripper to lift the lid of a petri dish and relocate it to the side. Common failures include premature release or collision with the dish base due to insufficient height. (B) Centrifuge Tube Loading. This task involves picking a tube from a rack and placing it into a centrifuge slot. Common failures include missing the tube, dropping it early, or colliding with the centrifuge rim. (C) Precision Liquid Transfer. This task involves aspirating liquid from a tube and dispensing it into a petri dish. Common failures include misalignment with the tube opening, collision with the tube wall, or dispensing liquid before reaching the target. (D) Pipette Tip Attachment. This task involves inserting the pipette into a tip box to secure a new tip. Common failures include misalignment, insertion errors, or the tip slipping off while the pipette is raised.
  • Figure 4: Impact of keyframe introduction. Precision liquid transfer requires a strict sequence from insertion to dispensing. Unlike our kinematics-heuristic extraction, the baseline uses uniform sampling and misses critical bottleneck states like insertion and aspiration. This leads to erroneous spatial guidance and execution failure. In contrast, our method anchors these intermediate states, ensuring correct step-wise logic.
  • Figure 5: Learning curves during online training. This figure presents the success rates, intervention rates, and episode lengths for HIL-ConRFT, HG-DAgger, HIL-SERL, and our method across four representative laboratory tasks. The metrics are displayed as a running average over 10 episodes against training time in minutes.
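
The 10-episode running average used for the learning curves in Figure 5 can be illustrated with a short sketch. The trailing-window convention, including the shorter averages over the first few episodes, is an assumption here; the caption does not specify how the window is aligned. The function name `running_average` is hypothetical.

```python
import numpy as np


def running_average(values, window=10):
    """Trailing running average; early entries average over fewer episodes."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    csum = np.cumsum(values)
    for i in range(len(values)):
        lo = max(0, i - window + 1)  # first index inside the window
        total = csum[i] - (csum[lo - 1] if lo > 0 else 0.0)
        out[i] = total / (i - lo + 1)
    return out
```

Applied to a per-episode success indicator (1 for success, 0 for failure), this yields a smoothed success-rate curve; the same smoothing can be applied to intervention rates and episode lengths.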