Incorporating Task Progress Knowledge for Subgoal Generation in Robotic Manipulation through Image Edits

Xuhui Kang, Yen-Ling Kuo

TL;DR

TaKSIE addresses planning in robotic manipulation by injecting task progress knowledge into subgoal generation. It jointly trains a recurrent progress encoder and a latent diffusion model to generate the next visual subgoal conditioned on the current observation and the language goal, and it rolls out subgoals adaptively under the guidance of a progress evaluator. Ground-truth subgoals are derived from time-contrastive video representations, which enables training on demonstrated trajectories and improves consistency across multimodal task solutions. TaKSIE is evaluated in simulation and on a real robot and achieves state-of-the-art performance on the CALVIN manipulation benchmark. It is also data-efficient and robust to varying initial robot poses and demonstration speeds, which highlights the practical value of explicit task-progress awareness in visual subgoal generation.

Abstract

Understanding the progress of a task allows humans not only to track what has been done but also to plan better for future goals. We demonstrate TaKSIE, a novel framework that incorporates task progress knowledge into visual subgoal generation for robotic manipulation tasks. We jointly train a recurrent network with a latent diffusion model to generate the next visual subgoal based on the robot's current observation and the input language command. At execution time, the robot leverages a visual progress representation to monitor task progress and adaptively samples the next visual subgoal from the model to guide the manipulation policy. We train and validate our model in simulated and real-world robotic tasks, achieving state-of-the-art performance on the CALVIN manipulation benchmark. We find that including task progress knowledge can improve the robustness of the trained policy to different initial robot poses and to varying movement speeds during demonstrations. The project website can be found at https://live-robotics-uva.github.io/TaKSIE/.
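
The execution-time behavior described in the abstract (monitor progress, then sample the next subgoal when appropriate) can be pictured with a small control loop. The following is only a toy sketch under stated assumptions, not the authors' implementation: the function names, the `threshold` value, the `done` check, and the 1-D "world" in which observations and subgoals are plain floats are all hypothetical stand-ins for the image-based subgoal generator, progress evaluator, and low-level policy.

```python
from typing import Callable, List, Tuple


def adaptive_rollout(
    obs: float,
    generate_subgoal: Callable[[float], float],    # stand-in for the subgoal generator
    progress: Callable[[float, float], float],     # stand-in for the progress evaluator
    policy_step: Callable[[float, float], float],  # stand-in for the low-level policy
    done: Callable[[float], bool],                 # hypothetical task-completion check
    threshold: float = 0.9,                        # assumed value, not from the paper
    max_steps: int = 50,
) -> Tuple[float, List[float]]:
    """Act toward the current subgoal; sample a new one once progress is high."""
    subgoals = [generate_subgoal(obs)]
    for _ in range(max_steps):
        obs = policy_step(obs, subgoals[-1])
        if done(obs):
            break
        # Adaptive part: only request a new subgoal when the current one is
        # (nearly) reached, instead of at a fixed time interval.
        if progress(obs, subgoals[-1]) >= threshold:
            subgoals.append(generate_subgoal(obs))
    return obs, subgoals


# Toy 1-D instantiation: the "task" is to reach position 3.0.
TARGET = 3.0
final_obs, subgoals = adaptive_rollout(
    obs=0.0,
    generate_subgoal=lambda o: min(o + 1.0, TARGET),            # next waypoint
    progress=lambda o, g: max(0.0, 1.0 - abs(g - o)),           # closeness in [0, 1]
    policy_step=lambda o, g: o + max(-0.5, min(0.5, g - o)),    # bounded move
    done=lambda o: o >= TARGET,
)
print(final_obs, subgoals)
```

In this toy setting the loop requests a new waypoint only after the previous one is reached, which mirrors the idea of progress-driven rather than clock-driven subgoal sampling.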

Paper Structure

This paper contains 46 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: An illustration of the idea that a robot generates the visual subgoals incrementally (shown as the increasing numbers) for an input language command. Note that these images are generated using our subgoal generator. The generated subgoals reveal that the robot needs to understand the precondition, e.g., empty the gripper before grasping, and the preferred pose for grasp, e.g., moving it to the right side of the block.
  • Figure 2: An overview of TaKSIE, a framework that incorporates task progress knowledge (encoded in the Progress Encoder and Progress Evaluation) into language-conditioned robotic manipulation tasks using generated subgoals as its conditions for a low-level policy.
  • Figure 3: Comparison between TaKSIE ground-truth (GT) subgoal selection (red arrows, two frames on the right) and fixed-interval subgoal selection (black arrows, two frames at the top).
  • Figure 4: An example rollout demonstrating how the generated subgoals guide the robot for a task: rotate the blue block left.
  • Figure 5: An example for the task "rotate the red block to the left," illustrating how different embeddings affect subgoal selection.
  • ...and 6 more figures