Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation

Tong Zhang, Yingdong Hu, Jiacheng You, Yang Gao

TL;DR

SGRv2 is an imitation learning framework that enhances sample efficiency through improved visual and action representations. It is built on the inductive bias of action locality: the premise that a robot's actions are predominantly influenced by the target object and its interactions with the local environment.

Abstract

Given the high cost of collecting robotic data in the real world, sample efficiency is a consistently compelling pursuit in robotics. In this paper, we introduce SGRv2, an imitation learning framework that enhances sample efficiency through improved visual and action representations. Central to the design of SGRv2 is a critical inductive bias, action locality, which posits that a robot's actions are predominantly influenced by the target object and its interactions with the local environment. Extensive experiments in both simulated and real-world settings demonstrate that action locality is essential for boosting sample efficiency. On RLBench with keyframe control, SGRv2 excels using merely 5 demonstrations and surpasses the RVT baseline in 23 of 26 tasks. Furthermore, when evaluated on ManiSkill2 and MimicGen using dense control, SGRv2's success rate is 2.54 times that of SGR. In real-world environments, with only 8 demonstrations, SGRv2 performs a variety of tasks with markedly higher success rates than baseline models. Project website: http://sgrv2-robot.github.io

Paper Structure (27 sections, 4 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Left: Sample efficiency of SGRv2. We evaluate SGR and SGRv2 on 26 RLBench tasks, with the number of demonstrations ranging from 100 down to 5. The results indicate that, owing to its locality, SGRv2 exhibits exceptional sample efficiency: its success rate declines by only about 10% over this range. Top Right: Overview of simulation results. We test SGRv2 on 3 benchmarks, where it consistently outperforms the baselines. Bottom Right: Tasks of the 3 simulation benchmarks.
  • Figure 2: SGRv2 Architecture. Built upon SGR, SGRv2 integrates locality into its framework through four primary designs: an encoder-decoder architecture for point-wise features, a strategy for predicting relative target positions, a weighted average for focusing on critical local regions, and a dense supervision strategy (not shown in the figure). The illustration depicts the water plants task. For visual simplicity, we omit the proprioceptive data, which is concatenated with the RGB values of the point cloud before being fed into the geometric branch.
  • Figure 3: Emergent Capabilities. We visualize the point-specific weights and find that the points with high weights (in red) align with the object's affordances. Left: toilet seat up. Right: open microwave.
  • Figure 4: Left: Real-robot long-horizon tasks. Right: Success rate (%) of multi-task agents on real-robot tasks. We collect 8 demonstrations and evaluate 10 episodes for each task.
  • Figure 5: Simulation Tasks. Our simulation experiments encompass 26 tasks (1-26) from RLBench, 4 tasks (27-37, where 30-37 are the 8 different YCB [calli2015ycb] objects of the Pick Single YCB task) from ManiSkill2, and 7 tasks (38-44) from MimicGen.
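The Figure 2 caption names two of SGRv2's locality designs, predicting a relative target position and taking a weighted average that focuses on critical local regions, and Figure 3 visualizes the resulting point-specific weights. The NumPy sketch below illustrates one way these ideas can fit together, under stated assumptions: each point predicts a hypothetical offset to the target, a softmax over hypothetical per-point logits yields the weights, and the target is the weighted average of the point-plus-offset proposals. The softmax normalization, the function names (`predict_target_position`, `weights_to_colors`), and the blue-to-red color mapping are illustrative assumptions, not the authors' implementation; the encoder-decoder and the dense supervision strategy are not modeled.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def predict_target_position(points: np.ndarray,
                            offsets: np.ndarray,
                            logits: np.ndarray) -> np.ndarray:
    """Aggregate per-point relative predictions into one target position.

    Hypothetical interface (assumption, not the SGRv2 code):
    points:  (N, 3) point-cloud coordinates
    offsets: (N, 3) per-point predicted offsets from each point to the target
    logits:  (N,)   per-point weight logits
    """
    candidates = points + offsets        # each point proposes an absolute target
    weights = softmax(logits)            # point-specific weights summing to 1
    return weights @ candidates          # weighted average, shape (3,)


def weights_to_colors(weights: np.ndarray) -> np.ndarray:
    """Map weights to RGB: lowest weight blue, highest weight red (min-max normalized)."""
    w = np.asarray(weights, dtype=float)
    span = w.max() - w.min()
    t = (w - w.min()) / span if span > 0 else np.zeros_like(w)
    return np.stack([t, np.zeros_like(t), 1.0 - t], axis=1)


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    target = np.array([0.5, 0.5, 0.2])
    logits = np.array([0.3, -1.0, 2.0])
    # If every point's offset is exact, the weighted average recovers the target
    # (up to floating-point rounding) however the weights are distributed.
    print(predict_target_position(pts, target - pts, logits))
    # Per-point colors for a Figure 3 style visualization.
    print(weights_to_colors(softmax(logits)))
```

In this sketch, expressing predictions as offsets relative to each point keeps every proposal local, and the weights decide which local proposals dominate the average; when one logit is much larger than the rest, the output collapses to that single point's proposal.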