Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks

Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

TL;DR

This work explores a way to impart both spatial and temporal memory to a vision-language-action (VLA) model by incorporating a language scratchpad, and shows that the scratchpad significantly improves generalization on memory-dependent manipulation tasks for both non-recurrent and recurrent models.

Abstract

Many dexterous manipulation tasks are non-Markovian in nature, yet little attention has been paid to this fact in the recent upsurge of the vision-language-action (VLA) paradigm. Although they are successful in bringing internet-scale semantic understanding to robotics, existing VLAs are primarily "stateless" and struggle with memory-dependent, long-horizon tasks. In this work, we explore a way to impart both spatial and temporal memory to a VLA by incorporating a language scratchpad. The scratchpad makes it possible to memorize task-specific information, such as object positions, and it allows the model to keep track of a plan and of its progress towards subgoals within that plan. We evaluate this approach on a split of memory-dependent tasks from the ClevrSkills environment, on MemoryBench, and on a challenging real-world pick-and-place task. We show that incorporating a language scratchpad significantly improves generalization on these tasks for both non-recurrent and recurrent models.
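The scratchpad loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Policy` signature, the `ScratchpadRollout` class, the `toy_policy` stand-in, and the `key=value` scratchpad format are all hypothetical, chosen only to show how text written at one step can be fed back as input at later steps. The toy task mirrors the real-world example (pick, place, then restore to the initial position).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption, not the paper's API): a policy maps
# (observation, instruction, scratchpad) to (action, updated scratchpad).
Policy = Callable[[str, str, str], Tuple[str, str]]


@dataclass
class ScratchpadRollout:
    """Carries the scratchpad text from one control step to the next."""
    policy: Policy
    instruction: str
    scratchpad: str = ""
    history: List[str] = field(default_factory=list)

    def step(self, observation: str) -> str:
        # The previous scratchpad is part of the input; the policy returns
        # an updated scratchpad that is stored for all subsequent steps.
        action, self.scratchpad = self.policy(
            observation, self.instruction, self.scratchpad
        )
        self.history.append(self.scratchpad)
        return action


def toy_policy(observation: str, instruction: str, scratchpad: str) -> Tuple[str, str]:
    """Stand-in for a VLA: notes the initial object position and the subgoal."""
    if not scratchpad:
        # First step: memorize where the object started (spatial memory).
        return "pick", f"initial_position={observation}; subgoal=place"
    if "subgoal=place" in scratchpad:
        # Track progress through the plan (temporal memory).
        return "place", scratchpad.replace("subgoal=place", "subgoal=restore")
    # Restore using the remembered position, which the current
    # observation no longer contains.
    remembered = scratchpad.split(";")[0].split("=")[1]
    return f"move_to {remembered}", scratchpad.replace("subgoal=restore", "subgoal=done")


rollout = ScratchpadRollout(
    toy_policy, "pick the tomato, place it in the bowl, then restore it"
)
actions = [rollout.step(obs) for obs in ["(0.1, 0.2)", "in_gripper", "in_bowl"]]
print(actions)
print(rollout.scratchpad)
```

In this toy setting the final action depends on information that only the first observation contained, which is why a policy that sees only the current observation could not solve it.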

Paper Structure (16 sections, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: Scratchpad-augmented VLAs. The VLA generates and updates a scratchpad, which is stored and provided as part of the input context for all subsequent steps, creating an explicit, evolving memory trace of its own behavior.
  • Figure 2: Example memory-dependent tasks from ClevrSkills-Mem that we evaluate on.
  • Figure 3: Results on the ClevrSkills-Mem benchmark. We show results for all the models considered, with and without a scratchpad, on 5 tasks of the ClevrSkills-Mem benchmark. We report the success rate of each model over 50 rollouts with unseen starting positions of objects. The average performance across all tasks is shown on the right.
  • Figure 4: Average trajectory length of tasks in ClevrSkills-Mem. Here, TRP denotes Touch-Reset-Pick, PNR denotes Place-Next-to-Restore, ST denotes Stack-and-Topple, Sp denotes Swap, and RR denotes Rotate-Restore.
  • Figure 5: Key frames of the real-world task: pick the tomato, place it in the bowl, and then restore it to its initial position.