Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks

Wang Honghui, Jing Zhi, Ao Jicong, Song Shiji, Li Xuelong, Huang Gao, Bai Chenjia

TL;DR

This paper presents RuleSafe, a new articulated manipulation benchmark built upon a scalable LLM-aided simulation framework, and VQ-Memory, a compact and structured temporal representation that uses vector-quantized variational autoencoders (VQ-VAEs) to encode past proprioceptive states into discrete latent tokens.

Abstract

The high cost of collecting real-robot data has made robotic simulation a scalable platform for both evaluation and data generation. Yet most existing benchmarks concentrate on simple manipulation tasks such as pick-and-place, failing to capture the non-Markovian characteristics of real-world tasks and the complexity of articulated object interactions. To address this limitation, we present RuleSafe, a new articulated manipulation benchmark built upon a scalable LLM-aided simulation framework. RuleSafe features safes with diverse unlocking mechanisms, such as key locks, password locks, and logic locks, which require different multi-stage reasoning and manipulation strategies. These LLM-generated rules produce non-Markovian and long-horizon tasks that require temporal modeling and memory-based reasoning. We further propose VQ-Memory, a compact and structured temporal representation that uses vector-quantized variational autoencoders (VQ-VAEs) to encode past proprioceptive states into discrete latent tokens. This representation filters low-level noise while preserving high-level task-phase context, providing lightweight yet robust temporal cues that are compatible with existing Vision-Language-Action (VLA) models. Extensive experiments on state-of-the-art VLA models and diffusion policies show that VQ-Memory consistently improves long-horizon planning, enhances generalization to unseen configurations, and enables more efficient manipulation with reduced computational cost. Project page: vqmemory.github.io
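
To make the tokenization idea concrete, here is a minimal sketch of the nearest-codebook lookup that sits at the core of any VQ-VAE: each continuous vector is replaced by the index of its closest codebook entry. This is not the authors' implementation. The codebook values, the 2-D latent size, and the names `quantize` and `encode_history` are assumptions chosen for illustration; in a trained VQ-VAE the latents come from a learned encoder and the codebook is learned jointly, and both steps are skipped here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry nearest to `latent`."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(latent, codebook[k]))


def encode_history(latents: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map a sequence of per-timestep latents to discrete token ids."""
    return [quantize(z, codebook) for z in latents]


# Hypothetical 4-entry codebook over a 2-D latent space (illustrative values).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical noisy latents standing in for encoded proprioceptive states.
history = [(0.05, -0.02), (0.1, 0.08), (0.93, 0.04), (1.02, 0.97), (0.98, 1.05)]

tokens = encode_history(history, CODEBOOK)
print(tokens)  # [0, 0, 1, 3, 3]
```

Nearby inputs collapse onto the same token, so small perturbations disappear while coarse changes in the trajectory show up as token transitions. This is the kind of noise filtering the abstract attributes to the discrete representation.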

Paper Structure (37 sections, 2 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Locking rules from the RuleSafe Benchmark, structured by (a) part phase (knob/handle) and (b) task phase (password cache). Because these governing variables are not directly observable from visual input, the tasks exhibit a non-Markovian property.
  • Figure 2: Overview of the memory-augmented policy formulation. The pretrained Vision-Language Model encodes current observations and task instructions, while the memory tokens $\boldsymbol{m}_t$ provide additional historical context from previous timesteps. The combined representation is fed into the Action Expert, which predicts the future action sequence conditioned on the state $q_t$ and injected noise.
  • Figure 3: Visualization of discrete tokens of joint trajectories without clustering (top: vocabulary size = 256) and with clustering (bottom: vocabulary size = 4). Each column represents a time step, and each row corresponds to one of three different trajectories, where different colors indicate different tokens. Without clustering, high-level temporal regularities are obscured by fine-grained variations, whereas clustering emphasizes shared semantic patterns across trajectories, enabling clearer identification of task execution stages.
  • Figure 4: Statistics of Demonstration Generation in RuleSafe: Success Rate and Average Frames.
  • Figure 5: List of 10 object instances used in RuleSafe.
  • ...and 1 more figures
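
The non-Markovian property described in the Figure 1 caption can be illustrated with a toy example: two safes that look identical to a camera but respond differently to the same action because a hidden variable differs. The `ToySafe` class, its two-digit password, and the `"door_closed"` observation string are hypothetical and are not taken from the RuleSafe benchmark; this is only a sketch of why a policy conditioned on the current observation alone cannot tell the two situations apart.

```python
from typing import List, Tuple


class ToySafe:
    """Toy password lock: the handle opens only after the right digit sequence."""

    def __init__(self, password: Tuple[int, ...]) -> None:
        self.password = password
        self.cache: List[int] = []  # hidden task-phase variable (password cache)

    def press(self, digit: int) -> None:
        """Record a key press; keep only the most recent len(password) digits."""
        self.cache.append(digit)
        self.cache = self.cache[-len(self.password):]

    def observe(self) -> str:
        """What a camera would see: a closed door, regardless of the cache."""
        return "door_closed"

    def pull_handle(self) -> bool:
        """The handle opens the door only if the cached digits match the password."""
        return tuple(self.cache) == self.password


safe_a = ToySafe(password=(3, 1))
safe_b = ToySafe(password=(3, 1))

safe_a.press(3)
safe_a.press(1)  # correct sequence entered
safe_b.press(3)  # sequence still incomplete

same_observation = safe_a.observe() == safe_b.observe()
print(same_observation, safe_a.pull_handle(), safe_b.pull_handle())  # True True False
```

Both safes produce the same observation, yet pulling the handle succeeds on one and fails on the other. Deciding which action is correct requires remembering which digits were already pressed, and a memory representation over past states is meant to supply that information.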