
History-Aware Visuomotor Policy Learning via Point Tracking

Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu

Abstract

Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements (task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory) and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: http://tonyfang.net/history

Paper Structure

This paper contains 19 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: History-Aware Visuomotor Policy Architecture. (Left) We first identify and segment task-related objects via SAM [sam2]. From each object, several points are sampled and tracked using an off-the-shelf point tracker [cotracker, tapip3d], producing point tracks that serve as our history representation. Each track is then encoded into a track token, and the tokens from all points of an object are aggregated into a single history-aware object track token. These object track tokens, together with the original observation tokens and other tokens, are fed into the transformer backbone of a visuomotor policy [act, rise, dp] to generate history-aware robot actions. (Right) The track encoder first encodes each track patch using an MLP, then applies a cross-attention module with temporal positional encodings to produce a compact point track token from point tracks of arbitrary length, effectively compressing and preserving historical information.
  • Figure 2: Comparison of Different History Representations. Suppose the full history horizon is $t$ and the observation image size is $H\times W$. (Gray) Treating history as video is redundant and computationally expensive. (Orange) History overlays are ineffective for short horizons and cluttered for long horizons. (Green) Our object-centric point track representation is efficient while preserving motion patterns and object-centric dynamics across all horizons.
  • Figure 3: Tasks and Evaluation Aspects. (Top) Five evaluation aspects of history-awareness: (1) counting evaluates the policy's ability to track repeated actions; (2) spatial memorization tests remembering spatial locations during manipulation; (3) task stage identification requires inferring the correct phase in long-horizon tasks; (4) pre-loaded memory assesses use of pre-loaded information for decision-making; and (5) continuous memory examines the ability to retain continuous history for critical decisions. (Bottom) Detailed descriptions show the full process of each task, specifying every decision phase (D1, D2, $\cdots$) where history is required for correct action decisions (green arrow). The bookmarks at the top right of each task indicate which aspects of history-awareness that task requires.
  • Figure 4: Evaluation Results on Five History-Awareness Aspects. Average success rate and decision accuracy across five history-awareness aspects are reported, with each dimension representing the average performance of the policies on the corresponding tasks.
  • Figure 5: Comparison of Different Trackers on the Guess-Hard Task. 2D trackers suffer from tracking failures under occlusion, which occurs frequently during manipulation, while 3D trackers maintain more robust point tracking performance.
  • ...and 1 more figure
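
The track-encoding pipeline described in the Figure 1 caption (patch-wise MLP, cross-attention with temporal positional encodings, then object-level aggregation) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation. Everything not stated in the caption is an assumption made for the sketch: the token width `D`, the patch length `PATCH`, 3D point coordinates, sinusoidal positional encodings, a single learned query with no key/value projections, mean pooling as the aggregation step, and the random, untrained weights. All class and function names are hypothetical.

```python
import math
import random

D = 8      # token width (hypothetical value)
PATCH = 4  # time steps per track patch (hypothetical value)

def init_linear(rng, d_in, d_out):
    """Random weights and zero bias for one linear layer (untrained)."""
    s = 1.0 / math.sqrt(d_in)
    weights = [[rng.uniform(-s, s) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out

def linear(x, params):
    """Compute W x + b for a single input vector x."""
    w, b = params
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def temporal_pe(pos, dim):
    """Sinusoidal positional encoding for one patch index (assumed form)."""
    return [math.sin(pos / 10000 ** (j / dim)) if j % 2 == 0
            else math.cos(pos / 10000 ** ((j - 1) / dim)) for j in range(dim)]

def make_patches(track, patch=PATCH):
    """Split a non-empty track [(x, y, z), ...] into flattened patches.

    The final patch is padded by repeating the last tracked point.
    """
    padded = list(track) + [track[-1]] * (-len(track) % patch)
    return [[c for point in padded[i:i + patch] for c in point]
            for i in range(0, len(padded), patch)]

class TrackEncoder:
    """Maps a point track of arbitrary length to one fixed-size track token."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(rng, 3 * PATCH, D)  # MLP layer 1
        self.fc2 = init_linear(rng, D, D)          # MLP layer 2
        self.query = [rng.uniform(-1, 1) for _ in range(D)]  # stand-in for a learned query

    def attention_weights(self, keys):
        """Softmax over scaled dot products between the query and each key."""
        scores = [sum(q * k for q, k in zip(self.query, key)) / math.sqrt(D)
                  for key in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def encode(self, track):
        keys = []
        for i, patch in enumerate(make_patches(track)):
            hidden = [max(0.0, v) for v in linear(patch, self.fc1)]  # ReLU
            emb = linear(hidden, self.fc2)
            # Add the temporal positional encoding of this patch.
            keys.append([a + b for a, b in zip(emb, temporal_pe(i, D))])
        weights = self.attention_weights(keys)
        # Keys double as values here; a full module would project them separately.
        return [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(D)]

def object_token(encoder, tracks):
    """Aggregate the track tokens of one object by mean pooling (an assumption)."""
    tokens = [encoder.encode(t) for t in tracks]
    return [sum(col) / len(tokens) for col in zip(*tokens)]

enc = TrackEncoder()
short_track = [(0.1 * t, 0.0, 0.5) for t in range(5)]
long_track = [(0.1 * t, 0.2, 0.5) for t in range(37)]
print(len(enc.encode(short_track)), len(enc.encode(long_track)),
      len(object_token(enc, [short_track, long_track])))  # prints: 8 8 8
```

The point of the sketch is the shape behavior: a 5-step track and a 37-step track both collapse to a token of the same width, so the number of history tokens passed to the policy backbone depends on the number of objects rather than on the history length.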