NoTVLA: Narrowing of Dense Action Trajectories for Generalizable Robot Manipulation

Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen

TL;DR

NoTVLA tackles catastrophic forgetting in Vision-Language-Action models by replacing dense action trajectories with sparse, semantically aligned keyframes focused on the end-effector, achieved through temporal compression and spatial pruning. It introduces anchor-based depth inference (APP) and anchor-conditioned token generation (ACTG), plus a kinematics-driven keyframe selection and a spline-based detokenizer to convert sparse tokens into high-frequency, smooth trajectories. Training on sparse supervision enables zero-shot and cross-embodiment generalization with an order-of-magnitude reduction in compute and without a wrist-mounted camera, while preserving the model's language capabilities for flexible instruction following. The approach yields strong multi-task performance, robust zero-shot generalization, and closer alignment to single-task expert behavior, offering a practical path to scalable, generalist robotic manipulation across diverse embodiments.
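To make the spline-based detokenizer concrete, the sketch below turns a handful of sparse end-effector keyframes into a dense, smooth position trajectory. It is an illustration only: the summary does not say which spline family NoTVLA uses, so the uniform Catmull-Rom spline, the function names `catmull_rom` and `detokenize`, and the `samples_per_segment` parameter are all assumptions chosen for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # hypothetical (x, y, z) end-effector position


def catmull_rom(p0: Point, p1: Point, p2: Point, p3: Point, u: float) -> Point:
    """Uniform Catmull-Rom interpolation between p1 and p2 for u in [0, 1]."""
    u2, u3 = u * u, u * u * u
    return tuple(
        0.5 * (
            2.0 * b
            + (c - a) * u
            + (2.0 * a - 5.0 * b + 4.0 * c - d) * u2
            + (-a + 3.0 * b - 3.0 * c + d) * u3
        )
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def detokenize(keyframes: Sequence[Point], samples_per_segment: int = 10) -> List[Point]:
    """Expand sparse keyframes into a dense trajectory that passes through each one.

    Assumption: a Catmull-Rom spline stands in for whatever spline the paper uses.
    """
    if len(keyframes) < 2:
        return list(keyframes)
    # Duplicate the endpoints so the first and last segments have four control points.
    padded = [keyframes[0], *keyframes, keyframes[-1]]
    dense: List[Point] = []
    for i in range(1, len(padded) - 2):
        for s in range(samples_per_segment):
            u = s / samples_per_segment
            dense.append(catmull_rom(padded[i - 1], padded[i], padded[i + 1], padded[i + 2], u))
    # Close the trajectory exactly on the final keyframe.
    dense.append(tuple(float(v) for v in keyframes[-1]))
    return dense


if __name__ == "__main__":
    sparse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    trajectory = detokenize(sparse, samples_per_segment=10)
    print(len(trajectory), trajectory[0], trajectory[-1])
```

An interpolating spline is used here so that the dense trajectory visits every predicted keyframe; a real controller would also need to handle orientation, gripper state, and timing, which this sketch omits.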

Abstract

Vision-Language-Action (VLA) models represent a pivotal advance in embodied intelligence, yet they confront critical barriers to real-world deployment, most notably catastrophic forgetting. This issue stems from their overreliance on continuous action sequences or action chunks, which inadvertently create isolated data silos that disrupt knowledge retention across tasks. To tackle these challenges, we propose the Narrowing of Trajectory VLA (NoTVLA) framework: a novel approach that narrows its focus to sparse trajectories, thereby avoiding the catastrophic forgetting associated with dense trajectory fine-tuning. A key innovation of NoTVLA lies in its trajectory planning strategy: instead of centering on the target object's trajectory, it applies temporal compression and spatial reasoning pruning specifically to the trajectory of the robot's end effector. Furthermore, training is conducted on these sparse trajectories rather than on dense action trajectories, which yields practical advantages, including better zero-shot performance. In multi-task evaluation scenarios, NoTVLA achieves superior performance and generalization compared to $\pi_0$ while operating under two critical constraints: it uses over an order of magnitude less computing power than $\pi_0$ and requires no wrist-mounted camera. With this design, NoTVLA's operational accuracy closely approximates that of single-task expert models. Crucially, it also preserves the model's inherent language capabilities, enabling zero-shot generalization in specific scenarios, supporting unified model deployment across multiple robot platforms, and fostering a degree of generalization even when perceiving tasks from novel perspectives.

Paper Structure

This paper contains 36 sections, 16 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Overview of the NoTVLA framework. NoTVLA addresses catastrophic forgetting in Vision-Language-Action (VLA) models by replacing dense action trajectories with sparse, semantically aligned keyframes. Large-scale trajectory data (8.1K trajectories, 100K keyframes) are processed via kinematics-based keyframe and sub-keyframe selection. Instructions and RGB inputs are encoded with Qwen VL 2.5 (7B) to predict anchor points, which are combined with depth queries for anchor-conditioned token generation. A spline-based detokenizer converts these discrete action tokens into smooth, high-frequency trajectories for closed-loop robot control. The framework generalizes across robot embodiments and tasks, supporting zero-shot execution, multi-view robustness, and real-world deployment.
  • Figure 2: Anchor prediction and token generation in NoTVLA. Instructions with RGBD input yield 2D and depth anchors, which condition action tokens. These tokens are converted into trajectories aligned with predicted anchors for precise manipulation.
  • Figure 3: Keyframe selection based on gripper posture. The images show the gripper's movements over time, with keyframes selected based on the open/close states of the left and right arms. The blue and red lines represent the gripper's trajectory, with the corresponding open and close states annotated, highlighting the transitions between key poses during object manipulation.
  • Figure 4: Training steps and average success rate of different works. NoTVLA uses a 7B model, close to RDT and $\pi_0$. Bubble size indicates the number of single-task training steps used by each work. Even without single-task training, NoTVLA performs better than the other models.
  • Figure 5: Franka Operation in the RoboTwin Simulation
  • ...and 4 more figures
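
The Figure 3 caption describes keyframes chosen from the open/close states of the left and right grippers. A minimal sketch of that idea follows, assuming each arm's gripper opening is available as a per-frame number. The function names, the `threshold` value, and the rule of always keeping the first and last frames are hypothetical choices for this illustration; the paper's kinematics-based selection may use additional criteria.

```python
from typing import List, Sequence


def select_keyframes(gripper_open: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return frame indices where the gripper switches between open and closed.

    Assumption: a frame counts as "open" when its opening exceeds `threshold`.
    The first and last frames are always kept so the trajectory has endpoints.
    """
    if not gripper_open:
        return []
    states = [width > threshold for width in gripper_open]
    keyframes = [0]
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            keyframes.append(i)
    last = len(states) - 1
    if keyframes[-1] != last:
        keyframes.append(last)
    return keyframes


def merge_arms(left: Sequence[int], right: Sequence[int]) -> List[int]:
    """Union of the keyframe indices from both arms, in temporal order."""
    return sorted(set(left) | set(right))


if __name__ == "__main__":
    left_widths = [1.0, 1.0, 0.9, 0.2, 0.1, 0.1, 0.8, 1.0]
    right_widths = [1.0, 1.0, 1.0, 1.0, 0.1, 0.1, 0.1, 0.1]
    left_keys = select_keyframes(left_widths)
    right_keys = select_keyframes(right_widths)
    print(left_keys, right_keys, merge_arms(left_keys, right_keys))
```

Taking the union across arms keeps a keyframe whenever either gripper changes state, which is one simple way to handle bimanual episodes in this sketch.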