
AnchorVLA: Anchored Diffusion for Efficient End-to-End Mobile Manipulation

Jia Syuen Lim, Zhizhen Zhang, Peter Bohm, Brendan Tidd, Zi Huang, Yadan Luo

Abstract

A central challenge in mobile manipulation is preserving multiple plausible action modes while remaining reactive during execution. A bottle in a cluttered scene can often be approached and grasped in multiple valid ways. Robust behavior depends on preserving this action diversity while remaining reactive as the scene evolves. Diffusion policies are appealing because they model multimodal action distributions rather than collapsing to one solution. In practice, however, full iterative denoising is costly at control time. Action chunking helps amortize inference, yet it also creates partially open-loop behavior, allowing small mismatches to accumulate into drift. We present AnchorVLA, a diffusion-based VLA policy for mobile manipulation built on the core insight that when sampling begins near a plausible solution manifold, extensive denoising is unnecessary to recover multimodal, valid actions. AnchorVLA combines a lightweight VLA adaptation backbone with an anchored diffusion action head, which denoises locally around anchor trajectories using a truncated diffusion schedule. This retains multimodal action generation while reducing inference cost for closed-loop control. Crucially, to mitigate chunking-induced drift, we introduce a test-time self-correction mechanism via a lightweight residual correction module that makes high-frequency, per-step adjustments during rollout. Across diverse mobile manipulation tasks, AnchorVLA improves success and stability under disturbances and distribution shifts while maintaining low-latency inference. The source code is made available at https://github.com/jason-lim26/AnchorVLA.
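The anchored, truncated denoising idea can be illustrated with a minimal numpy sketch: instead of sampling from pure noise at the full schedule length, the anchor trajectory is perturbed to an intermediate noise level and denoised for only a few reverse steps. All names here (`truncated_anchored_denoise`, `t_start`, the toy `step` function) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def truncated_anchored_denoise(anchor, denoise_step, t_start=10, t_full=100, rng=None):
    """Sample an action trajectory by denoising locally around an anchor.

    Instead of starting from pure Gaussian noise at t = t_full, we perturb
    the anchor to an intermediate noise level t_start << t_full and run
    only t_start reverse steps (a truncated schedule).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Cosine-style noise level for the truncated starting step (assumed schedule).
    alpha = np.cos(0.5 * np.pi * t_start / t_full) ** 2
    x = np.sqrt(alpha) * anchor + np.sqrt(1 - alpha) * rng.standard_normal(anchor.shape)
    for t in range(t_start, 0, -1):
        x = denoise_step(x, t)  # one reverse-diffusion step (a model call in practice)
    return x

# Toy denoiser that nudges samples back toward a valid trajectory manifold.
target = np.linspace(0.0, 1.0, 8)            # an 8-step 1-D "trajectory"
step = lambda x, t: x + 0.3 * (target - x)   # stand-in for a learned DiT step
sample = truncated_anchored_denoise(target.copy(), step, t_start=10, t_full=100)
print(np.abs(sample - target).mean())        # small: few steps suffice near the anchor
```

Because sampling starts near the anchor, the residual after only ten toy steps is already tiny, which is the efficiency argument behind the truncated schedule.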

Paper Structure

This paper contains 24 sections, 14 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of different action generation policies for manipulation. Left: Tabletop manipulation is often close to single-modal, where $L_1$-based regression in VLA is sufficient. Right: Mobile manipulation is inherently multimodal; $L_1$-based policies tend to average across modes and become infeasible, while diffusion-based policies are more expressive but prone to stochastic drift. AnchorVLA captures multimodal behaviors while remaining stable and executable.
  • Figure 2: Overview of the AnchorVLA framework. (Top) Training: A LoRA fine-tuned VLM is used to extract rich semantic and spatial features from multimodal inputs to condition the generative DiT policy via bridge attention. To effectively model the multimodal distribution of expert demonstrations, the DiT first performs iterative denoising on a set of anchored trajectories. Once denoised, a scoring head evaluates these predictions against the ground truth, selecting the optimal denoised trajectory $(\tau^*)$ closest to the expert demonstration for reconstruction. (Bottom) Inference: During the execution of temporally extended action chunks, minor kinematic deviations naturally accumulate into open-loop execution drift (red), which can lead to task failures. To mitigate this, a dedicated residual correction module dynamically predicts state-dependent micro-adjustments $(r_\psi)$. These continuous refinements adapt the primary macro-trajectory into a precise, corrected trajectory (blue-dashed), ensuring successful end-to-end mobile manipulation.
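The per-step residual correction during chunk execution can be sketched as follows; this is a toy illustration under assumed interfaces (`get_state`, `residual_fn`, `apply_action` are hypothetical names, and the constant-compensation residual stands in for the learned module $r_\psi$).

```python
import numpy as np

def execute_chunk_with_correction(chunk, get_state, residual_fn, apply_action):
    """Execute a temporally extended action chunk with per-step residual correction.

    chunk:        (H, action_dim) macro-trajectory from the diffusion head
    get_state:    returns the current proprioceptive/visual state
    residual_fn:  lightweight module r_psi(state, action) -> small correction
    apply_action: sends the corrected action to the robot
    """
    for a in chunk:
        s = get_state()
        a_corrected = a + residual_fn(s, a)  # high-frequency micro-adjustment
        apply_action(a_corrected)

# Toy rollout: a 1-D integrator with a constant disturbance (drift source).
state = {"x": 0.0}
drift = 0.05
chunk = np.full((10, 1), 0.1)                    # commanded steps, 1.0 total
residual = lambda s, a: np.array([-drift])       # assumed learned compensation
def step(a): state["x"] += float(a[0]) + drift   # plant with disturbance

execute_chunk_with_correction(chunk, lambda: state["x"], residual, step)
print(state["x"])  # ~1.0: the intended displacement, drift cancelled per step
```

Without the residual term the same rollout would overshoot to 1.5, illustrating how small per-step mismatches accumulate over an open-loop chunk.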
  • Figure 3: Experimental Environment. Simulation experiments are conducted in ManiSkill-HAB. Real-world experiments are evaluated on the Unitree Go2 quadruped equipped with an SO101 arm. The tasks require the robot to first navigate to the target location and then interact with the objects. The perception module includes two camera views, i.e., a base view from the Go2 and a wrist view from the SO101. Data collection was conducted by teleoperation with a Meta Quest 3 VR device.
  • Figure 4: Effect of chunked execution horizon on robustness and efficiency. Increasing $H$ reduces VLA query frequency and per-episode compute (e.g., from 641.4 TFLOPs at H=1 to 128.29 TFLOPs at H=5), but also makes action prediction harder. The deterministic $L_1$ regression baseline degrades sharply as chunk length increases (42.8% $\rightarrow$ 8.8%), whereas AnchorVLA degrades more gradually (50.6% $\rightarrow$ 27.5%) and remains competitive at H=8 and H=10, indicating greater robustness to long-horizon chunked execution.
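The amortization arithmetic behind this caption is simple: with chunk horizon $H$, the VLA backbone is queried once every $H$ control steps, so per-episode backbone compute scales roughly as $1/H$ (641.4 / 5 ≈ 128.3, matching the caption's figure up to rounding).

```python
# Per-episode VLA compute under chunked execution (H=1 figure from the caption).
tflops_h1 = 641.4  # H = 1: one backbone query per control step
for H in (1, 5, 8, 10):
    print(f"H={H}: ~{tflops_h1 / H:.1f} TFLOPs/episode")
```

The trade-off the figure highlights is that this compute saving comes at the price of harder, longer-horizon action prediction.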
  • Figure 5: Qualitative visualization of AnchorVLA in the ManiSkill-HAB environment. The figure illustrates the task progress (left $\rightarrow$ right) for Pick Bowl and Place Apple. The overlaid white point clouds represent the diverse, multimodal trajectory proposals generated by our model, showing its ability to capture complex action distributions without suffering from mode-averaging. The colored points highlight the optimal denoised trajectory dynamically selected by the scoring head for execution.
  • ...and 2 more figures