ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li

TL;DR

ROCKET is a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another via a layer-invariant mapping, which reduces gradient conflicts.

Abstract

Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective remedy is representation alignment, in which a strong vision foundation model guides a 2D VLA model. However, existing methods usually apply supervision at only a single layer and so fail to exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and we further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving a state-of-the-art 98.5% success rate on LIBERO. We further demonstrate the superior performance of ROCKET on LIBERO-Plus and RoboTwin, as well as across multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
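The shared-projector idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear projector `(W, b)`, the cosine-based loss, the layer pairing, and all shapes are assumptions chosen only to show one projector being reused across every student/teacher layer pair.

```python
import numpy as np

def cosine_alignment_loss(pred, target, eps=1e-8):
    """Mean (1 - cosine similarity) over tokens; inputs are (tokens, dim).
    The cosine loss is an assumption; the abstract does not name the loss."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pred_n * target_n, axis=-1)))

def shared_projector_loss(student_feats, teacher_feats, W, b):
    """Sum the alignment loss over paired layers using ONE projector (W, b),
    i.e. the same (layer-invariant) mapping for every layer pair."""
    total = 0.0
    for h_student, h_teacher in zip(student_feats, teacher_feats):
        total += cosine_alignment_loss(h_student @ W + b, h_teacher)
    return total

# Hypothetical toy sizes: 3 paired layers, 16 tokens each.
rng = np.random.default_rng(0)
tokens, d_student, d_teacher, n_pairs = 16, 32, 24, 3
W = rng.normal(scale=0.1, size=(d_student, d_teacher))
b = np.zeros(d_teacher)
student_feats = [rng.normal(size=(tokens, d_student)) for _ in range(n_pairs)]

# Teacher features the shared projector matches exactly, versus random ones.
perfect_teacher = [h @ W + b for h in student_feats]
random_teacher = [rng.normal(size=(tokens, d_teacher)) for _ in range(n_pairs)]

loss_perfect = shared_projector_loss(student_feats, perfect_teacher, W, b)
loss_random = shared_projector_loss(student_feats, random_teacher, W, b)
```

Here `loss_perfect` is close to zero because a single mapping explains every layer pair, while `loss_random` is not. A per-layer design would instead keep a separate `(W_l, b_l)` for each pair.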

Paper Structure (86 sections, 1 theorem, 49 equations, 15 figures, 11 tables)

This paper contains 86 sections, 1 theorem, 49 equations, 15 figures, 11 tables.

Key Result

Theorem 8.1

Consider the Pre-LN student eq:preln_residual with multi-layer distillation eq:dist_loss. Under residual-smallness eq:assumption_residual_small, the gradient at an earlier representation $h_i$ admits the superposition form eq:grad_superposition. For the shared projector parameterization ($\phi_l\equ

Figures (15)

  • Figure 1: Cosine similarity between gradients induced by different alignment losses in the VLA's shallow layers. Red indicates gradient coherence, while blue indicates gradient interference. Detailed results are provided in Fig. \ref{fig:grad_sim_early} and Fig. \ref{fig:grad_sim_late}.
  • Figure 2: Overview of ROCKET. On the left, we select a sample from LIBERO to showcase ROCKET's performance. On the right, we use the outputs from VGGT layers $\{1, 3, 9, 15, 21\}$ to directly predict the depth map, demonstrating that different layers contain rich 3D information. Some of the illustrations were generated by GPT-4o \cite{hurst2024gpt}.
  • Figure 3: Performance on LIBERO across different training stages. See Table \ref{tab:libero_training_steps} for details.
  • Figure 4: A cone-effect perspective on alignment.
  • Figure 5: Success rate (%) on LIBERO-Plus. Results are reported under the seven perturbations defined by LIBERO-Plus. Each column shows the average success rate over four task groups: Spatial, Object, Goal, and Long. Detailed results are provided in Appendix \ref{app:libero_plus_details}.
  • ...and 10 more figures
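
The quantity plotted in Figure 1, the cosine similarity between gradients induced by different alignment losses, can be sketched on a toy problem. The squared-error losses, the linear map `W`, and the targets below are hypothetical stand-ins and not the paper's setup; the sketch only shows how a pairwise gradient-similarity matrix is formed and how to read its sign.

```python
import numpy as np

def grad_wrt_h(W, h, target):
    """Gradient of 0.5 * ||W @ h - target||^2 with respect to h."""
    return W.T @ (W @ h - target)

def gradient_similarity_matrix(grads, eps=1e-12):
    """Pairwise cosine similarity between a list of gradient vectors."""
    G = np.stack(grads)
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)
    return G @ G.T

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
h = rng.normal(size=d_in)            # a shared earlier representation
W = rng.normal(size=(d_out, d_in))   # toy linear map (assumption)
t = rng.normal(size=d_out)           # toy target (assumption)

residual = W @ h - t
g_a = grad_wrt_h(W, h, t)                        # loss A
g_b = grad_wrt_h(W, h, W @ h + residual)         # loss B pulls the opposite way
g_c = grad_wrt_h(W, h, rng.normal(size=d_out))   # loss C with an unrelated target

sim = gradient_similarity_matrix([g_a, g_b, g_c])
```

Entries of `sim` near +1 correspond to coherent gradients (red in Figure 1) and entries near -1 to interfering gradients (blue). In this toy case `g_b` is exactly the negation of `g_a`, so `sim[0, 1]` is approximately -1.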

Theorems & Definitions (1)

  • Theorem 8.1: Shared projector induces coherent cross-layer interference under Pre-LN transport
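
For orientation, the setting of Theorem 8.1 can be sketched with the standard chain-rule argument for Pre-LN residual networks. The notation below ($F_l$, $\lambda_l$, $t_l$, $\ell$, the layer set $S$, $J_{i \to l}$) is assumed for illustration and is not taken from the paper's referenced equations, which are not reproduced on this page.

```latex
% Pre-LN residual update (standard form; F_l denotes the l-th block)
h_{l+1} = h_l + F_l\big(\mathrm{LN}(h_l)\big)

% Multi-layer alignment objective with projector \phi_l and teacher target t_l
\mathcal{L} = \sum_{l \in S} \lambda_l \, \ell\big(\phi_l(h_l),\, t_l\big)

% Chain rule: the gradient at an earlier representation h_i is a sum
% of per-layer loss gradients transported back through the residual stream
\nabla_{h_i} \mathcal{L}
  = \sum_{l \in S,\; l \ge i} \lambda_l \, J_{i \to l}^{\top} \,
    \nabla_{h_l} \ell\big(\phi_l(h_l),\, t_l\big),
\qquad
J_{i \to l} = \frac{\partial h_l}{\partial h_i}
  = \Big(I + \tfrac{\partial F_{l-1}}{\partial h_{l-1}}\Big) \cdots
    \Big(I + \tfrac{\partial F_i}{\partial h_i}\Big)

% If the residual-branch Jacobians are small, J_{i \to l} \approx I, so
\nabla_{h_i} \mathcal{L}
  \approx \sum_{l \in S,\; l \ge i} \lambda_l \,
    \nabla_{h_l} \ell\big(\phi_l(h_l),\, t_l\big)
```

Under this reading, the per-layer gradients add up almost unchanged at $h_i$, so whether they reinforce or cancel depends on how the projectors $\phi_l$ relate to one another. This is the setting in which the theorem compares a shared projector against per-layer ones.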