ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models

Guoheng Sun; Tingting Du; Kaixi Feng; Chenxiang Luo; Xingguo Ding; Zheyu Shen; Ziyao Wang; Yexiao He; Ang Li

TL;DR

ROCKET is a residual-oriented multi-layer representation alignment framework. It treats multi-layer alignment as aligning one residual stream to another through a layer-invariant mapping, which reduces gradient conflicts.

Abstract

Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective remedy is representation alignment, in which a strong vision foundation model guides a 2D VLA model. However, existing methods usually apply supervision at only a single layer and so fail to exploit the rich information distributed across depth, while naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and we further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving a state-of-the-art success rate of 98.5% on LIBERO. We further demonstrate the superior performance of ROCKET on LIBERO-Plus and RoboTwin, as well as across multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
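The shared-projector idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the linear projector `W`, the `1 - cosine` loss, the uniform averaging over layer pairs, and every function name below are assumptions chosen for illustration, and the Matryoshka-style sparse activation is not modeled. The point it shows is only that one mapping is reused for every student/teacher layer pair instead of one projector per layer.

```python
import math


def matvec(W, x):
    """Apply a linear map W (a list of rows) to a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def shared_projector_loss(student_feats, teacher_feats, W):
    """Average (1 - cosine) over paired layers, using ONE projector W.

    student_feats[i] and teacher_feats[i] are feature vectors taken from the
    i-th selected student layer and its paired teacher layer (hypothetical
    pairing; the layer selection itself is outside this sketch).
    """
    losses = [
        1.0 - cosine(matvec(W, h), z)
        for h, z in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)


# Toy check with an identity projector and two layer pairs.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
student = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
aligned = shared_projector_loss(student, student, identity)
opposed = shared_projector_loss(
    student, [[-1.0, 0.0, 0.0], [0.0, -2.0, 0.0]], identity
)
```

In this toy setup the loss is smallest when the projected student features point in the same direction as the teacher features at every paired layer, and largest when they point in opposite directions.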

Paper Structure (86 sections, 1 theorem, 49 equations, 15 figures, 11 tables)

Introduction
Related Work
Vision-Language-Action Models and Spatial Grounding
Representation Alignment for Spatial Supervision
Multi-layer Distillation and Optimization Challenges
Theoretical Framework: Residual Dynamics and Gradient Coherence
Background
VLA as student and manipulation objective.
3D vision foundation model as teacher.
Single-layer representation alignment.
The Gradient-Conflict of Multi-Layer Alignment
From single-layer to multi-layer alignment: richer cues, but no gains in practice.
Residual-dynamical view: multi-layer alignment should be cone-to-cone, yet learned projectors decouple.
Corollary: early-layer updates are a superposition of future local distillation gradients.
Proposition: Jacobian-induced gradient interference under multiple independent projectors.
...and 71 more sections

Key Result

Theorem 8.1

Consider the Pre-LN student \eqref{eq:preln_residual} with multi-layer distillation \eqref{eq:dist_loss}. Under residual-smallness \eqref{eq:assumption_residual_small}, the gradient at an earlier representation $h_i$ admits the superposition form \eqref{eq:grad_superposition}. For the shared projector parameterization ($\phi_l\equ

Figures (15)

Figure 1: Cosine similarity between gradients induced by different alignment losses in the VLA's shallow layers. Red indicates gradient coherence, while blue indicates gradient interference. Detailed results are provided in Fig. \ref{fig:grad_sim_early} and Fig. \ref{fig:grad_sim_late}.
Figure 2: Overview of ROCKET. On the left, we select a sample from LIBERO to showcase ROCKET's performance. On the right, we use the outputs from VGGT layers $\{1, 3, 9, 15, 21\}$ to directly predict the depth map, demonstrating that different layers contain rich 3D information. Some of the illustrations were generated by GPT-4o [hurst2024gpt].
Figure 3: Performance on LIBERO across different training stages. See Table \ref{tab:libero_training_steps} for details.
Figure 4: A cone-effect perspective on alignment.
Figure 5: Success rate (%) on LIBERO-Plus. Results are reported under the seven perturbations defined by LIBERO-Plus. Each column shows the average success rate over four task groups: Spatial, Object, Goal, and Long. Detailed results are provided in Appendix \ref{app:libero_plus_details}.
...and 10 more figures
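
The quantity plotted in Figure 1, cosine similarity between gradients induced by different alignment losses, can be computed along the following lines. This is a minimal sketch under stated assumptions: each gradient is taken to be already flattened into a plain list of floats, and the function names and toy vectors are hypothetical rather than taken from the paper's code. Positive entries correspond to gradient coherence and negative entries to gradient interference.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def pairwise_gradient_cosine(grads):
    """Return the matrix S with S[i][j] = cos(grads[i], grads[j]).

    grads[i] is the flattened gradient of the i-th alignment loss with
    respect to the same set of parameters.
    """
    n = len(grads)
    return [[cosine(grads[i], grads[j]) for j in range(n)] for i in range(n)]


# Toy gradients: the first and third point in opposite directions
# (interference); the first and second are orthogonal (no interaction).
toy_grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
sim = pairwise_gradient_cosine(toy_grads)
```

Reading the matrix row by row shows which pairs of alignment losses pull shared parameters in compatible directions and which pairs work against each other.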

Theorems & Definitions (1)

Theorem 8.1: Shared projector induces coherent cross-layer interference under Pre-LN transport
