Adaptive Capacity Allocation for Vision Language Action Fine-tuning

Donghoon Kim; Minji Bae; Unghui Nam; Gyeonghun Kim; Suyun Lee; Kyuhong Shim; Byonghyo Shim

TL;DR

LoRA-SP (Select-Prune) is a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. It improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.

Abstract

Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores $E(k) \ge \eta$, providing a direct link to approximation error via our spectral analysis. During training, $\eta$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($\pi_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
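The active-set rule in the abstract can be illustrated with a minimal sketch. It assumes that $E(k)$ is the cumulative squared score normalized by the total squared score, and that vectors are ranked by energy in descending order; neither detail is spelled out in the abstract. The function name `select_active_set` and the example scores are hypothetical and are not taken from the paper's code.

```python
from typing import List


def select_active_set(scores: List[float], eta: float) -> List[int]:
    """Return indices of the smallest set of vectors whose normalized
    cumulative squared score reaches the energy target eta.

    Assumption: E(k) = sum of the k largest s_i^2 divided by sum of all s_j^2.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    if any(s < 0 for s in scores):
        raise ValueError("router scores are assumed nonnegative")
    energies = [s * s for s in scores]
    total = sum(energies)
    if total == 0.0:
        return []  # no energy to distribute, so nothing is active
    # Rank vectors by energy, largest first (assumed ordering for E(k)).
    order = sorted(range(len(scores)), key=lambda i: energies[i], reverse=True)
    active: List[int] = []
    cumulative = 0.0
    for i in order:
        active.append(i)
        cumulative += energies[i]
        if cumulative / total >= eta:
            break
    return active


# Hypothetical router scores for a bank of four vectors.
scores = [0.1, 3.0, 0.2, 1.0]
print(select_active_set(scores, eta=0.8))  # one dominant vector suffices
print(select_active_set(scores, eta=0.9))  # a second vector is needed
```

Under these assumptions, a higher energy target keeps more vectors active, while a score distribution concentrated on a few directions lets a small active set meet the same target.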

Paper Structure (18 sections, 14 equations, 7 figures, 4 tables)


INTRODUCTION
Theoretical Background and Rationale
Intrinsic Dimension of Fine-tuning
LoRA-MoE and Multi-task Performance
Analysis: Intrinsic Dimension and Spectral Error
Spectral Error in Low Rank Approximation of Gradients
Characteristics of Gradients in VLA Models
LoRA Select-Prune (LoRA-SP)
Overview
Problem Setup and Connection to LoRA
Select: Vector-level Gating
Prune: Active-set Reduction via Spectral Loss
Training Losses
EXPERIMENTS
Experiments Setup
...and 3 more sections

Figures (7)

Figure 1: Rank-performance curves (accuracy/success relative to full fine-tuning; 1.0 = full FT). LLM (LLaMA-7B) reaches near-full-FT performance with very small ranks ($r\in\{4,8\}$), whereas VLA ($\pi_0$-3.5B) improves steadily and only approaches parity around $r\approx128$, consistent with a higher intrinsic dimension in the VLA transfer setting.
Figure 2: VLA models are pre-trained on data spanning diverse manipulation tasks and robot embodiments (e.g., Franka Emika, WidowX, UR5e, Stretch). We compare transfer to a seen embodiment (Franka) with transfer to an unseen embodiment (PiPER). Unseen-embodiment transfer changes the robot's kinematic specification (DoF, link lengths, joint limits), the perception geometry (camera intrinsics/extrinsics, viewpoint), and the workspace scale, which pushes the update toward a higher intrinsic rank. By contrast, when the embodiment is seen, adaptation primarily compensates for perception/scale shifts and works with a lower rank.
Figure 3: Rank sensitivity in single- and multi-task LoRA fine-tuning on the $\pi_0$ model. (a) LoRA modules trained independently on each single task. Single-task modules also require higher ranks to reach full performance, but their variance across tasks is lower than in the multi-task setting. (b) Multi-task LoRA fine-tuning across four manipulation tasks. Success rate increases with rank but exhibits substantial variance across tasks, reflecting interference and heterogeneous capacity needs. Together, the results highlight the difficulty of choosing a single global rank that balances efficiency and accuracy across tasks, motivating rank-adaptive allocation.
Figure 4: Spectral Rank Variation by Embodiment During $\pi_0$ Fine-tuning. Number of singular values required to capture 99% of the total energy (normalized by the full rank) across different layers and modules. We compare $\pi_0$ models fine-tuned on in-domain and out-of-domain data. The in-domain model is fine-tuned on the DROID dataset, which uses a robotic arm (Franka Panda) included in $\pi_0$'s pretraining data. The out-of-domain model is fine-tuned on a dataset collected with the AgileX PiPER robotic arm, an embodiment absent from the pretraining data. The results show that the required rank varies by embodiment, and that generalizing to a novel embodiment demands a higher rank to achieve comparable performance.
Figure 5: LoRA-SP (Select-Prune). (I) Overview: a wide vector bank $(U,V)$ is trained together with a router on the backbone $W_0$. (II) Select: the router produces vector-level scores that act as singular values, forming an input- and layer-conditioned update $\Delta W = U\,\Sigma(x)\,V$; the histogram illustrates the spectral energy distribution across vectors. (III) Prune: only the smallest set of basis vectors whose cumulative energy exceeds the target $\eta$ is kept, progressively reducing the active rank while maintaining accuracy.
...and 2 more figures
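
The update in the Figure 5 caption, $\Delta W = U\,\Sigma(x)\,V$, can be applied to an input without materializing $\Delta W$, using only the vectors that remain active after pruning. The sketch below is illustrative only: it assumes $U$ has shape $(d_{out}, r)$ and $V$ has shape $(r, d_{in})$, treats the router scores as given numbers rather than the output of a learned router, and uses a hypothetical function name `apply_update` that does not come from the paper.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def apply_update(U: Matrix, scores: Sequence[float], V: Matrix,
                 x: Sequence[float], active: Sequence[int]) -> List[float]:
    """Compute (U diag(scores) V) x using only the vectors in `active`.

    Assumed shapes: U is (d_out, r), V is (r, d_in), scores has length r.
    """
    d_out = len(U)
    out = [0.0] * d_out
    for i in active:
        # Project x onto the i-th row of V, then scale by that vector's score.
        coeff = scores[i] * sum(v * xj for v, xj in zip(V[i], x))
        # Accumulate the contribution along the i-th column of U.
        for row in range(d_out):
            out[row] += U[row][i] * coeff
    return out


# Hypothetical bank with r = 3 vectors, d_out = 2, d_in = 2.
U = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
scores = [2.0, 0.5, 0.0]
x = [1.0, 2.0]

full = apply_update(U, scores, V, x, active=[0, 1, 2])
pruned = apply_update(U, scores, V, x, active=[0])
print(full, pruned)
```

In this toy setting the cost of the update scales with the number of active vectors rather than with the bank width, which is the practical motivation for keeping the active set small.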
