PVI: Plug-in Visual Injection for Vision-Language-Action Models

Zezhou Zhang, Songxin Zhang, Xiao Xiong, Junjie Zhang, Zejian Xie, Jingyi Xi, Zunyao Mao, Zan Mao, Zhixin Mai, Zhuoyang Song, Jiaxing Zhang

Abstract

VLA architectures that pair a pretrained VLM with a flow-matching action expert have emerged as a strong paradigm for language-conditioned manipulation. Yet the VLM, optimized for semantic abstraction and typically conditioned on static visual observations, tends to attenuate fine-grained geometric cues and often lacks explicit temporal evidence for the action expert. Prior work mitigates this by injecting auxiliary visual features, but existing approaches either focus on static spatial representations or require substantial architectural modifications to accommodate temporal inputs, leaving temporal information underexplored. We propose Plug-in Visual Injection (PVI), a lightweight, encoder-agnostic module that attaches to a pretrained action expert and injects auxiliary visual representations via zero-initialized residual pathways, preserving pretrained behavior with only single-stage fine-tuning. Using PVI, we obtain consistent gains over the base policy and a range of competitive alternative injection strategies, and our controlled study shows that temporal video features (V-JEPA2) outperform strong static image features (DINOv2), with the largest gains on multi-phase tasks requiring state tracking and coordination. Real-robot experiments on long-horizon bimanual cloth folding further demonstrate the practicality of PVI beyond simulation.

Paper Structure (27 sections, 6 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Overview of Plug-in Visual Injection (PVI). Typical VLAs condition the VLM on static images with language, providing limited temporal context; moreover, the VLM’s output representations may under-emphasize fine-grained geometric cues. PVI bypasses this bottleneck by injecting auxiliary visual representations directly into the frozen action expert via a trainable plug-in, with no backbone modification and no re-pretraining required.
  • Figure 2: Architecture overview. The frozen main DiT blocks receive semantic embeddings from a frozen VLM. A trainable DiT copy (PVI) conditions on auxiliary visual features and injects them into the main stream via zero-initialized linear projections to produce continuous actions.
  • Figure 3: Overview of PVI and four candidate strategies for injecting auxiliary visual features into the DiT action expert. We compare PVI (ours), which injects V-JEPA2 features via a trainable copy branch with zero-initialized injection layers, against input-level fusion (Concat), attention-level dual injection (ControlVLA-style), and parallel-branch designs with feature concatenation or residual addition (ReferenceNet- and ControlNet-style).
  • Figure 4: Encoder-agnostic visual injection and representation comparison. Per-task success rates and overall average across 10 simulation tasks, comparing a fine-tuned GR00T N1.5 baseline against three PVI instantiations with different auxiliary encoders.
  • Figure 5: PVI enables long-horizon bimanual manipulation of deformable objects on real hardware. The task comprises eight sequential subtasks, each illustrated by three keyframes: sleeve localization and precise pinching (Step 1), global shape adjustment via dragging without wrinkling (Step 2), symmetric bimanual sleeve folding (Step 3), long-horizon edge-to-edge vertical half-fold (Step 4), contact-rich sliding with tension control (Step 5), asymmetric anchoring and collar lifting (Step 6), multi-stage final fold with shape retention (Step 7), and stable lift-and-stack (Step 8). All subtasks are driven by a single PVI-augmented policy conditioned on language instructions, without task-specific engineering or manual resets between steps.
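
The zero-initialized injection described in the abstract and in Figure 2 can be illustrated with a small numerical sketch. This is not the authors' implementation: the toy linear-plus-tanh `block` stands in for a DiT block, adding the auxiliary features to the copy branch's input is an assumed conditioning mechanism, and every name here (`W_main`, `W_copy`, `W_inject`, `forward`) is hypothetical. The sketch only shows why a zero-initialized projection leaves the pretrained output unchanged before fine-tuning, and how the injected branch starts to contribute once that projection moves away from zero.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def block(W, x):
    """Toy stand-in for one DiT block: linear map followed by tanh."""
    return [math.tanh(v) for v in matvec(W, x)]

# Frozen "pretrained" block weights (toy values, assumed for illustration).
W_main = [[1.0, 0.5],
          [-0.5, 1.0]]

# Trainable copy branch, initialised from the frozen block.
W_copy = [row[:] for row in W_main]

# Zero-initialised injection projection: contributes nothing at the start.
W_inject_init = [[0.0, 0.0],
                 [0.0, 0.0]]

def forward(x, aux, W_inject):
    """Main stream plus the copy branch's output, passed through W_inject."""
    main_out = block(W_main, x)
    # Assumption: the copy branch sees the hidden state plus auxiliary features.
    copy_out = block(W_copy, add(x, aux))
    return add(main_out, matvec(W_inject, copy_out))

x = [1.0, 2.0]      # hidden state entering the block
aux = [0.5, -1.0]   # auxiliary visual features (e.g. from a video encoder)

base_out = block(W_main, x)                  # pretrained behaviour
init_out = forward(x, aux, W_inject_init)    # with the plug-in, before training

# Pretend fine-tuning moved the projection slightly away from zero.
W_inject_trained = [[0.1, 0.0],
                    [0.0, 0.1]]
trained_out = forward(x, aux, W_inject_trained)

print("identical at init:", init_out == base_out)
print("changed after update:", trained_out != base_out)
```

Because the residual term is multiplied by an all-zero matrix at initialization, the augmented forward pass reproduces the frozen block's output exactly; only the projection and the copy branch need gradients for the auxiliary features to begin influencing the actions.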