ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

Puhao Li, Yingying Wu, Ziheng Xi, Wanlin Li, Yuzhe Huang, Zhiyuan Zhang, Yinghan Chen, Jianan Wang, Song-Chun Zhu, Tengyu Liu, Siyuan Huang

TL;DR

ControlVLA addresses the data-efficiency gap in real-world robotic manipulation by fusing a pre-trained Vision-Language-Action (VLA) policy with object-centric representations through a ControlNet-style fine-tuning scheme. Object-centric features are grounded using segmentation tools (GroundingDINO, SAM) and injected via zero-initialized KV projections in a dual-attention mechanism, preserving the base policy while enabling task-specific adaptation. Empirically, it achieves a 76.7% success rate across six real-world tasks with only 10–20 demonstrations, well above the 20.8% reported for the baseline, and it extends to long-horizon tasks and remains robust to unseen objects and backgrounds. The work further shows data-scaling benefits and provides ablations confirming the necessity of pre-training, object-centric representations, and zero-initialization for stable learning, positioning ControlVLA as a practical pathway toward deploying large-scale pre-trained policies in data-scarce, real-world settings.

Abstract

Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To address this gap, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world experiments across 6 diverse tasks, including pouring cubes and folding clothes, our method achieves a 76.7% success rate while requiring only 10–20 demonstrations, a significant improvement over traditional approaches that require more than 100 demonstrations to achieve comparable success. Additional experiments highlight ControlVLA's extensibility to long-horizon tasks and robustness to unseen objects and backgrounds.
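
To make the zero-initialization idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the object-centric branch contributes a second attention term that is added to the base cross-attention output; that additive form is an assumption, since the text above only states that object-centric conditions enter through zero-initialized projection layers. All names (`dual_attention`, `w_k_obj`, `w_v_obj`, and so on) are hypothetical. The sketch shows that with zero-initialized key and value projections the fused output equals the pre-trained output exactly, so fine-tuning starts from the unmodified base policy.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def dual_attention(q, obs, obj, w_k, w_v, w_k_obj, w_v_obj):
    """Base cross-attention plus an object-centric branch (assumed additive)."""
    base = attention(q, matmul(obs, w_k), matmul(obs, w_v))
    extra = attention(q, matmul(obj, w_k_obj), matmul(obj, w_v_obj))
    return [[b + e for b, e in zip(rb, re)] for rb, re in zip(base, extra)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


rng = random.Random(0)
d = 4
q = rand_matrix(2, d, rng)      # action-token queries (toy sizes)
obs = rand_matrix(3, d, rng)    # original observation tokens
obj = rand_matrix(5, d, rng)    # object-centric tokens
w_k, w_v = rand_matrix(d, d, rng), rand_matrix(d, d, rng)  # "pre-trained" weights

# Output of the base policy's attention alone.
base = attention(q, matmul(obs, w_k), matmul(obs, w_v))

# Zero-initialized object projections: the extra branch contributes exactly zero.
fused = dual_attention(q, obs, obj, w_k, w_v, zeros(d, d), zeros(d, d))

# After (simulated) fine-tuning, non-zero projections let object tokens shift the output.
tuned = dual_attention(q, obs, obj, w_k, w_v,
                       rand_matrix(d, d, rng), rand_matrix(d, d, rng))

print(fused == base)
print(tuned != base)
```

In this toy setup the zero value projection maps every object token to a zero vector, so the extra branch outputs zeros regardless of its attention weights. The object-centric pathway can then grow gradually during fine-tuning instead of perturbing the pre-trained behavior at the first step.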

Paper Structure

This paper contains 25 sections, 10 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: ControlVLA bridges pre-trained manipulation policies with object-centric representations via ControlNet-style efficient fine-tuning. ControlVLA requires only 10–20 demonstrations to achieve a 76.7% task success rate, significantly surpassing the baseline's 20.8% success rate.
  • Figure 2: Overview of ControlVLA. ControlVLA leverages a ControlNet-style fine-tuning strategy to integrate object-centric representations with the pre-trained VLA. The zero-initialized weights and biases preserve the rich prior knowledge of the pre-trained policy while progressively grounding it in object-centric representations.
  • Figure 3: Task Visualization. The initial and target states are shown as transparent and solid layers, respectively. The yellow arrow highlights the desired transition.
  • Figure 4: Main Comparison and Ablation Study. All policies are trained or fine-tuned from a shared, limited demonstration dataset for each task. *Octo, ACT, and VIOLA are omitted due to very low success rates, with overall success rates of 1.6%, 5.0%, and 0.0%, respectively.
  • Figure 5: Effect of Data Scaling on Performance in the OrganizeToy Task.
  • ...and 3 more figures