MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides

TL;DR

MetaVLA tackles inefficiencies and brittleness in Vision-Language-Action models by introducing Context-Aware Meta Co-Training, a memory-augmented meta-learning framework that unifies target LIBERO tasks with diverse auxiliary data in a backbone-agnostic post-training setting. Central to the approach is MAR, an Attentive Neural Process–inspired module that conditions action decoding on context via self- and cross-attention, modeling $p(\boldsymbol{y}_{T} \mid \boldsymbol{x}_{T}, \boldsymbol{r}_{T}, z)$ with KL regularization. On LIBERO, MetaVLA with six auxiliary tasks outperforms baselines, reduces training steps from $240{,}000$ to $75{,}000$, and cuts GPU time by about $76\%$, while adding only a minor inference overhead of $0.3$ ms/token. These results demonstrate scalable, low-resource post-training for general-purpose embodied agents, enabling faster convergence and better cross-task generalization. Code will be available to facilitate adoption and extension.

Abstract

Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.
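
The summary above says only that the module conditions action decoding on context through cross-attention and is trained with KL regularization; it gives no implementation details. The following is a minimal NumPy sketch of those two ingredients as they appear in a generic Attentive Neural Process, not the authors' code. All names (`cross_attention`, `kl_diag_gaussians`, `context_x`, `context_r`, `target_x`) and all shapes are hypothetical, and the diagonal-Gaussian form of the latent $z$ is an assumption borrowed from the standard ANP formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each target query attends over the context set."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (n_target, n_context)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ values, weights            # (n_target, d_value)

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    per_dim = np.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    return float(np.sum(per_dim))

rng = np.random.default_rng(0)
d = 8                                           # hypothetical feature width
context_x = rng.normal(size=(5, d))             # keys: hypothetical context inputs
context_r = rng.normal(size=(5, d))             # values: hypothetical context representations
target_x = rng.normal(size=(3, d))              # queries: hypothetical target inputs

# Target-specific context representation r_T, one row per target query.
r_T, attn = cross_attention(target_x, context_x, context_r)

# Assumed diagonal-Gaussian latents: a context-derived prior and a target-derived posterior.
mu_prior, sigma_prior = np.zeros(d), np.ones(d)
mu_post, sigma_post = 0.1 * np.ones(d), 0.9 * np.ones(d)
kl = kl_diag_gaussians(mu_post, sigma_post, mu_prior, sigma_prior)
```

In this sketch, `r_T` plays the role of $\boldsymbol{r}_{T}$ in $p(\boldsymbol{y}_{T} \mid \boldsymbol{x}_{T}, \boldsymbol{r}_{T}, z)$, and `kl` is the kind of term that would be added to the imitation loss as a regularizer. How the actual module combines these with the VLA backbone is not specified in this summary.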

Paper Structure

This paper contains 35 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Three Key Merits of MetaVLA Compared to Baseline Approaches.
  • Figure 2: MetaVLA Architecture. VLA backbone married with Context-Aware Meta Co-Training Framework, where the context memory bank is composed of both in-domain target tasks and out-of-domain auxiliary tasks.
  • Figure 3: Comparison between auxiliary tasks and LIBERO evaluation benchmark. LIBERO tasks use third-person front-view images and 7-DoF actions for a single-arm robot. In contrast, our auxiliary data from GR00T introduces variation through side-view observations and a two-arm robot with 14-DoF actions. MetaVLA benefits from this data diversity, while OpenVLA struggles with the domain mismatch.
  • Figure 4: Left: Per-suite LIBERO success rate across varying context batch sizes. OpenVLA refers to the four Hugging Face baseline models, each fine-tuned individually on LIBERO using the OpenVLA-7B backbone, while SFT-4LIBERO is a single-model baseline trained with vanilla multi-task SFT across all suites. For each suite, success rate increases monotonically with context batch size. Right: Average success rate across LIBERO suites with varying context batch sizes, using the same OpenVLA and SFT-4LIBERO baselines. $b_c$ indicates the context batch size. Larger context batches consistently yield higher average success rates.
  • Figure 5: Training convergence comparison for models trained for 75K steps. Training Accuracy, Imitation Loss, and L1 Loss are compared between MetaVLA variants and SFT-4LIBERO under different auxiliary-task settings. All MetaVLA variants consistently converge to superior performance across all three metrics, while SFT-4LIBERO fails to adapt effectively, highlighting the robustness and scalability of our approach.
  • ...and 8 more figures