Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

Ilia Larchenko, Gleb Zarin, Akash Karnatak

TL;DR

This work tackles long-horizon embodied AI by extending a Vision-Language-Action model (Pi0.5) with correlated noise in flow matching, System 2 stage tracking, and learnable mixed-layer attention to handle the non-Markovian, multi-task setting of BEHAVIOR. It combines task-specific embeddings, delta-action spaces, and multi-sample training to improve efficiency and robustness, and adds inference-time corrections, including correlation-aware inpainting and action compression, achieving a 26% q-score across 50 tasks. The approach emphasizes multi-task learning, recovery behaviors, and practical heuristics (gripper correction rules) to compensate for data gaps and recover from errors, while acknowledging the need for further ablations and longer training. Overall, the method demonstrates a strong practical path toward robust, long-horizon robotic control in complex, diverse tasks, while highlighting remaining challenges in dexterity and generalization.

Abstract

We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation that require bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both public and private leaderboards.

Paper Structure

This paper contains 85 sections, 15 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Overall architecture of our system. The vision backbone (SigLIP) processes images from three cameras, PaliGemma VLM processes visual tokens along with task embeddings and robot state, and the Action Expert predicts actions via flow matching. Task embeddings replace language processing, and System 2 provides stage information for non-Markovian context.
  • Figure 2: Examples of visually similar but semantically different states. (a,b) Radio task: lifting vs. placing back. (c,d) Popcorn task: opening microwave vs. pressing button after loading. Without stage tracking, the model confuses these states.
  • Figure 3: Learned KV transformation weights showing deviation from initialization (identity). Each row corresponds to an action expert layer, each column to a VLM layer. Brighter colors indicate higher deviation.
  • Figure 4: Attention mask structure during training. Blue indicates bidirectional attention, white indicates no attention, and diagonal patterns indicate causal attention.
  • Figure 5: Action correlation matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{690 \times 690}$ computed from training data (30 timesteps $\times$ 23 action dimensions = 690). The matrix shows strong block-diagonal structure indicating high temporal correlation and significant cross-dimensional correlation.
  • ...and 5 more figures
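
The Figure 5 caption describes a 690 x 690 action correlation matrix (30 timesteps x 23 action dimensions), and the abstract names correlated noise for flow matching as the primary contribution. The following is a minimal sketch of how such noise could be drawn, not the authors' implementation. The function names, the mixing weight `beta`, and the blend of the correlation matrix with the identity are assumptions made for illustration; this section does not give the exact formulation.

```python
import numpy as np

HORIZON, ACTION_DIM = 30, 23        # chunk shape taken from the Figure 5 caption
FLAT_DIM = HORIZON * ACTION_DIM     # 690


def estimate_correlation(chunks, eps=1e-6):
    """Estimate the (690, 690) correlation matrix of flattened action chunks.

    chunks: array of shape (N, HORIZON, ACTION_DIM) taken from training data.
    """
    flat = chunks.reshape(len(chunks), -1)
    # Standardize each of the 690 coordinates so the result is a correlation
    # matrix rather than a covariance matrix.
    flat = (flat - flat.mean(axis=0)) / (flat.std(axis=0) + eps)
    return flat.T @ flat / len(flat)


def make_noise_sampler(corr, beta=0.5, rng=None):
    """Return a function that samples noise with covariance
    beta * corr + (1 - beta) * I.

    beta is a hypothetical mixing weight (an assumption of this sketch).
    Blending with the identity keeps the covariance positive definite,
    so the Cholesky factorization below is well defined for beta < 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    cov = beta * corr + (1.0 - beta) * np.eye(corr.shape[0])
    chol = np.linalg.cholesky(cov)  # lower-triangular L with L @ L.T == cov

    def sample(batch_size):
        z = rng.standard_normal((batch_size, corr.shape[0]))
        # Each row z has identity covariance, so z @ L.T has covariance cov.
        return (z @ chol.T).reshape(batch_size, HORIZON, ACTION_DIM)

    return sample
```

In a flow matching training loop, a sampler like this would stand in for the usual independent Gaussian draw at the noise end of the interpolation path, so that noisy action chunks share the temporal and cross-dimensional structure of real ones.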