Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
Ilia Larchenko, Gleb Zarin, Akash Karnatak
TL;DR
This work tackles long-horizon embodied AI by extending a Vision-Language-Action model (Pi0.5) with correlated noise in flow matching, System 2 stage tracking, and learnable mixed-layer attention to handle the non-Markovian, multi-task setting of BEHAVIOR. It combines task-specific embeddings, delta-action spaces, and multi-sample training to improve efficiency and robustness, and introduces inference-time corrections, including correlation-aware inpainting and action compression, reaching a 26% q-score across 50 tasks. The approach emphasizes multi-task learning, recovery behaviors, and practical heuristics (gripper correction rules) to compensate for data gaps and to recover from errors, while acknowledging the need for further ablations and longer training. Overall, the method offers a practical path toward robust, long-horizon robotic control in complex, diverse tasks, while highlighting remaining challenges in dexterity and generalization.
Abstract
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation that require bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
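As a rough illustration of the training-side ideas named in the abstract, the sketch below draws temporally correlated noise for an action chunk and averages a flow-matching loss over several noise and time samples for the same target. It is not the authors' implementation: the AR(1) correlation model, the parameter `rho`, the linear interpolation path, and the helper names `correlated_noise` and `multi_sample_loss` are all assumptions made for illustration, and a single scalar action dimension stands in for the full action space.

```python
import math
import random


def correlated_noise(horizon, rho, rng):
    """Sample unit-variance AR(1) noise over `horizon` steps.

    Assumption for illustration: corr(eps[i], eps[j]) = rho ** |i - j|,
    with -1 < rho < 1. The abstract does not specify the correlation structure.
    """
    eps = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)  # keeps each step at unit variance
    for _ in range(horizon - 1):
        eps.append(rho * eps[-1] + scale * rng.gauss(0.0, 1.0))
    return eps


def multi_sample_loss(predict_velocity, actions, rho, num_samples, rng):
    """Average a flow-matching loss over several (noise, t) draws.

    Assumed path: x_t = (1 - t) * noise + t * actions, so the target
    velocity is (actions - noise). `predict_velocity` is a hypothetical
    stand-in for the policy's action head.
    """
    total = 0.0
    for _ in range(num_samples):
        noise = correlated_noise(len(actions), rho, rng)
        t = rng.random()
        x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, actions)]
        target = [a - n for n, a in zip(noise, actions)]
        pred = predict_velocity(x_t, t)
        total += sum((p - u) ** 2 for p, u in zip(pred, target)) / len(actions)
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    chunk = [0.1 * i for i in range(8)]  # toy 8-step action chunk

    def zero_model(x_t, t):
        # Placeholder predictor that always outputs zero velocity.
        return [0.0] * len(x_t)

    print(multi_sample_loss(zero_model, chunk, rho=0.9, num_samples=4, rng=rng))
```

Averaging over several draws per example reuses one (typically expensive) observation encoding for multiple loss terms, which is one plausible reading of how multi-sample training reduces gradient variance; the actual number of samples and noise model used in the solution are not stated in this section.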
