CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

Youzhi Liu, Li Gao, Liu Liu, Mingyang Lv, Yang Cai

Abstract

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, which suffers from costly expert data and from limited generalization caused by static training environments. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first open-source Habitat-based benchmark protocol and episode set for language-conditioned competitive EVT. It features dynamic dueling scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench.

Paper Structure (22 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: CoMaTrack frames EVT as a competitive multi-agent game rather than a single-agent pursuit in a static environment. (a) IL: the agent learns from offline demonstrations in static scenes, with limited exposure to rare failures. (b) Single-Agent RL: the agent improves through interaction, but the target and environment remain largely fixed, leading to slow exploration and overfitting to predefined behaviors. (c) Multi-Agent RL: the agent trains against adaptive opponents that deliberately evade or block, which dynamically increases difficulty and produces diverse adversarial trajectories, encouraging anticipation, relocalization, and robustness under interference.
  • Figure 2: Overview of the CoMaTrack framework. The system employs an end-to-end VLA architecture built upon Qwen2.5VL-3B. During the SFT phase, the model learns to predict future trajectories from multi-view observations and historical visual sequences. In the RL phase, the tracker and opponent agents engage in competitive training within a dynamic adversarial environment, co-evolving robust tracking policies through the GRPO algorithm.
  • Figure 3: Hardware Platform. Our deployment platform is built on a Unitree Go2 X quadrupedal robot equipped with four monocular RGB cameras and a Unitree 4D LiDAR L2.
  • Figure 4: Diagram Illustrating the Impact of Opponent Strength on Outcomes.
  • Figure 5: Qualitative real-world results demonstrating CoMaTrack’s zero-shot deployment capabilities.
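
The Figure 2 caption names GRPO as the RL algorithm but gives no detail here. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: each rollout's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the reward values are hypothetical and are not taken from the paper, whose reward design and training loop may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    Assumption: this is the generic GRPO-style normalization,
    (r_i - mean) / (std + eps), not the paper's exact formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four tracker rollouts of the same episode.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.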