CoMaTrack: Competitive Multi-Agent Game-Theoretic Tracking with Vision-Language-Action Models

Youzhi Liu; Li Gao; Liu Liu; Mingyang Lv; Yang Cai

Abstract

Embodied Visual Tracking (EVT), a core dynamic task in embodied intelligence, requires an agent to precisely follow a language-specified target. Yet most existing methods rely on single-agent imitation learning, which requires costly expert data and generalizes poorly because its training environments are static. Inspired by competition-driven capability evolution, we propose CoMaTrack, a competitive game-theoretic multi-agent reinforcement learning framework that trains agents in a dynamic adversarial setting with competitive subtasks, yielding stronger adaptive planning and interference-resilient strategies. We further introduce CoMaTrack-Bench, the first open-source Habitat-based benchmark protocol and episode set for language-conditioned competitive EVT. It features dynamic dueling scenarios between a tracker and adaptive opponents across diverse environments and instructions, enabling standardized robustness evaluation under active adversarial interactions. Experiments show that CoMaTrack achieves state-of-the-art results on both standard benchmarks and CoMaTrack-Bench. Notably, a 3B VLM trained with our framework surpasses previous single-agent imitation learning methods based on 7B models on the challenging EVT-Bench, achieving 92.1% in STT, 74.2% in DT, and 57.5% in AT. The benchmark code will be available at https://github.com/wlqcode/CoMaTrack-Bench.

Paper Structure (22 sections, 5 equations, 5 figures, 3 tables)

Introduction
Related Work
Visual Language Navigation
Embodied Visual Tracking
Reinforcement Learning in VLN
Methods
Task Formulation
CoMaTrack Overview
Supervised Fine-Tuning
Single-Agent RL
Multi-Agent RL
CoMaTrack Benchmark
Training Recipe and Data Collection
Data Collection
Training Recipe
...and 7 more sections

Figures (5)

Figure 1: CoMaTrack frames EVT as a competitive multi-agent game rather than a single-agent pursuit in a static environment. (a) IL: the agent learns from offline demonstrations in static scenes, with limited exposure to rare failures. (b) Single-Agent RL: the agent improves through interaction, but the target and environment remain largely fixed, leading to slow exploration and overfitting to predefined behaviors. (c) Multi-Agent RL: the agent trains against adaptive opponents that deliberately evade or block it, dynamically increasing difficulty and producing diverse adversarial trajectories that encourage anticipation, relocalization, and robustness under interference.
Figure 2: Overview of the CoMaTrack framework. The system employs an end-to-end VLA architecture built upon Qwen2.5VL-3B. During the SFT phase, the model learns to predict future trajectories from multi-view observations and historical visual sequences. In the RL phase, the tracker and opponent agents engage in competitive training within a dynamic adversarial environment, co-evolving robust tracking policies through the GRPO algorithm.
Figure 3: Hardware Platform. Our deployment platform is built on a Unitree Go2 X quadrupedal robot equipped with four monocular RGB cameras and a Unitree 4D LiDAR L2.
Figure 4: Diagram illustrating the impact of opponent strength on outcomes.
Figure 5: Qualitative real-world results demonstrating CoMaTrack’s zero-shot deployment capabilities.
