MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

Shahil Shaik, Aditya Parameshwaran, Anshul Nayak, Jonathon M. Smereka, Yue Wang

Abstract

Multi-agent reinforcement learning (MARL) commonly relies on a centralized critic to estimate the value function. However, learning such a critic from scratch is highly sample-inefficient, and the resulting critic often generalizes poorly across environments. At the same time, large vision-language-action models (VLAs) trained on internet-scale data exhibit strong multimodal reasoning and zero-shot generalization capabilities, yet directly deploying them for robotic execution remains computationally prohibitive, particularly in heterogeneous multi-robot systems with diverse embodiments and resource constraints. To address these challenges, we propose Multi-Agent Vision-Language-Critic Models (MA-VLCM), a framework that replaces the learned centralized critic in MARL with a pretrained vision-language model fine-tuned to evaluate multi-agent behavior. MA-VLCM acts as a centralized critic conditioned on natural language task descriptions, visual trajectory observations, and structured multi-agent state information. By eliminating critic learning during policy optimization, our approach significantly improves sample efficiency while producing compact execution policies suitable for deployment on resource-constrained robots. Results show good zero-shot return estimation across models with different VLM backbones, on both in-distribution and out-of-distribution scenarios, in multi-agent team settings.
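
To make the role of a critic that is not updated during policy optimization concrete, the following minimal sketch turns value estimates from a fixed critic into one-step advantages for a policy-gradient update. It is an illustration under stated assumptions, not the paper's training procedure: the `critic` callable, the discount factor `gamma`, and the choice of a one-step temporal-difference advantage are all hypothetical and are not specified in this abstract.

```python
from typing import Callable, List, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of rewards: the long-term return a critic approximates."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def td_advantages(
    states: Sequence[object],
    rewards: Sequence[float],
    critic: Callable[[object], float],
    gamma: float,
) -> List[float]:
    """One-step TD advantages computed from a frozen critic.

    `states` must hold len(rewards) + 1 entries (the final entry is the
    state reached after the last reward). The critic is only queried,
    never trained, inside this function.
    """
    if len(states) != len(rewards) + 1:
        raise ValueError("expected len(states) == len(rewards) + 1")
    values = [critic(s) for s in states]
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# Toy usage with a hypothetical critic that reads the value off the state.
toy_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
toy_advantages = td_advantages(
    states=[0.0, 1.0, 2.0],
    rewards=[1.0, 1.0],
    critic=lambda s: float(s),
    gamma=0.5,
)
```

In this sketch the only learned object during policy optimization would be the policy itself; the critic is an opaque function, which is the property the abstract attributes to MA-VLCM.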

Paper Structure (22 sections, 17 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overall architecture of MA-VLCM, showing the training and inference directions. The multi-modal, multi-agent MA-VLCM dataset ($\mathcal{D}$) supplies trajectories of vision data $\mathbf{I}_t$, text $\ell$, agent observations $\mathbf{o}_t$, and actions $\mathbf{a}_t$ to train the VLCM. The trained components are the GAT, the value head, and the LoRA adaptors on the vision encoder and language model. The output embeddings $\mathbf{e}_t$ from the encoders and the output of the GAT are concatenated before being passed to the language model. The value head outputs the estimated long-term return of the trajectory, given the language prompt supplied at inference.
  • Figure 2: (a) The Robotic Warehouse (RWARE) 3D environment created in Isaac Sim. (b) A rasterized semantic rendering of the rware-4ag grid environment with 4 agents and 4 requested boxes (in green). (c) The equivalent BEV camera image collected from the environment shown in (a).
  • Figure 3: An unstructured offroad environment created in a high-fidelity simulator with 3 Clearpath Jackal robots (left), and the corresponding rasterized top-down view (right), in which the agents appear as color-coded markers moving to their target locations on a traversability map generated from the high-fidelity simulator.
  • Figure 4: Value Estimation on IID vs OOD for RWARE environment with 0.5B VLCM
  • Figure 5: Value Estimation on IID vs OOD for Offroad environment with 0.5B VLCM
  • ...and 1 more figure
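
The data flow described in the Figure 1 caption (encoder embeddings and GAT output concatenated, passed through the language model, then mapped to a scalar by the value head) can be sketched at the shape level as follows. This is a toy illustration only: the concatenation axis, the mean-pooling stand-in for the language model, the linear form of the value head, and every name and number below are assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]


def concat_tokens(visual_tokens: List[Vector], graph_tokens: List[Vector]) -> List[Vector]:
    """Join two token sequences along the sequence axis (an assumed choice)."""
    width = len(visual_tokens[0])
    for tok in visual_tokens + graph_tokens:
        if len(tok) != width:
            raise ValueError("all tokens must share one embedding width")
    return visual_tokens + graph_tokens


def stand_in_language_model(tokens: List[Vector]) -> Vector:
    """Placeholder for the language model: mean-pools tokens into one vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def value_head(hidden: Vector, weights: Vector, bias: float) -> float:
    """Linear head mapping the pooled hidden state to a scalar return estimate."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


visual = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical vision-encoder tokens
graph = [[2.0, 2.0]]               # one hypothetical GAT output token
fused = concat_tokens(visual, graph)
estimate = value_head(stand_in_language_model(fused), weights=[0.5, 0.5], bias=0.0)
```

A real implementation would replace the pooling stand-in with the LoRA-adapted language model; the sketch only shows where the concatenation and the scalar output sit in the pipeline.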