VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Mo Guang, Kaiwen Long, Xinlei Chen, Yi Wu, Chao Yu, Yu Wang

TL;DR

VS-Bench addresses the lack of multimodal evaluation for Vision-Language Models (VLMs) in multi-agent settings by introducing ten vision-grounded environments spanning cooperative, competitive, and mixed-motive dynamics. It formalizes the problem within a Partially Observable Markov Game framework and evaluates models along three dimensions: perception (element recognition accuracy), strategic reasoning (next-action prediction accuracy), and decision-making (normalized episode return). Across fifteen VLMs, results show strong perceptual capabilities but substantial gaps in theory-of-mind and long-horizon planning, with the top model achieving 46.6% next-action prediction accuracy and 31.4% normalized return. The benchmark provides a standardized, open platform for diagnosing failures, analyzing social behaviors, and guiding progress toward robust, strategic multimodal agents.

Abstract

Recent advancements in Vision Language Models (VLMs) have expanded their capabilities to interactive agent tasks, yet existing benchmarks remain limited to single-agent or text-only environments. In contrast, real-world scenarios often involve multiple agents interacting within rich visual and textual contexts, posing challenges with both multimodal observations and strategic interactions. To bridge this gap, we introduce Visual Strategic Bench (VS-Bench), a multimodal benchmark that evaluates VLMs for strategic abilities in multi-agent environments. VS-Bench comprises ten vision-grounded environments that cover cooperative, competitive, and mixed-motive interactions. The performance of VLM agents is evaluated across three dimensions: perception measured by element recognition accuracy; strategic reasoning measured by next-action prediction accuracy; and decision-making measured by normalized episode return. Extensive experiments on fifteen leading VLMs show that, although current models exhibit strong perception abilities, there remains a significant gap to optimal performance in reasoning and decision-making, with the best-performing model attaining 46.6% prediction accuracy and 31.4% normalized return. We further analyze the key factors influencing performance, conduct human experiments, and examine failure modes to provide a deeper understanding of VLMs' strategic abilities. By standardizing the evaluation and highlighting the limitations of existing models, we envision VS-Bench as a foundation for future research on strategic multimodal agents. Code and data are available at https://vs-bench.github.io.
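
To make the three evaluation dimensions concrete, the following is a minimal sketch of how such metrics could be computed. It is not the benchmark's implementation: the function names are hypothetical, exact-match scoring is an assumption, and the normalization shown (linear rescaling between a baseline return and a reference return) is one common convention that this summary does not confirm VS-Bench uses.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of exact matches between predictions and targets.

    Assumption: the same exact-match rule serves both element recognition
    (perception) and next-action prediction (strategic reasoning).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one target is required")
    matches = sum(p == t for p, t in zip(predictions, targets))
    return matches / len(targets)


def normalized_return(
    episode_return: float,
    baseline_return: float,
    reference_return: float,
) -> float:
    """Linearly rescale an episode return.

    Assumption: baseline_return maps to 0 and reference_return maps to 1.
    Both anchors are hypothetical parameters chosen for illustration.
    """
    if reference_return == baseline_return:
        raise ValueError("reference and baseline returns must differ")
    return (episode_return - baseline_return) / (reference_return - baseline_return)
```

Under these assumptions, an agent that predicts two of three opponent actions correctly scores an accuracy of 2/3, and an episode return halfway between the baseline and the reference yields a normalized return of 0.5.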

Paper Structure

This paper contains 79 sections, 29 figures, 11 tables.

Figures (29)

  • Figure 1: Evaluation results of fifteen state-of-the-art VLMs on strategic reasoning and decision-making averaged over ten multi-agent environments in VS-Bench.
  • Figure 2: Overview of VS-Bench, a multimodal benchmark for evaluating VLMs in multi-agent environments. We evaluate fifteen models in ten vision-grounded environments across three dimensions, including perception measured by element recognition accuracy, strategic reasoning measured by next-action prediction accuracy, and decision-making measured by normalized episode return.
  • Figure 3: Comparison of reasoning VLMs on decision-making with multimodal and text-only observations. The solid and dashed vertical lines represent the average results of two settings.
  • Figure 4: Comparison of reasoning VLMs and chat VLMs on decision-making with IO and CoT prompting. The solid, dashed, and dotted vertical lines represent the average results of three settings.
  • Figure 5: Social behaviors of o3 with different personas and the best-performing open-source model in each social dilemma game. Dimensions are agents' behaviors described in Appendix \ref{app:envs}.
  • ...and 24 more figures