Multimodal Safety Evaluation in Generative Agent Social Simulations

Alhim Vera, Karen Sanchez, Carlos Hinojosa, Haidar Bin Hamid, Donghoon Kim, Bernard Ghanem

TL;DR

The paper addresses safety in multimodal generative-agent simulations, where agents must reason about safety across text and visuals. It introduces a reproducible framework with a memory-augmented agent architecture, a Plan Revision Layer, and a Judge Agent, plus a dataset of 1,000 multimodal social scenarios. SocialMetrics quantify plan revisions, unsafe-to-safe conversions, and information diffusion to study safety and social dynamics. Experiments across Claude, GPT-4o mini, and Qwen-VL-2B-Instruct show that while direct multimodal contradictions can be detected, global safety alignment remains challenging, with variable improvements across contexts. This framework enables reproducible research and informs the development of safer multimodal agents in social settings.

Abstract

Can generative agents be trusted in multimodal environments? Despite advances in large language and vision-language models that enable agents to act autonomously and pursue goals in rich settings, their ability to reason about safety, coherence, and trust across modalities remains limited. We introduce a reproducible simulation framework for evaluating agents along three dimensions: (1) safety improvement over time, including iterative plan revisions in text-visual scenarios; (2) detection of unsafe activities across multiple categories of social situations; and (3) social dynamics, measured as interaction counts and acceptance ratios of social exchanges. Agents are equipped with layered memory, dynamic planning, and multimodal perception, and are instrumented with SocialMetrics, a suite of behavioral and structural metrics that quantifies plan revisions, unsafe-to-safe conversions, and information diffusion across networks. Experiments show that while agents can detect direct multimodal contradictions, they often fail to align local revisions with global safety, reaching only a 55 percent success rate in correcting unsafe plans. Across eight simulation runs with three models (Claude, GPT-4o mini, and Qwen-VL), five agents achieved average unsafe-to-safe conversion rates of 75, 55, and 58 percent, respectively. Overall performance ranged from 20 percent in multi-risk scenarios with GPT-4o mini to 98 percent in localized contexts such as fire/heat with Claude. Notably, 45 percent of unsafe actions were accepted when paired with misleading visuals, showing a strong tendency to overtrust images. These findings expose critical limitations in current architectures and provide a reproducible platform for studying multimodal safety, coherence, and social dynamics.
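
The abstract reports unsafe-to-safe conversion rates and acceptance ratios but does not define them here. The following is a minimal sketch of how such ratios could be computed, assuming a conversion rate is the share of initially unsafe plans that are safe after revision, and an acceptance ratio is accepted exchanges over proposed exchanges. The names `PlanRecord`, `conversion_rate`, and `acceptance_ratio` are hypothetical and are not taken from the paper's code, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PlanRecord:
    """Hypothetical log entry for one plan handled by an agent."""
    initially_unsafe: bool      # plan was labeled unsafe before revision
    safe_after_revision: bool   # plan was judged safe after revision


def conversion_rate(records: Iterable[PlanRecord]) -> float:
    """Fraction of initially unsafe plans that end up safe (assumed definition)."""
    unsafe = [r for r in records if r.initially_unsafe]
    if not unsafe:
        return 0.0  # no unsafe plans, so nothing to convert
    converted = sum(1 for r in unsafe if r.safe_after_revision)
    return converted / len(unsafe)


def acceptance_ratio(accepted: int, proposed: int) -> float:
    """Accepted exchanges divided by proposed exchanges (assumed definition)."""
    return accepted / proposed if proposed else 0.0


# Illustrative data only, not results from the paper.
records = [
    PlanRecord(True, True),
    PlanRecord(True, False),
    PlanRecord(True, True),
    PlanRecord(True, True),
    PlanRecord(False, True),  # already safe, excluded from the denominator
]
print(conversion_rate(records))
print(acceptance_ratio(9, 20))
```

Restricting the denominator to initially unsafe plans keeps already-safe plans from inflating the rate. Whether the paper aggregates per agent, per run, or per category is not stated in this section.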

Paper Structure

This paper contains 19 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Overview of the proposed framework for evaluating safety in generative agent environments. The left side illustrates the pipeline: social activity scenarios produce multimodal safe/unsafe plans, which are revised and executed by agents. Metrics such as interaction networks, information spread, conversion rates, and acceptance ratios are logged throughout the simulation. The right side shows the fixed virtual environment where agents (PR, KS, JS, CH, AV) interact.
  • Figure 2: (Left) Four-step pipeline for constructing daily social activity plans: generate unsafe situational categories, expand them into hour-by-hour unsafe plans and produce the corresponding safe plans by rewriting each activity, retrieve paired images for each action in both unsafe and safe plans, and apply human verification to finalize safe/unsafe plan pairs. (Right) Examples of safe (green square) and unsafe (red square) action–image pairs generated by the proposed method.
  • Figure 3: Distribution of the 1,000 unsafe plans across 21 high-level situational categories.
  • Figure 4: Generative agent process with our Plan Revision Layer for supervision and safety evaluation.
  • Figure 5: Agent identity initialization pipeline.
  • ...and 8 more figures