Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie

Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.

Paper Structure (26 sections, 7 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Comparison of different approaches for enhancing MLLM safety. (a) Qualitative comparison of generated responses: prior methods (wang2024adashield; gong2025figstep) often exhibit over-defensive behavior, whereas our EchoSafe produces contextually appropriate responses; (b) Quantitative comparison on MM-SafetyBench++: EchoSafe consistently outperforms prior methods in both contextual correctness rate (CCR) and response quality score (QS).
  • Figure 2: An overview of our proposed EchoSafe framework. At each inference step $t$, the model retrieves the top-$k$ most relevant safety insights from the memory bank $\mathcal{M}^{(t-1)}$ based on contextual similarity. The retrieved insights serve as prior safety guidance for responding to the current query. After generating a response, the model performs self-reflection to derive a new safety insight $I^{(t)}$, which is added into the memory together with its corresponding context embedding $\mathbf{e}^{(t)}$ to enable continual evolution.
  • Figure 3: Results on MM-SafetyBench++ using Qwen-2.5-VL with and without memory accumulation. Bar plots represent the contextual correctness rate, while circular markers indicate quality scores. $\Delta$ annotations above the bars highlight the relative gains achieved through memory accumulation across categories.
  • Figure 4: Efficiency comparison using Qwen-2.5-VL (bai2025qwen2.5). We present the average inference time, FLOPs (represented by bubble size), and average contextual correctness rate.
  • Figure A1: Illustrative samples drawn from our MM-SafetyBench++. For each scenario, we show a paired unsafe and safe sample that differ only in the user intent while preserving similar visual contexts. The unsafe subset contains harmful requests (e.g., police impersonation, hate-speech content generation, DDoS development, invasion planning, client deception, or initiating sexually explicit conversations), whereas the safe subset provides benign alternatives aligned with the same contextual themes (e.g., identity verification, respectful communication, defensive cybersecurity training, defensive preparation, ethical client engagement, or healthy online discussions).
  • ...and 6 more figures
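
The retrieve, respond, reflect, store loop described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation. The names `MemoryBank`, `echo_step`, `respond`, and `reflect` are hypothetical, and cosine similarity over plain float vectors is an assumption standing in for the "contextual similarity" mentioned in the caption. The response and reflection steps are passed in as callables so that no particular model API is implied.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; an assumed stand-in for 'contextual similarity'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Hypothetical memory of (context embedding, safety insight) pairs."""

    def __init__(self) -> None:
        self.entries: List[Tuple[List[float], str]] = []

    def retrieve(self, query: Sequence[float], k: int) -> List[str]:
        # Rank stored insights by similarity to the query, keep the top k.
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine(query, entry[0]),
            reverse=True,
        )
        return [insight for _, insight in ranked[:k]]

    def add(self, embedding: Sequence[float], insight: str) -> None:
        self.entries.append((list(embedding), insight))


def echo_step(
    memory: MemoryBank,
    embedding: Sequence[float],
    respond: Callable[[List[str]], str],
    reflect: Callable[[str], str],
    k: int = 2,
) -> str:
    """One inference step t: retrieve from M^(t-1), respond, reflect, store."""
    guidance = memory.retrieve(embedding, k)  # top-k prior safety insights
    response = respond(guidance)              # model call (assumed interface)
    insight = reflect(response)               # self-reflection yields I^(t)
    memory.add(embedding, insight)            # store (e^(t), I^(t))
    return response
```

In this sketch the memory grows by one entry per query, so later queries can draw on insights from earlier ones. A real system would need an embedding model and a prompt template for injecting the retrieved insights, neither of which is specified in the text above.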