Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG

Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni, Roman Vainshtein, Andrés Murillo, Hisashi Kojima, Motoyoshi Sekiya, Yuki Unno, Junichi Suga

TL;DR

MMA-RAG^T is an inference-time control framework governed by a Modular Trust Agent, which maintains an approximate belief state via structured LLM reasoning. The framework mediates a configurable set of internal checkpoints to enforce stateful defence-in-depth.

Abstract

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components. We formulate this security challenge as a Partially Observable Markov Decision Process (POMDP), where adversarial intent is a latent variable inferred from noisy multi-stage observations. We introduce MMA-RAG^T, an inference-time control framework governed by a Modular Trust Agent (MTA) that maintains an approximate belief state via structured LLM reasoning. Operating as a model-agnostic overlay, MMA-RAG^T mediates a configurable set of internal checkpoints to enforce stateful defence-in-depth. Extensive evaluation on 43,774 instances demonstrates a 6.50x average reduction factor in Attack Success Rate relative to undefended baselines, with negligible utility cost. Crucially, a factorial ablation validates our theoretical bounds: while statefulness and spatial coverage are individually necessary (26.4 pp and 13.6 pp gains respectively), stateless multi-point intervention can yield zero marginal benefit under homogeneous stateless filtering when checkpoint detections are perfectly correlated.
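The contrast between stateless filtering and stateful trust inference described in the abstract can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the paper's MTA maintains its belief state through structured LLM reasoning, whereas the sketch below substitutes a hypothetical scalar suspicion score with a noisy-OR update. The names `BeliefState`, `decide`, the thresholds, and the per-checkpoint evidence values are all invented for the example; only the decision set (Approve, Mitigate, Refuse) and the checkpoint labels C1-C5 come from the source.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    APPROVE = "approve"
    MITIGATE = "mitigate"
    REFUSE = "refuse"


@dataclass
class BeliefState:
    """Hypothetical stand-in for the MTA belief: a suspicion score in [0, 1]."""
    suspicion: float = 0.0
    history: list = field(default_factory=list)

    def update(self, checkpoint: str, evidence: float) -> None:
        # Noisy-OR accumulation (an assumption, not the paper's update rule):
        # several weak signals compound into a strong one.
        self.suspicion = 1.0 - (1.0 - self.suspicion) * (1.0 - evidence)
        self.history.append((checkpoint, evidence))


def decide(score: float, mitigate_at: float = 0.4, refuse_at: float = 0.7) -> Decision:
    """Map a suspicion score to a gating decision (thresholds are illustrative)."""
    if score >= refuse_at:
        return Decision.REFUSE
    if score >= mitigate_at:
        return Decision.MITIGATE
    return Decision.APPROVE


def run_stateless(trace):
    # Each checkpoint judges its own artifact in isolation.
    return [decide(evidence) for _, evidence in trace]


def run_stateful(trace):
    # Each checkpoint judges the belief accumulated over all artifacts so far.
    belief = BeliefState()
    decisions = []
    for checkpoint, evidence in trace:
        belief.update(checkpoint, evidence)
        decisions.append(decide(belief.suspicion))
    return decisions


# A distributed attack: every fragment looks only mildly suspicious on its own.
trace = [("C1", 0.3), ("C2", 0.3), ("C3", 0.3), ("C4", 0.3), ("C5", 0.3)]

stateless_decisions = run_stateless(trace)
stateful_decisions = run_stateful(trace)
```

In this toy trace the stateless filter approves every fragment because none crosses a threshold alone, while the stateful variant escalates as evidence accumulates and ends in a refusal. The sketch says nothing about how well an LLM-based belief update would behave in practice.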

Paper Structure (59 sections, 2 theorems, 4 equations, 2 figures, 5 tables, 2 algorithms)

This paper contains 59 sections, 2 theorems, 4 equations, 2 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

Let $\Pi_\Omega = \{\pi : \Omega \to \mathcal{A}\}$ denote the set of stateless (observation-memoryless) policies and $\Pi_\Phi = \{\pi : \Omega \times \Phi \to \mathcal{A}\}$ denote belief-conditioned policies. Define the value function $V(\Pi) \triangleq \sup_{\pi \in \Pi}\, \mathbb{E}_\pi[J]$. Then $V(\Pi_\Phi) \geq V(\Pi_\Omega)$, and under partial observability the inequality can be strict, e.g., when there exist histories $h_t$ ...

Figures (2)

  • Figure 1: The MMA-RAG$^{\textsc{T}}$ architecture. The MTA ($\mathcal{A}_{\mathrm{trust}}$) intercepts artifacts at a set of checkpoints (C1-C5 in our evaluated instantiation), executes Algorithm \ref{alg:mta} at each, and maintains a cumulative belief state $\Phi_t$. Decisions $\delta_t \in \{\textup{Approve}, \textup{Mitigate}, \textup{Refuse}\}$ gate artifact passage through the pipeline.
  • Figure 2: Experimental results overview. (a) ASR across all ART-SafeBench benchmarks: the MTA reduces ASR on every surface, with reduction factors ranging from $1.3\times$ (B4, tool-flip) to $14.4\times$ (B3, direct query); mean factor $6.50\times$. (b) Factorial ablation on B1: statefulness contributes $-26.4$ pp and multi-stage coverage $-13.6$ pp; the two mechanisms are individually necessary and their combination yields super-additive gains, validating Propositions 1 and 2. (c) Cross-LLM generalisation on B1: defence transfers across four backbones ($2.5\times$ to $9.5\times$ reduction), with invariant relative difficulty ordering.

Theorems & Definitions (3)

  • Definition 1: Adversarial Agentic RAG POMDP
  • Proposition 1: Strict Value of Memory
  • Proposition 2: Checkpoint Correlation Structure
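The abstract's statement that stateless multi-point intervention can yield zero marginal benefit when checkpoint detections are perfectly correlated can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes identical stateless detectors with a made-up per-checkpoint detection probability `p_detect`, and it compares the two extreme cases of fully independent and perfectly correlated detections. The function name and all parameter values are hypothetical.

```python
import random


def attack_success_rate(p_detect, k, correlated, trials=20_000, seed=0):
    """Fraction of simulated attacks that evade all k stateless checkpoints.

    correlated=True : every checkpoint shares one detection outcome per attack.
    correlated=False: every checkpoint draws its outcome independently.
    """
    rng = random.Random(seed)
    evaded = 0
    for _ in range(trials):
        if correlated:
            shared = rng.random() < p_detect
            detections = [shared] * k
        else:
            detections = [rng.random() < p_detect for _ in range(k)]
        # The attack succeeds only if no checkpoint flags it.
        if not any(detections):
            evaded += 1
    return evaded / trials


# Illustrative detection probability of 0.5 per checkpoint.
asr_corr_1 = attack_success_rate(0.5, k=1, correlated=True)
asr_corr_5 = attack_success_rate(0.5, k=5, correlated=True)
asr_ind_5 = attack_success_rate(0.5, k=5, correlated=False)
```

With perfectly correlated detections, five checkpoints evade exactly as often as one, since the extra checkpoints only repeat the same verdict. With independent detections the evasion rate falls toward $(1 - p)^k$. Real checkpoints sit somewhere between these extremes, which is the regime the proposition on checkpoint correlation structure addresses.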