To Deceive is to Teach? Forging Perceptual Robustness via Adversarial Reinforcement Learning

Yicheng Bao, Xuhong Wang, Xin Tan

TL;DR

This work introduces AOT-SFT, a large-scale adversarial dataset for bootstrapping MLLM robustness and proposes AOT (Adversarial Opponent Training), a self-play framework that forges MLLM robustness by creating its own training data.

Abstract

Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) exhibit perceptual fragility when confronted with visually complex scenes. This weakness stems from a reliance on finite training datasets, which are prohibitively expensive to scale and impose a ceiling on model robustness. We introduce AOT-SFT, a large-scale adversarial dataset for bootstrapping MLLM robustness. Building on this, we propose AOT (Adversarial Opponent Training), a self-play framework that forges MLLM robustness by creating its own training data. Our method orchestrates a co-evolution between an image-editing Attacker and a Defender MLLM, where the Attacker generates a diverse and dynamic curriculum of image manipulations, forcing the Defender to adapt and improve. Extensive experiments demonstrate that AOT enhances the Defender's perceptual robustness and reduces hallucinations, establishing a scalable paradigm for training more reliable MLLMs.

Paper Structure

This paper contains 48 sections, 3 equations, 13 figures, and 11 tables.

Figures (13)

  • Figure 1: An illustration of the perceptual fragility in existing MLLMs and the robustness fostered by our co-evolutionary framework. Left (Co-evolution Concept): A conceptual depiction of the iterative competition between an Attacker and a Defender model, where each model's capabilities are progressively enhanced. Right (Practical Demonstration): In a simple scene, both a standard MLLM and our Defender can correctly identify the spatial relation, but our model demonstrates superior, detailed perception. When a contextual distractor is introduced, the standard MLLM is perceptually misled by the distractor, causing its subsequent reasoning to fail. In contrast, our Defender, which shares the same base MLLM architecture, grounds its reasoning in a robust perceptual understanding of the scene, a capability directly fostered by our adversarial training process.
  • Figure 2: An overview of our two-stage pipeline for generating the initial adversarial SFT dataset. Stage 1: Scene Extension to Increase Visual Complexity. We begin with a source image and apply outpainting to expand the scene, thereby increasing its complexity. The resulting image is then subjected to a rigorous filtering process, including Composition, Duplication, and Realism Checks, to ensure its quality and logical consistency with the original task. Stage 2: Adversarial Implantation of Semantic Distractors. For an extended image that a target MLLM can correctly interpret, we generate proposals for semantic distractors—new objects and their locations—to be inpainted into the scene. These proposals are validated to prevent spatial overlap or semantic duplication of existing target instances. The final image is added to the SFT dataset only if the implanted distractor is effective, causing the MLLM to fail.
  • Figure 3: Demonstration of the necessity for initial SFT. The base Qwen-Image-Edit model (without SFT), when prompted to generate a semantic distractor, fundamentally misunderstands the instruction. Instead of adding a confusing element, it directly inserts the object mentioned in the question (e.g., adding a bicycle when the question is about a bicycle's location relative to a boat). This behavior highlights its inability to comprehend complex, adversarial instructions, motivating our creation of an initial SFT dataset to explicitly teach this capability.
  • Figure 4: An overview of our iterative attacker-defender co-evolution framework. The process consists of two interconnected training loops. Attacker Evolution: The active attacker ($M_{atk}^{(N)}$) is refined using Flow-GRPO. It learns to generate adversarial edits designed to deceive the frozen, previous-generation defender ($M_{def}^{(N-1)}$). The success of this deception provides the 'Effectiveness Reward' that drives the attacker's policy update. Defender Enhancement: Subsequently, the newly updated attacker generates challenging examples to train the active defender ($M_{def}^{(N)}$). The defender is updated via DAPO based on an 'Accuracy Reward' derived from its performance on these adversarial inputs. This cycle repeats, progressively enhancing the capabilities of both models.
  • Figure 5: Qualitative examples of the diverse attack strategies autonomously discovered by our attacker model. Crucially, these strategies generalized far beyond our initial SFT dataset, which only contained examples of object addition. The figure displays the original image (far left) followed by the results of five distinct attack types. These include imperceptible pixel-level perturbations and four types of perceptible semantic manipulations: object replacement (e.g., changing a blue suitcase to red), object removal (e.g., a green bag), object addition (e.g., a green tag), and a hybrid attack that combines multiple manipulations. For each attack, we show the attacked image and a difference map highlighting the manipulated regions.
  • ...and 8 more figures
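
The alternating schedule described in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `co_evolve`, `effectiveness_reward`, `accuracy_reward`, `update_attacker`, and `update_defender` are hypothetical, the rewards are assumed to be binary, images are represented by plain strings, and the two update callbacks merely stand in for the Flow-GRPO and DAPO policy updates that the caption names.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: strings play the role of images, questions, and answers.
Sample = Tuple[str, str, str]             # (image, question, gold answer)
Defender = Callable[[str, str], str]      # (image, question) -> answer
Attacker = Callable[[str, str], str]      # (image, question) -> edited image


def effectiveness_reward(frozen_defender: Defender, edited: str,
                         question: str, gold: str) -> float:
    """Assumed binary reward: 1.0 when the frozen defender is deceived."""
    return 0.0 if frozen_defender(edited, question) == gold else 1.0


def accuracy_reward(defender: Defender, edited: str,
                    question: str, gold: str) -> float:
    """Assumed binary reward: 1.0 when the defender still answers correctly."""
    return 1.0 if defender(edited, question) == gold else 0.0


def co_evolve(attacker: Attacker, defender: Defender,
              update_attacker: Callable, update_defender: Callable,
              batch: List[Sample], rounds: int):
    """Alternate attacker and defender updates for a number of rounds."""
    history = []
    for n in range(1, rounds + 1):
        # Attacker evolution: the previous-generation defender is held fixed.
        frozen_defender = defender
        edits = [attacker(img, q) for img, q, _ in batch]
        atk_rewards = [effectiveness_reward(frozen_defender, e, q, gold)
                       for e, (_, q, gold) in zip(edits, batch)]
        # Placeholder for the attacker's policy update (Flow-GRPO in the caption).
        attacker = update_attacker(attacker, batch, atk_rewards)

        # Defender enhancement: the updated attacker supplies the hard examples.
        hard = [attacker(img, q) for img, q, _ in batch]
        def_rewards = [accuracy_reward(defender, e, q, gold)
                       for e, (_, q, gold) in zip(hard, batch)]
        # Placeholder for the defender's policy update (DAPO in the caption).
        defender = update_defender(defender, batch, hard, def_rewards)

        history.append((n,
                        sum(atk_rewards) / len(atk_rewards),
                        sum(def_rewards) / len(def_rewards)))
    return attacker, defender, history
```

The point of the sketch is the ordering: in each round the attacker is scored against a defender that does not change during that step, and only afterwards is the defender scored on edits from the updated attacker. The batch must be non-empty, since the per-round averages divide by its length.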