OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen; Jing Wu; Yaxiong Wang; Lechao Cheng; Shengeng Tang; Tianrui Hui; Nan Pu; Zhun Zhong

TL;DR

OmniVL-Guard tackles the challenge of unified vision-language forgery detection and grounding across text, images, and videos by addressing a difficulty bias that skews learning toward coarse veracity tasks. It introduces Self-Evolving CoT Generation to produce high-quality reasoning data and the FSFR corpus, paired with Adaptive Reward Scaling Policy Optimization to balance multi-task RL. The framework demonstrates strong in-domain performance and zero-shot robustness on out-of-domain benchmarks, validated through extensive ablations and backbone-transfer experiments. This work provides a scalable, reasoning-enabled approach for detecting and localizing multi-modal forgeries with broad implications for content integrity in social media environments.

Abstract

Existing forgery detection methods are often limited to uni-modal or bi-modal settings and fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirement of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Specifically, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.

Paper Structure (54 sections, 29 equations, 18 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Self-Evolving Forensic CoT Generation
Source Data Collection
Forensic Reasoning Seed Priming
Seed Bootstrapping through Self-Evolution
Collaborative Hard-CoT Synthesis
Dataset Statistics
ARSPO: Balanced Multi-Task RL
Theoretical Analysis and Motivation
ARSPO: Adaptive Reward Shaping Policy Optimization
Experiments
Performance Comparison
Ablation Study
Study of ARSPO in Single-Task Scenarios
...and 39 more sections

Figures (18)

Figure 1: This work explores the task of omnibus vision-language forgery detection and grounding (left). In this unified setting, simple Supervised Fine-Tuning (SFT) cannot achieve coordinated performance improvements. In response, we propose Adaptive Reward Scaling Policy Optimization (ARSPO), which achieves balanced optimization across the detection and grounding tasks (right).
Figure 2: (a) The construction process of $\text{FSFR}_{\text{sft}}$. (b) An example from $\text{FSFR}_{\text{sft}}$. (c) Statistics of the union of $\text{FSFR}_{\text{sft}}$ and $\text{FSFR}_{\text{rl}}$, including the distribution and word clouds.
Figure 3: Impact of different reward mapping functions $g_k(\cdot)$ on the gradient sensitivity term $\frac{g'_k(x_{i,k})}{\sigma}$ in Eq. (4). (A) Linear mapping with $g_k(x)=ax$; (B) Convex mapping with $g_k(x)=e^{ax}$ (where $a=3$). Red bars indicate superior responses with higher rewards $A_{i,k}$ within each query group. The comparison demonstrates that convex mapping significantly amplifies the gradient contribution of superior responses, whereas linear mapping results in nearly uniform sensitivity across responses of varying quality.
Figure 4: Ablation Study on Reward Mapping Steepness and Single-task Scenarios. (a) Impact of parameter $a$ on text localization; (b) Impact of parameter $a$ on image localization; (c) Performance comparison of reward mapping functions in single-task settings.
...and 13 more figures
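
The contrast described in the Figure 3 caption can be illustrated numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes that $\sigma$ is the within-group (population) standard deviation of the mapped rewards, as in group-normalized advantage estimation, and it uses a hypothetical group of four raw rewards. Eq. (4) itself is not reproduced on this page, so the exact form of the term in the paper may differ.

```python
import math
from statistics import pstdev

def sensitivity_linear(raw, a=1.0):
    """g(x) = a*x, so g'(x) = a for every response in the group."""
    mapped = [a * x for x in raw]
    sigma = pstdev(mapped)  # ASSUMPTION: sigma = std of mapped rewards in the group
    return [a / sigma for _ in raw]

def sensitivity_exp(raw, a=3.0):
    """g(x) = exp(a*x), so g'(x) = a*exp(a*x) grows with the raw reward."""
    mapped = [math.exp(a * x) for x in raw]
    sigma = pstdev(mapped)  # ASSUMPTION: same normalization as above
    return [a * math.exp(a * x) / sigma for x in raw]

# Hypothetical raw rewards for four responses to one query (not from the paper).
raw_rewards = [0.1, 0.3, 0.5, 0.9]

lin = sensitivity_linear(raw_rewards)
exp_ = sensitivity_exp(raw_rewards, a=3.0)

for x, s_lin, s_exp in zip(raw_rewards, lin, exp_):
    print(f"x={x:.1f}  linear={s_lin:.3f}  convex={s_exp:.3f}")
```

Under these assumptions the linear mapping yields the same sensitivity for every response, because the slope $a$ cancels against $\sigma$, while the exponential mapping assigns the best response a sensitivity $e^{a(x_{\max}-x_{\min})}$ times that of the worst one, which matches the qualitative behaviour the caption attributes to convex mappings.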
