Omni-IML: Towards Unified Image Manipulation Localization

Chenfan Qu, Yiwu Zhong, Fengjun Guo, Lianwen Jin

TL;DR

Omni-IML tackles the lack of cross-task generalization in image manipulation localization by introducing a generalist model that adapts encoding and decoding per sample while enhancing tampered-region features through anomaly supervision. It couples a Modal Gate Encoder, Anomaly Enhancement, and Dynamic Weight Decoder with an interpretation module that leverages a reference visual prompt and a chain-of-thought annotated Omni-273k dataset for natural-language artifact descriptions. Across natural, document, face, and scene text IML, a single model achieves state-of-the-art results without task-specific fine-tuning, and ablations validate the contribution of each component. The work advances practical, scalable image forensics and paves the way for future generalist approaches, with code and Omni-273k to be released publicly.

Abstract

Existing Image Manipulation Localization (IML) methods rely heavily on task-specific designs, so they perform well only on their target IML task, while joint training on multiple IML tasks causes significant performance degradation, hindering real-world applications. To this end, we propose Omni-IML, the first generalist model designed to unify IML across diverse tasks. Specifically, Omni-IML achieves generalization through three key components: (1) a Modal Gate Encoder, which adaptively selects the optimal encoding modality per sample, (2) a Dynamic Weight Decoder, which dynamically adjusts decoder filters to the task at hand, and (3) an Anomaly Enhancement module that leverages box supervision to highlight the tampered regions and facilitate the learning of task-agnostic features. Beyond localization, to support interpretation of tampered images, we construct Omni-273k, a large, high-quality dataset that includes natural language descriptions of tampering artifacts. It is annotated through our automatic chain-of-thought annotation technique. We also design a simple yet effective interpretation module to better utilize these descriptive annotations. Our extensive experiments show that a single Omni-IML model achieves state-of-the-art performance across all four major IML tasks, providing a valuable solution for practical deployment and a promising direction for generalist models in image forensics. Our code and dataset will be publicly available.
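
To make the per-sample adaptation described above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that "selecting an encoding modality" can be approximated by a softmax gate over candidate feature vectors, and that "adjusting decoder filters" can be approximated by mixing a small bank of filters with per-sample coefficients. All function names, the gate scores, and the filter bank are hypothetical; the Anomaly Enhancement module is not covered.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modal_gate(features_by_modality, gate_scores, hard=True):
    """Select (hard) or blend (soft) one feature vector for a single sample.

    features_by_modality: equal-length feature vectors, one per candidate
        encoding modality (hypothetical inputs).
    gate_scores: one score per modality, assumed to be predicted per sample.
    """
    weights = softmax(gate_scores)
    if hard:
        # Keep only the modality with the highest gate weight.
        best = max(range(len(weights)), key=weights.__getitem__)
        return list(features_by_modality[best])
    dim = len(features_by_modality[0])
    # Soft gate: weighted average of the modality features.
    return [
        sum(w * f[i] for w, f in zip(weights, features_by_modality))
        for i in range(dim)
    ]


def dynamic_weight_decode(pixel_features, filter_bank, mixing_scores):
    """Score each pixel with a filter that is re-mixed for every sample.

    filter_bank: candidate filters, each a vector matching the feature size.
    mixing_scores: per-sample scores; their softmax gives mixing coefficients.
    Returns one tamper probability per pixel.
    """
    coeffs = softmax(mixing_scores)
    dim = len(filter_bank[0])
    # Build the sample-specific filter as a convex combination of the bank.
    sample_filter = [
        sum(c * f[i] for c, f in zip(coeffs, filter_bank)) for i in range(dim)
    ]
    # Apply it like a 1x1 convolution, then squash with a sigmoid.
    logits = [sum(w * x for w, x in zip(sample_filter, px)) for px in pixel_features]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]
```

In this sketch the same network parameters (the filter bank) serve every task, while the gate and mixing scores change per sample, which is one simple way to read the "adapts encoding and decoding per sample" idea.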

Paper Structure

This paper contains 18 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The proposed Omni-IML is the first generalist model for IML. A single model simultaneously achieves state-of-the-art performance on multiple major IML tasks, without task-specific or benchmark-specific fine-tuning.
  • Figure 2: The overall framework of the proposed Omni-IML.
  • Figure 3: The proposed Modal Gate (left), Anomaly Enhancement Module (middle) and Dynamic Weight Decoder (right).
  • Figure 4: The proposed Chain-of-Thoughts Pipeline.
  • Figure 5: The proposed Chain-of-Thoughts Pipeline.