Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Jeff Da; Maxwell Forbes; Rowan Zellers; Anthony Zheng; Jena D. Hwang; Antoine Bosselut; Yejin Choi

TL;DR

Edited Media Understanding Frames (EMU) address the need to reason about the intent and implications of image edits, not just to detect them. The authors define a six-dimension frame scheme and introduce the EMU dataset, with 56k QA pairs over 8k image pairs collected from Photoshop battles and explanations grounded in image regions. They propose PELICAN, a multimodal Transformer-based model that uses topologically sorted region prioritization for edited-image reasoning; it improves over baselines, though substantial headroom remains compared to humans. The work demonstrates the practicality of generating grounded explanations for disinformation-related edits and highlights directions for future research in commonsense, grounding, and real-world deployment.

Abstract

Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet the vast majority of media edits are harmless, such as a filtered vacation photo. The difference between this example and harmful edits that spread disinformation is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model, PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done: humans prefer human-annotated captions 93.56% of the time, and we provide analysis that highlights areas for further progress.
