Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi

TL;DR

Edited Media Understanding Frames (EMU) address the need to reason about the intent and implications of image edits, not just detect edits. The authors define a six-dimension frame scheme and introduce the EMU dataset with 56k QA pairs over 8k image pairs, collected from Photoshop battles, grounding explanations in image regions. They propose PELICAN, a multimodal Transformer-based model with topologically sorted region prioritization to handle edited-image reasoning, achieving gains over baselines though substantial headroom remains compared to humans. The work demonstrates the practicality of generating grounded explanations for disinformation-related edits and highlights directions for future research in commonsense, grounding, and real-world deployment.

Abstract

Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.

Paper Structure

This paper contains 28 sections, 6 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Edited Media Understanding Frames. Given a manipulated image and its source, a model must generate natural language answers to a set of open-ended questions. Our questions test the understanding of the what and why behind important changes in the image -- like that subject1 appears to be on good terms with subject2.
  • Figure 2: An example from EMU. Given a source image and its edit, and a list of main subjects in the image, we collect a label $\mathbf{l}$ and natural language responses (response to frame $\mathbf{y}$ and rationale $\mathbf{r}$) to applicable open-ended questions $\mathbf{q}$ covering each of five frames $f \in \mathcal{F}$. We also collect structural annotations $\mathbf{a_i}$ highlighting the edited sections of the image.
  • Figure 3: Statistics for EMU. We consider five question types, which in aggregate require a strong understanding of the image edit. The first three types are subject agnostic, though annotations refer explicitly to subjects through subject tags; two (with subjectX) are subject-specific.
  • Figure 4: Overview of PELICAN. Our model takes as input all regions $\textbf{s}$ from the source image and $\textbf{e}$ from the edited image. We order the regions in $\textbf{e}$ using a topological sort of overlapping boxes, rooted at subject1. The green regions marked with an asterisk are additional regions that were introduced and were labeled by annotators. This ordering allows the model to selectively attend to important image regions when generating an answer to the visual question about subject1.
  • Figure 5: Generation examples from PELICAN, marked with results from human evaluation. PELICAN is able to correctly reference marked figures and is able to infer intent accordingly across each question type.
  • ...and 6 more figures
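
The region ordering mentioned in the Figure 4 caption (a topological sort of overlapping boxes, rooted at subject1) can be illustrated with a minimal sketch. This is not the authors' implementation: the caption does not say how the graph is built, so the sketch assumes that two boxes are connected when they share positive area, that edges point away from the root by breadth-first depth, and that regions not connected to the root are appended at the end. The names `Box`, `overlaps`, and `order_regions` are hypothetical.

```python
from collections import deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = tuple[float, float, float, float]


def overlaps(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes share positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def order_regions(boxes: list[Box], root: int) -> list[int]:
    """Order region indices so the root comes first and every region
    follows at least one overlapping region that is closer to the root."""
    n = len(boxes)

    # Breadth-first depth of each region reachable from the root
    # through chains of overlapping boxes.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in depth and overlaps(boxes[i], boxes[j]):
                depth[j] = depth[i] + 1
                queue.append(j)

    # Orient each overlap edge from the shallower region to the deeper one,
    # which yields a DAG; same-depth overlaps are ignored (an assumption).
    preds: dict[int, set[int]] = {i: set() for i in depth}
    for i in depth:
        for j in depth:
            if depth[i] < depth[j] and overlaps(boxes[i], boxes[j]):
                preds[j].add(i)

    ordered = list(TopologicalSorter(preds).static_order())
    # Regions with no overlap path to the root go last, in index order.
    ordered += [i for i in range(n) if i not in depth]
    return ordered


boxes = [
    (0, 0, 10, 10),        # region 0, treated here as subject1
    (8, 8, 20, 20),        # region 1, overlaps region 0
    (18, 18, 30, 30),      # region 2, overlaps region 1 only
    (100, 100, 110, 110),  # region 3, isolated
]
print(order_regions(boxes, root=0))
```

Under these assumptions the root region always comes first, so a model consuming the sequence would see the subject in question before the regions that surround it.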