VIXEN: Visual Text Comparison Network for Image Difference Captioning

Alexander Black; Jing Shi; Yifei Fan; Tu Bui; John Collomosse

TL;DR

VIXEN produces state-of-the-art, comprehensible difference captions across diverse image content and edit types, offering a potential mitigation against misinformation spread via manipulated images.

Abstract

We present VIXEN, a technique that succinctly summarizes in text the visual differences between a pair of images in order to highlight any content manipulation present. Our proposed network linearly maps image features in a pairwise manner, constructing a soft prompt for a pretrained large language model. We address the low volume of training data and the lack of manipulation variety in existing image difference captioning (IDC) datasets by training on synthetically manipulated images from the recent InstructPix2Pix dataset, which was generated via the prompt-to-prompt editing framework. We augment this dataset with change summaries produced via GPT-3. We show that VIXEN produces state-of-the-art, comprehensible difference captions for diverse image contents and edit types, offering a potential mitigation against misinformation disseminated via manipulated image content. Code and data are available at http://github.com/alexblck/vixen
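To make the soft-prompt idea concrete, the following is a minimal NumPy sketch of projecting the features of an image pair into a language model's embedding space with a single linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_soft_prompt`, the choice to pair features by concatenating them along the feature axis, and all dimensions are hypothetical, and random arrays stand in for the outputs of a frozen image encoder.

```python
import numpy as np


def build_soft_prompt(feat_a, feat_b, weight, bias):
    """Project paired image features into an LM embedding space.

    feat_a, feat_b: (num_tokens, feat_dim) features of the two images,
        assumed to come from a frozen image encoder.
    weight: (2 * feat_dim, lm_dim) and bias: (lm_dim,) form the linear layer.
    Returns an array of shape (num_tokens, lm_dim).
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("both images must yield features of the same shape")
    # Assumption: "pairwise" is modelled here as feature-axis concatenation.
    paired = np.concatenate([feat_a, feat_b], axis=-1)
    # Single linear map into the LM embedding dimension.
    return paired @ weight + bias


# Illustrative sizes only; real encoders and LMs use far larger dimensions.
num_tokens, feat_dim, lm_dim = 4, 8, 16
rng = np.random.default_rng(0)

feat_original = rng.normal(size=(num_tokens, feat_dim))  # stand-in encoder output
feat_edited = rng.normal(size=(num_tokens, feat_dim))    # stand-in encoder output
weight = rng.normal(size=(2 * feat_dim, lm_dim)) * 0.02  # trainable in practice
bias = np.zeros(lm_dim)

soft_prompt = build_soft_prompt(feat_original, feat_edited, weight, bias)
print(soft_prompt.shape)
```

In a full system, embeddings like `soft_prompt` would be prepended to the language model's token embeddings, with only the projection parameters updated during training while the encoder and language model stay frozen.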

Paper Structure (12 sections, 5 equations, 6 figures, 3 tables)

Introduction
Related Work
Methodology
Data Generation
Architecture
Training
Experiments
Data
Metrics
Results
Limitations
Conclusion

Figures (6)

Figure 1: Visual change summarization produced by VIXEN for original-manipulated image pairs. VIXEN is able to observe both background (left) and main subject (middle) changes, as well as generalize to other datasets (right).
Figure 2: Model architecture and data captioning augmentation pipeline diagram. We use a pre-trained image encoder network $\mathcal{E}$ to produce a representation of two images. Both of these are projected into the input space of a large language model (LM) by a trained linear projection layer $\mathcal{P}$. Frozen layers are marked in blue, trainable in red.
Figure 3: Image-caption pairs grouped by average correspondence score. Score 3 (left): may contain global changes when only local ones are expected (top) or fail to produce the desired edits due to vague captioning (bottom). Score 4 (middle): partially satisfy the caption; occasionally only some properties are realized correctly (top) or an existing object is replaced rather than added to the background (bottom). Score 5 (right): mostly faithful to the depicted edits.
Figure 4: Examples of edit summarizations for global changes, object replacement and material changes produced by VIXEN and CLIP4IDC on InstructPix2Pix (a) and PSBattles (b) datasets. Failure case marked with a dashed red box.
Figure 5: Limitations of the proposed method. Left: image captioning instead of difference captioning when the edit is not identified. Middle: mismatch between target text-image pair and LM runoff. Right: edit described in reverse order.
...and 1 more figure
