NewsCaption: Named-Entity aware Captioning for Out-of-Context Media

Anurag Singh, Shivangi Aneja

TL;DR

This work tackles the problem of misinformation by enabling targeted out-of-context captions for images conditioned on textual context tokens. It introduces an end-to-end architecture that fuses named-entity recognition, a relational graph, and a Transformer-based captioning module, leveraging BPE to handle out-of-vocabulary tokens and multimodal features from CLIP and DETR. The approach shows that conditioning on textual input improves caption quality and controllability, achieving improvements over baselines on the COSMOS dataset and supported by qualitative analyses and human evaluation. The findings offer a practical benchmark and a plug-in component for strengthening out-of-context detection and misinformation defenses, while acknowledging ethical considerations and limitations in generalizability.

Abstract

With the increasing influence of social media, online misinformation has grown to become a societal issue. The motivation for our work comes from the threat caused by cheapfakes, where an unaltered image is described using a news caption in a new but false context. The main challenge in detecting such out-of-context multimedia is the unavailability of large-scale datasets. Several detection methods employ randomly selected captions to generate out-of-context training inputs. However, these randomly matched captions are not truly representative of out-of-context scenarios due to inconsistencies between the image description and the matched caption. We aim to address these limitations by introducing a novel task of out-of-context caption generation. In this work, we propose a new method that generates a realistic out-of-context caption given visual and textual context. We also demonstrate that the semantics of the generated captions can be controlled using the textual context. We evaluate our method against several baselines; it improves over the image captioning baseline by 6.2% BLEU-4, 2.96% CIDEr, 11.5% ROUGE, and 7.3% METEOR.

Paper Structure

This paper contains 20 sections, 2 equations, 9 figures, and 7 tables.

Figures (9)

  • Figure 1: The figure shows the image and news caption inputs. The image is pre-processed with object detection and CLIP, and the caption with named-entity recognition, yielding image-level and object-level features along with a named-entity dictionary. The object features are enhanced with a relational graph. The named entities are encoded into text embeddings using byte-pair encoding (BPE). The embeddings act as the input to the encoder of the captioning module. Similarly, the original news caption is tokenized using BPE to form the decoder input during training, and cross-entropy (CE) loss is used to optimize the model.
  • Figure 2: The relational graph module takes as input the CLIP image encodings of the object proposals, which represent the nodes of the graph.
  • Figure 3: The figure shows the test-time inputs to our model: an image and a conditional word token. The image is processed using object detection and CLIP. Byte-pair encoding converts the word tokens into text embeddings. These representations form the input to the encoder of the captioning module. We condition the decoder on a start token that denotes the start of a sentence; it then generates a caption in an auto-regressive fashion.
  • Figure 4: Qualitative comparison of captions generated by the different baseline models. Incorrect attributes included in a caption are underlined. Green highlighting denotes the semantics that the model understands from the image input.
  • Figure 5: Qualitative comparison of the effect of the conditional word tokens on the semantics of the generated captions. The green-highlighted words in a generated caption denote the semantics that the model implicitly learns from the image input.
  • ...and 4 more figures
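
The test-time procedure in the Figure 3 caption (encode the context, seed the decoder with a start token, then generate auto-regressively) can be illustrated with a minimal greedy decoding loop. This is a sketch under stated assumptions, not the authors' implementation: `greedy_decode`, `toy_scores`, the `<s>`/`</s>` token strings, and the fixed caption template are all hypothetical, and the toy scorer stands in for the trained Transformer decoder. Greedy selection is also an assumption; the source does not say which decoding strategy is used.

```python
from typing import Callable, Dict, List

# Hypothetical special tokens; the source only says a start token is used.
START, END = "<s>", "</s>"


def greedy_decode(
    next_token_scores: Callable[[List[str], List[str]], Dict[str, float]],
    context_tokens: List[str],
    max_len: int = 20,
) -> List[str]:
    """Generate one token at a time, each conditioned on the encoder
    context and on every token generated so far."""
    generated = [START]  # the decoder is seeded with the start token
    for _ in range(max_len):
        scores = next_token_scores(context_tokens, generated)
        best = max(scores, key=scores.get)  # greedy: highest-scoring token
        if best == END:
            break
        generated.append(best)
    return generated[1:]  # drop the start token


def toy_scores(context_tokens: List[str], generated: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a trained decoder: follows a fixed
    template whose subject is the first conditional word token."""
    template = [context_tokens[0], "speaks", "at", "a", "rally", END]
    position = len(generated) - 1  # tokens emitted so far, excluding START
    target = template[min(position, len(template) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in set(template)}


if __name__ == "__main__":
    # Changing the conditional word token changes the generated caption.
    print(" ".join(greedy_decode(toy_scores, ["senator"])))
    print(" ".join(greedy_decode(toy_scores, ["activist"])))
```

Swapping the conditional token changes the subject of the output, which mirrors in miniature the controllability that the Figure 5 caption describes. A real model would replace `toy_scores` with a decoder forward pass over BPE token ids.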