Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability

Yejun Yoon; Seunghyun Yoon; Kunwoo Park

TL;DR

This work addresses the problem of determining whether a news thumbnail accurately represents the actors in the article text, a task with implications for journalistic integrity and online credibility. It introduces NewsTT, a manually labeled dataset of 1,000 news thumbnail–text pairs with Who-based annotations, and proposes CFT-CLIP, a counterfactual-text-guided contrastive learning framework built on CLIP to improve cross-modal matching for this task. CFT-CLIP generates counterfactual texts by masking named entities and replacing them with plausible alternatives via a masked language model, and optimizes a contrastive objective that pulls the image and the original text together while pushing the image away from the counterfactual text. Empirical results show that CFT-CLIP outperforms standard vision-language models and domain-adapted baselines, with the strongest gains when targeting person-type entities; ablations indicate that counterfactual generation quality and data sources both matter. The approach advances automatic evaluation of news thumbnail representativeness and has potential applications in thumbnail recommendation, misinformation detection, and automated fact verification. The authors acknowledge limitations such as dataset size, potential biases, and the handling of long text.
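The counterfactual generation step described above can be illustrated with a minimal sketch. It assumes the named-entity positions are already known and replaces the masked language model with a toy stand-in; `mask_entities`, `make_counterfactual`, `toy_fill_mask`, and the candidate list are hypothetical names introduced for illustration, not the authors' implementation.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # placeholder token; the real mask token depends on the language model


def mask_entities(tokens: Sequence[str], entity_positions: Sequence[int]) -> List[str]:
    """Replace the tokens at the given positions with the mask placeholder."""
    positions = set(entity_positions)
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def make_counterfactual(
    tokens: Sequence[str],
    entity_positions: Sequence[int],
    fill_mask: Callable[[List[str], int, str], str],
) -> List[str]:
    """Mask the entity tokens, then fill each mask with a replacement token."""
    out = mask_entities(tokens, entity_positions)
    for i in sorted(set(entity_positions)):
        # fill_mask sees the masked context, the position, and the original token
        out[i] = fill_mask(out, i, tokens[i])
    return out


def toy_fill_mask(context: List[str], position: int, original: str) -> str:
    """Hypothetical stand-in for a masked language model.

    Returns the first candidate that differs from the original token, so the
    resulting text names a different entity than the source text.
    """
    candidates = ["Smith", "Garcia", "Chen"]
    for candidate in candidates:
        if candidate != original:
            return candidate
    raise ValueError("no replacement candidate differs from the original token")


tokens = "President Smith met Senator Lee on Monday".split()
counterfactual = make_counterfactual(tokens, [1, 4], toy_fill_mask)
print(" ".join(counterfactual))  # President Garcia met Senator Smith on Monday
```

In a real pipeline the entity positions would come from a named-entity recognizer and the replacements from a masked language model's predictions. The sketch only shows the mask-then-fill structure.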

Abstract

This paper addresses the critical challenge of assessing the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the actors discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of 1,000 news thumbnail image and text pairs. We found that pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, pretrained models may have limited capability to match news actors' visual and textual appearances. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to this hypothesis. We found that our simple method improves performance in assessing news thumbnail representativeness, supporting our hypothesis. Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24.
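As a rough illustration of contrasting a news text with its counterfactual, the sketch below computes an InfoNCE-style loss for a single image embedding, treating the original text as the positive and the counterfactual texts as negatives. This is a simplified assumption for exposition: the function names, the temperature value of 0.07, and the restriction to counterfactual negatives only are choices made here, and the exact objective in the paper may differ (for example, by also using in-batch negatives).

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def counterfactual_contrastive_loss(
    image: Sequence[float],
    text: Sequence[float],
    counterfactual_texts: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the source
) -> float:
    """Negative log-probability of the original text among all candidate texts.

    The loss is small when the image is closer to the original text than to
    any counterfactual text, and large otherwise.
    """
    positive = cosine(image, text) / temperature
    negatives = [cosine(image, c) / temperature for c in counterfactual_texts]
    logits = [positive] + negatives
    # log-sum-exp with max subtraction for numerical stability
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - positive


image_emb = [1.0, 0.0]
matching_text = [1.0, 0.0]
counterfactual_text = [0.0, 1.0]
aligned = counterfactual_contrastive_loss(image_emb, matching_text, [counterfactual_text])
misaligned = counterfactual_contrastive_loss(image_emb, counterfactual_text, [matching_text])
print(aligned < misaligned)  # True
```

Minimizing this quantity with respect to the encoder parameters pulls the image and original text embeddings together while pushing the image away from the counterfactual text, which is the behaviour the hypothesis relies on.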

Paper Structure (37 sections, 2 equations, 6 figures, 10 tables)

Introduction
Related Works
Vision language contrastive pretraining
Multimodal misinformation
Target Problem: Assessing News Thumbnail Representativeness
NewsTT: A Dataset of Thumbnail Representativeness for News Text
Raw data collection
Data annotation
Data analysis
Methods
Background: CLIP
Proposed method: CFT-CLIP
Counterfactual text generation
Training objective
Model architecture
...and 22 more sections

Figures (6)

Figure 1: An illustration of the key idea of the proposed method. To assess whether a news thumbnail image represents the body text, the method generates counterfactual text to be used as negative samples for contrastive updates.
Figure 2: Labeled data examples.
Figure 3: An illustration of the counterfactual text generation process by CFT-CLIP. We select named entity tokens in an original text that could indicate news subjects. A masked language model generates a counterfactual text by predicting new tokens for the selected entity tokens.
Figure 4: Neural architecture of CFT-CLIP
Figure 5: Error examples
...and 1 more figure
