Exposing Text-Image Inconsistency Using Diffusion Models

Mingzhen Huang; Shan Jia; Zhou Zhou; Yan Ju; Jialing Cai; Siwei Lyu

TL;DR

This work tackles text-image inconsistency by moving beyond binary classifiers to a diffusion-based localization framework, D-TIIL, which uses text-to-image diffusion models as an "omniscient" knowledge source to align and edit the two modalities. The method is a four-step pipeline that yields pixel-level masks and word-level inconsistent terms, accompanied by a consistency score $r \in [0, 100]$. A new TIIL dataset with 14K curated image-text pairs enables fine-grained evaluation, including manual ground-truth masks and region-level edits guided by diffusion generation. Experiments show that D-TIIL outperforms baselines in both localization and detection tasks, and that denoising-based alignment and careful text embedding updates are crucial for maximizing explainability and accuracy. The approach offers a scalable, evidence-based tool for misinformation research and opens avenues for domain-specific diffusion models to capture nuanced background knowledge.

Abstract

In the battle against widespread online misinformation, a growing problem is text-image inconsistency, where images are misleadingly paired with texts that have a different intent or meaning. Existing classification-based methods for text-image inconsistency can identify contextual inconsistencies but fail to provide explainable justifications for their decisions that humans can understand. Human evaluation, although more nuanced, is impractical at scale and susceptible to errors. To address these limitations, this study introduces D-TIIL (Diffusion-based Text-Image Inconsistency Localization), which employs text-to-image diffusion models to localize semantic inconsistencies in text-image pairs. These models, trained on large-scale datasets, act as "omniscient" agents that filter out irrelevant information and incorporate background knowledge to identify inconsistencies. In addition, D-TIIL uses text embeddings and modified image regions to visualize these inconsistencies. To evaluate D-TIIL's efficacy, we introduce a new TIIL dataset containing 14K consistent and inconsistent text-image pairs. Unlike existing datasets, TIIL enables assessment at the level of individual words and image regions and is carefully designed to represent various inconsistencies. D-TIIL offers a scalable and evidence-based approach to identifying and localizing text-image inconsistency, providing a robust framework for future research combating misinformation.
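
The abstract says inconsistencies are localized at the level of image regions, but this page does not give the computation. The following is only a minimal sketch under stated assumptions: it assumes a mask can be obtained by thresholding the per-pixel difference between two noise predictions (one per text embedding), and it assumes intersection over union (IoU) as one possible way to compare a predicted mask with a manual ground-truth mask. The function names, the `[channel][row][column]` list layout, and the 0.5 threshold are hypothetical and do not come from the paper.

```python
from typing import List

Mask = List[List[int]]
Prediction = List[List[List[float]]]  # assumed layout: [channel][row][column]


def difference_mask(pred_a: Prediction, pred_b: Prediction,
                    threshold: float = 0.5) -> Mask:
    """Binary mask from the channel-averaged absolute difference of two predictions.

    Assumption: pred_a and pred_b stand in for noise predictions conditioned on
    two different text embeddings; the threshold value is illustrative only.
    """
    channels = len(pred_a)
    height, width = len(pred_a[0]), len(pred_a[0][0])
    # Mean absolute difference over channels, one value per pixel.
    diff = [
        [sum(abs(pred_a[c][y][x] - pred_b[c][y][x]) for c in range(channels)) / channels
         for x in range(width)]
        for y in range(height)
    ]
    lo = min(min(row) for row in diff)
    hi = max(max(row) for row in diff)
    if hi == lo:
        # Identical difference everywhere: nothing stands out, so the mask is empty.
        return [[0] * width for _ in range(height)]
    # Min-max normalize to [0, 1], then binarize.
    return [[1 if (v - lo) / (hi - lo) >= threshold else 0 for v in row] for row in diff]


def iou(predicted: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    pairs = [(p, t) for p_row, t_row in zip(predicted, truth)
             for p, t in zip(p_row, t_row)]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0
```

Working on plain nested lists keeps the sketch free of dependencies; with real model outputs an array library would be the more natural choice.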

Paper Structure (14 sections, 2 equations, 108 figures, 10 tables)

Introduction
Backgrounds
Related Works
Text-to-image Diffusion Models
Method
TIIL Dataset
Experiments
Settings
Comparison with Existing Methods
Ablation Studies
Failure Cases
Conclusion
TIIL Dataset
Additional Ablation Studies

Figures (108)

Figure 1: Exposing text-image inconsistency based on previous methods and our method. Instead of employing a binary classification model, D-TIIL offers interpretable evidence by localizing word- and pixel-level inconsistencies and quantifying them through a consistency score.
Figure 2: The overall pipeline of D-TIIL. See texts for details.
Figure 3: The main process of D-TIIL, illustrated conceptually with Venn diagrams in which the semantic contents of the text and the image are represented as two circles. The four steps gradually align the semantic contents to expose the inconsistency. Given an initial image-text pair $(I, E_0)$, the proposed method first produces a text embedding $E_{aln}$ that is aligned with $I$, and then an edited image $I_{edt}$ that filters out the inconsistency. In Step 3, the model optimizes $E_0$ using $I_{edt}$ to obtain $E_{dnt}$, which is aligned with $I_{edt}$. Finally, in Step 4, the model produces the inconsistency mask from the well-aligned inputs $(I, E_{aln}, E_{dnt})$.
Figure 4: Pipeline depicting the generation and annotation process of the proposed TIIL dataset.
Figure : COSMOS
...and 103 more figures
