Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective

Xiangru Zhu, Penglei Sun, Yaoxian Song, Yanghua Xiao, Zhixu Li, Chengyu Wang, Jun Huang, Bei Yang, Xiaoxiao Xu

TL;DR

Semantic variation caused by word order poses a major challenge for text-to-image synthesis. The authors introduce SemVarEffect, a causal metric defined as $\kappa = \mathbb{E}[\gamma^{I}|do(T\neq T_a)] - \mathbb{E}[\gamma^{I}|do(T=T_a)]$, and SemVarBench, a dataset with permutation-variance and permutation-invariance tests. Evaluations over 13 T2I models reveal high text-image alignment but weak semantic sensitivity (typically $\kappa<0.2$), with cross-modal alignment (UNet/Transformers) playing a crucial role. Fine-tuning yields limited gains and can reduce robustness, underscoring the need to model inter-token relations and cross-modal interaction; code and benchmark are released on GitHub.
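As a rough illustration of the formula above, $\kappa$ can be estimated by replacing each expectation with a sample mean. This is a minimal sketch, not the authors' implementation: it assumes $\gamma^{I}$ is available as a per-sample score of visual semantic variation, and the function name `sem_var_effect` and the example scores are hypothetical.

```python
from statistics import mean


def sem_var_effect(scores_changed, scores_maintained):
    """Estimate kappa = E[gamma | do(T != T_a)] - E[gamma | do(T = T_a)].

    scores_changed:    gamma values observed under semantic-change interventions
    scores_maintained: gamma values observed under semantic-maintenance interventions
    Both are assumed to be non-empty sequences of floats (hypothetical inputs).
    """
    if not scores_changed or not scores_maintained:
        raise ValueError("both score lists must be non-empty")
    # Each expectation is approximated by the mean over the observed samples.
    return mean(scores_changed) - mean(scores_maintained)


# Made-up scores, for illustration only.
kappa = sem_var_effect([0.5, 0.7], [0.1, 0.3])
print(f"kappa = {kappa:.2f}")
```

Under this reading, a model whose images change as much when the meaning is preserved as when it is altered gets $\kappa$ near zero, however well each image matches its prompt.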

Abstract

Accurate interpretation and visualization of human instructions are crucial for text-to-image (T2I) synthesis. However, current models struggle to capture semantic variations from word order changes, and existing evaluations, relying on indirect metrics like text-image similarity, fail to reliably assess these challenges. Their focus on frequent word combinations often obscures poor performance on complex or uncommon linguistic patterns. To address these deficiencies, we propose a novel metric called SemVarEffect and a benchmark named SemVarBench, designed to evaluate the causality between semantic variations in inputs and outputs in T2I synthesis. Semantic variations are achieved through two types of linguistic permutations, while avoiding easily predictable literal variations. Experiments reveal that CogView-3-Plus and Ideogram 2 performed the best, achieving a score of 0.2/1. Semantic variations in object relations are less understood than those in attributes, scoring 0.07/1 compared to 0.17-0.19/1. We found that cross-modal alignment in UNet or Transformers plays a crucial role in handling semantic variations, a factor previously overlooked because of a focus on textual encoders. Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding. Our benchmark and code are available at https://github.com/zhuxiangru/SemVarBench.

Paper Structure

This paper contains 47 sections, 18 equations, 27 figures, 24 tables.

Figures (27)

  • Figure 1: Failed state-of-the-art (SOTA) T2I model examples: different permutations of the same words, different textual semantics, yet similar visual semantics.
  • Figure 2: Framework for measuring semantic variation causality in T2I models. Our evaluation consists of three components: (I) Input Variations with semantic change/maintenance interventions, (II) Visual Semantic Evaluation under both interventions (blue for semantic change, pink for semantic maintenance), and (III) Causal Effect Calculation where SemVarEffect (purple) quantifies the difference between intervention outcomes. For comparison, traditional alignment scores (gray) only measure surface similarity, as shown in the cat-mouse example where high alignment coexists with poor semantic consistency. See the Problem Formulation section for mathematical details.
  • Figure 3: Causal relationship between the input and the output semantic variations.
  • Figure 4: The data collection process of SemVarBench. Top: Templates. Bottom: Generated Sentences. The templates are extracted from the seed pair "a dog is using a wheelchair and the dog is next to a person"/"a person is using a wheelchair and the person is next to a dog".
  • Figure 5: Distribution of semantic variations by category in the SemVarBench test set.
  • ...and 22 more figures
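
The seed pair quoted in the Figure 4 caption can be reproduced by filling one template with two entities and then swapping them. The sketch below is only an illustration of that idea: the slot syntax (`{x}`, `{y}`) and the helper `permutation_pair` are hypothetical and are not taken from the SemVarBench code.

```python
# Hypothetical template in Python format-string syntax, modeled on the
# seed pair quoted in the Figure 4 caption.
TEMPLATE = "a {x} is using a wheelchair and the {x} is next to a {y}"


def permutation_pair(template, first, second):
    """Return (original, swapped) sentences that exchange the two entities."""
    original = template.format(x=first, y=second)
    swapped = template.format(x=second, y=first)
    return original, swapped


original, swapped = permutation_pair(TEMPLATE, "dog", "person")
print(original)
print(swapped)
```

The two sentences share the same vocabulary but describe different scenes, which is the kind of input pair the permutation-variance test targets.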