TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models

Georgia Gabriela Sampaio, Ruixiang Zhang, Shuangfei Zhai, Jiatao Gu, Josh Susskind, Navdeep Jaitly, Yizhe Zhang

TL;DR

This work introduces a new evaluation framework called TypeScore to sensitively assess a model's ability to generate images with high-fidelity embedded text by following precise instructions, and argues that this text generation capability serves as a proxy for general instruction-following ability in image synthesis.

Abstract

Evaluating text-to-image generative models remains a challenge, despite the remarkable progress being made in their overall performance. While existing metrics like CLIPScore work for coarse evaluations, they lack the sensitivity to distinguish finer differences as model performance rapidly improves. In this work, we focus on the text rendering aspect of these models, which provides a lens for evaluating a generative model's fine-grained instruction-following capabilities. To this end, we introduce a new evaluation framework called TypeScore to sensitively assess a model's ability to generate images with high-fidelity embedded text by following precise instructions. We argue that this text generation capability serves as a proxy for general instruction-following ability in image synthesis. TypeScore uses an additional image description model and leverages an ensemble dissimilarity measure between the original and extracted text to evaluate the fidelity of the rendered text. Our proposed metric demonstrates greater resolution than CLIPScore in differentiating popular image generation models across a range of instructions with diverse text styles. Our study also evaluates how well these vision-language models (VLMs) adhere to stylistic instructions, disentangling style evaluation from embedded-text fidelity. Through human evaluation studies, we quantitatively meta-evaluate the effectiveness of the metric. We conduct a comprehensive analysis of factors such as text length, captioning models, and current progress towards human parity on this task. The framework provides insights into remaining gaps in instruction-following for image generation with embedded text.

Paper Structure

This paper contains 37 sections, 4 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: When assessing target image generation models $p_\theta$, we provide the model with a set of instructions. These instructions prompt the model to create a set of images $i$ based on specified quoted text within a particular style, alongside some contextual information. We then use a vision-language model $q_\phi$ (e.g., GPT-4o) to extract the text from the generated images, and compute the similarity score between the generated text $\hat{t}$ and the original quote $t$. TypeScore is calculated by averaging the scores obtained from multiple image generations. Common text-image alignment metrics such as CLIPScore produce indistinguishable results for both image generation models under this prompt.
  • Figure 2: Sampled generations from Ideogram using the instructions from TypeInst.
  • Figure 3: When extracting text, OCR tends to introduce errors, while VLMs tend to autocorrect existing errors in the rendered text.
  • Figure 4: Composition of TypeInst dataset.
  • Figure 5: The tutorial UI is split into 3 sections, corresponding to the sections of the annotation task: text fidelity, style fidelity, and overall preference. Each section contains example image pairs that demonstrate potential issues annotators might encounter, along with the correct answers for each scenario.
  • ...and 3 more figures
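
The pipeline described in the Figure 1 caption (extract the rendered text $\hat{t}$ from each generated image, score it against the original quote $t$, then average over generations) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The two similarity measures used here (normalized edit similarity and `difflib`'s matching ratio) are stand-ins chosen for illustration, since this page does not specify the components of the paper's ensemble. The `extract_text` callable is a hypothetical placeholder for the vision-language model $q_\phi$, and the name `type_score_sketch` is likewise an assumption.

```python
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Hashable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(t: str, t_hat: str) -> float:
    """1 minus the edit distance normalized by the longer string; 1.0 means identical."""
    longest = max(len(t), len(t_hat))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(t, t_hat) / longest


def matcher_similarity(t: str, t_hat: str) -> float:
    """difflib's ratio of matching characters, in [0, 1]."""
    return SequenceMatcher(None, t, t_hat).ratio()


def text_fidelity(t: str, t_hat: str) -> float:
    """Illustrative ensemble: the plain average of the two stand-in measures."""
    return mean([edit_similarity(t, t_hat), matcher_similarity(t, t_hat)])


def type_score_sketch(
    quote: str,
    images: Sequence[Hashable],
    extract_text: Callable[[Hashable], str],
) -> float:
    """Average fidelity of the extracted text over several generations for one quote.

    `extract_text` is a hypothetical stand-in for the VLM q_phi.
    """
    return mean(text_fidelity(quote, extract_text(img)) for img in images)


if __name__ == "__main__":
    # Fake "images" keyed by id, with a dict lookup standing in for the VLM.
    fake_vlm_output = {"img_0": "Happy Birthday", "img_1": "Hapy Birthdey"}
    score = type_score_sketch("Happy Birthday", list(fake_vlm_output), fake_vlm_output.get)
    print(f"sketch score: {score:.3f}")
```

In this sketch a higher value means the rendered text is closer to the requested quote. The abstract describes the paper's measure as a dissimilarity, so the direction and scale of the actual metric may differ from this illustration.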