ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis

Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee

TL;DR

The paper tackles the challenge of generating accurate images from abstractive news captions, whose high-level context and Named Entities (NEs) are hard to ground visually. It introduces ANCHOR, a large-scale dataset of abstractive captions, split into Non-Entity and Entity subsets to stress-test caption understanding and NE generation. The core contribution is Subject-Aware Fine-Tuning (SAFE), which uses LLM-derived subject weights to reweight caption tokens and steer diffusion-based Text-to-Image (T2I) generation toward the intended subjects. It is complemented by Domain Fine-Tuning (DFE), which aligns the model to the news-domain distribution using a reward-driven objective and tuned optimization. In the reported experiments, SAFE improves image-caption alignment and realism (as measured by $\mathrm{FID}_{\mathrm{CLIP}}$, ImageReward, and HPS V2) and is preferred over baselines by human evaluators, indicating improved Natural Language Understanding (NLU) in T2I models and practical potential for journalism-support tools. The work highlights ongoing challenges in NE generation and points toward richer evaluation metrics and entity-focused fine-tuning as future directions.

Abstract

Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations. With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By launching the ANCHOR dataset, we hope to motivate research in furthering the Natural Language Understanding (NLU) capabilities of T2I models.
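
To make the idea of subject weighting concrete, the following is a minimal sketch of how per-subject weights could be mapped onto caption tokens and used to scale token embeddings before conditioning a generator. It is not the authors' implementation: the function names, the whitespace tokenization, the example weights, and the choice to scale embeddings multiplicatively are all assumptions made for illustration. In practice the weights would come from an LLM and the embeddings from a text encoder.

```python
from typing import Dict, List


def subject_token_weights(
    tokens: List[str],
    subject_weights: Dict[str, float],
    default: float = 1.0,
) -> List[float]:
    """Assign each caption token a weight (hypothetical helper).

    Tokens that fall inside a matched subject phrase receive that
    subject's weight; all other tokens keep the default weight.
    """
    weights = [default] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase, weight in subject_weights.items():
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        # Slide a window over the caption to find the subject phrase.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == words:
                for j in range(i, i + n):
                    # If subjects overlap, keep the larger weight.
                    weights[j] = max(weights[j], weight)
    return weights


def reweight_embeddings(
    embeddings: List[List[float]],
    weights: List[float],
) -> List[List[float]]:
    """Scale each token embedding by its weight (assumed scheme)."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per token embedding")
    return [[w * x for x in vec] for vec, w in zip(embeddings, weights)]


# Illustrative caption and weights; both are made up for this sketch.
caption = "Protesters gather outside the parliament building".split()
llm_weights = {"protesters": 1.5, "parliament building": 1.2}

token_weights = subject_token_weights(caption, llm_weights)
print(token_weights)  # [1.5, 1.0, 1.0, 1.0, 1.2, 1.2]
```

Keeping the weight assignment separate from the embedding scaling means the same token weights could feed a different conditioning scheme (for example, attention reweighting) without changing the matching logic.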

Paper Structure (38 sections, 4 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Examples of descriptive captions from the COCO Captions dataset (Chen et al., 2015) (Left) and abstractive captions from the ANCHOR dataset (Right). Words highlighted in Blue directly translate to visual entities, while words highlighted in Red influence the image indirectly, making them abstractive.
  • Figure 2: Overview of our dataset's pre-processing and filtering steps
  • Figure 3: Overview of our Subject-Aware FinE-tuning Approach (SAFE)
  • Figure 4: Qualitative comparison of different T2I models on ANCHOR Non-Entity Subset. Words highlighted in Orange are used for subject conditioning
  • Figure 5: Distribution of article topics for samples in ANCHOR
  • ...and 5 more figures