ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis
Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee
TL;DR
The paper addresses the challenge of generating accurate images from abstractive news captions, where high-level context and Named Entities (NEs) complicate visual grounding. It introduces ANCHOR, a large-scale dataset of abstractive captions with Non-Entity and Entity subsets that stress-test caption understanding and NE generation. The core contribution is Subject-Aware Finetuning (SAFE), which uses LLM-derived subject weights to reweight caption tokens and steer diffusion-based Text-to-Image (T2I) generation toward the intended subjects. SAFE is complemented by Domain Fine-tuning (DFE), which aligns the model with the news-domain distribution using a reward-driven objective and tuned optimization. In experiments, SAFE improves image-caption alignment and realism (as measured by $FID_{CLIP}$, ImageReward, and HPS V2) and is preferred by human evaluators over baselines, indicating stronger Natural Language Understanding (NLU) in T2I models and practical potential for journalism-support tools. The work also highlights open challenges in NE generation and points to richer evaluation metrics and entity-focused fine-tuning as future directions.
Abstract
Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate the ability of T2I models to capture intended subjects from news captions, we introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from five different news media organizations. With Large Language Models (LLMs) achieving success in language and commonsense reasoning tasks, we explore the ability of different LLMs to identify and understand key subjects from abstractive captions. Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights. It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR. By launching the ANCHOR dataset, we hope to motivate research in furthering the Natural Language Understanding (NLU) capabilities of T2I models.
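
To make the idea of subject weights concrete, the following is a minimal sketch of one way per-subject weights could be spread over caption tokens and applied to token embeddings. It is an illustration under stated assumptions, not the SAFE implementation: the function names `token_weights` and `reweight_embeddings`, the whitespace tokenization, the hand-written weights (standing in for LLM output), and the toy two-dimensional embeddings are all hypothetical.

```python
from typing import Dict, List, Sequence


def token_weights(tokens: Sequence[str],
                  subject_weights: Dict[str, float],
                  default: float = 1.0) -> List[float]:
    """Assign a weight to every caption token.

    Tokens that fall inside a subject phrase receive that phrase's weight;
    all other tokens keep `default`. If phrases overlap, the phrase that
    comes later in `subject_weights` wins.
    """
    lowered = [t.lower() for t in tokens]
    weights = [default] * len(tokens)
    for phrase, weight in subject_weights.items():
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        # Slide over the caption and mark every exact phrase match.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == words:
                for j in range(i, i + n):
                    weights[j] = weight
    return weights


def reweight_embeddings(embeddings: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[List[float]]:
    """Scale each token embedding by its token weight."""
    return [[w * x for x in vec] for vec, w in zip(embeddings, weights)]


# Hypothetical caption and hand-written subject weights (assumed values,
# standing in for what an LLM might return).
caption_tokens = "Protesters gather outside the parliament building".split()
subjects = {"protesters": 1.5, "parliament building": 1.2}

weights = token_weights(caption_tokens, subjects)
toy_embeddings = [[1.0, 2.0] for _ in caption_tokens]  # placeholder vectors
conditioned = reweight_embeddings(toy_embeddings, weights)
```

In a real pipeline the embeddings would come from the text encoder of the diffusion model and the tokenization would follow that encoder's tokenizer. Both are simplified here so the sketch runs on its own.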
