Memory-Driven Text-to-Image Generation

Bowen Li; Philip H. S. Torr; Thomas Lukasiewicz

TL;DR

The paper tackles the challenge of producing highly realistic and semantically aligned images from natural language descriptions in complex scenes. It proposes a memory-driven semi-parametric framework that combines a non-parametric memory bank $M$ of image features with a parametric GAN, and augments the discriminator with content information to preserve geometric structure. Retrieval strategies including Sentence-Sentence, Sentence-Image, Words-Words, and Words-Image Matching select semantically compatible cues from $M$ to guide generation, while the generator integrates these cues via a text-image affine fusion mechanism and disentangled features. Experiments on CUB and COCO show improved FID and R-precision, supported by human evaluations, demonstrating that leveraging retrieved image cues and content-aware feedback yields more realistic and geometrically coherent images. The approach highlights the practical potential of semi-parametric, memory-guided generation for complex text-to-image tasks, enabling better use of large image datasets at inference time.

Abstract

We introduce a memory-driven semi-parametric approach to text-to-image generation, which is based on both parametric and non-parametric techniques. The non-parametric component is a memory bank of image features constructed from a training set of images. The parametric component is a generative adversarial network. Given a new text description at inference time, the memory bank is used to selectively retrieve image features that are provided as basic information of target images, which enables the generator to produce realistic synthetic results. We also incorporate the content information into the discriminator, together with semantic features, allowing the discriminator to make a more reliable prediction. Experimental results demonstrate that the proposed memory-driven semi-parametric approach produces more realistic images than purely parametric approaches, in terms of both visual fidelity and text-image semantic consistency.
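The retrieval step described above can be illustrated with a small sketch. The abstract does not specify how the memory bank is queried, so everything here is an assumption made for illustration: each memory entry is taken to pair a sentence embedding with stored image features (in the spirit of the Sentence-Sentence Matching strategy named in the TL;DR), similarity is taken to be cosine similarity, and the names `cosine_similarity`, `retrieve_image_features`, and `toy_bank` are hypothetical rather than drawn from the paper.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical memory entry: (sentence embedding, stored image features).
# The paper's actual storage layout is not given in this summary.
MemoryEntry = Tuple[List[float], List[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_image_features(
    query_embedding: Sequence[float], memory_bank: Sequence[MemoryEntry]
) -> List[float]:
    """Return the image features whose paired sentence embedding best matches the query."""
    if not memory_bank:
        raise ValueError("memory bank is empty")
    best_entry = max(
        memory_bank, key=lambda entry: cosine_similarity(query_embedding, entry[0])
    )
    return best_entry[1]


# Toy memory bank with 3-dimensional sentence embeddings and 2-dimensional image features.
toy_bank: List[MemoryEntry] = [
    ([1.0, 0.0, 0.0], [0.1, 0.2]),
    ([0.0, 1.0, 0.0], [0.3, 0.4]),
    ([0.0, 0.0, 1.0], [0.5, 0.6]),
]

# Embedding of a new description at inference time (values assumed for the example).
query = [0.1, 0.9, 0.2]
retrieved = retrieve_image_features(query, toy_bank)
```

In this toy setup the query is closest to the second entry, so its image features are the ones that would be handed to the generator. A real system would use learned text and image encoders and a much larger bank; the sketch only shows the nearest-neighbour lookup itself.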

Paper Structure (18 sections, 6 equations, 6 figures, 3 tables)

Introduction
Related Work
Overview
Memory Bank
Representation
Retrieval
Sentence-Sentence Matching
Sentence-Image Matching
Words-Words Matching
Words-Image Matching
Generative Adversarial Networks
Generator with Image Features
Discriminator with Content Information
Training
Experiments
...and 3 more sections

Figures (6)

Figure 1: Examples of text-to-image generation on COCO. Current approaches only generate low-quality images with unrealistic objects. In contrast, our method can produce realistic images, in terms of both visual appearances and geometric structure.
Figure 2: The design of the memory bank to provide image features at inference time. Note that we use the corresponding real training image to represent image features for a better visualization.
Figure 3: The architecture of our proposed method. The red box indicates the inference pipeline, which retrieves image features from a memory bank according to the given description $S$; during training, we directly feed image features from the text-paired training image. $z$ is a random vector drawn from a Gaussian distribution, and ACM denotes the text-image affine combination module.
Figure 4: The architecture of the proposed discriminator with the incorporation of content information.
Figure 5: Qualitative results on CUB and COCO. Top row: the given unseen sentences; middle row: the image features extracted from the memory bank $M$ (we use the corresponding images to represent the image features for better visualization); bottom row: the synthetic results.
...and 1 more figure
