
GazeFusion: Saliency-Guided Image Generation

Yunxiang Zhang, Nan Wu, Connor Z. Lin, Gordon Wetzstein, Qi Sun

TL;DR

GazeFusion guides viewer attention in diffusion-based image generation by conditioning the denoising process on user-specified saliency maps. The method fine-tunes a ControlNet-enabled Stable Diffusion 2.1 (SD2.1) model on saliency-image pairs from MSCOCO, optimizing the denoiser with the noise-prediction loss $\mathcal{L}=\mathbb{E}_{z,t,c_t,c_s,\epsilon \sim \mathcal{N}(0,1)} \|\epsilon_{\theta}(z_t,t,c_t,c_s)-\epsilon\|_2^2$, where $z_t$ is the noised latent at timestep $t$ and $c_t$ and $c_s$ denote the text and saliency conditions. It extends to video by using a saliency predictor to obtain temporal saliency ($\mathbf{V}$) and a zero-shot video pipeline, enabling temporally consistent saliency-guided generation. An eye-tracked user study and model-based saliency metrics show that viewer attention on the generated content aligns with the specified attention distribution. The approach supports interactive design, attention suppression, and display-adaptive generation, marking a step toward perception-aware generative models.
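
The loss above can be illustrated with a minimal NumPy sketch. It is not the authors' training code: the linear beta schedule, the tensor shapes, and the names `make_alpha_bar`, `add_noise`, `noise_prediction_loss`, and `zero_denoiser` are all assumptions made for illustration, and the trained ControlNet-enabled denoiser $\epsilon_{\theta}$ is replaced by an arbitrary callable.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed DDPM-style linear beta schedule; the paper's schedule may differ.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z0, eps, alpha_bar_t):
    # Forward diffusion: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps
    a = alpha_bar_t.reshape(-1, 1, 1, 1)
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(denoiser, z0, eps, t, c_t, c_s, alpha_bar):
    # One Monte Carlo estimate of E || eps_theta(z_t, t, c_t, c_s) - eps ||_2^2,
    # averaged over the batch dimension.
    z_t = add_noise(z0, eps, alpha_bar[t])
    eps_hat = denoiser(z_t, t, c_t, c_s)
    per_sample = np.sum((eps_hat - eps) ** 2, axis=(1, 2, 3))
    return float(per_sample.mean())

def zero_denoiser(z_t, t, c_t, c_s):
    # Hypothetical placeholder for eps_theta: ignores its conditions
    # and always predicts zero noise.
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((2, 4, 8, 8))        # latents (batch, channels, H, W)
eps = rng.standard_normal(z0.shape)           # target noise
t = rng.integers(0, len(alpha_bar), size=2)   # one timestep per sample
c_t = rng.standard_normal((2, 16))            # stand-in text embeddings
c_s = rng.random((2, 1, 8, 8))                # stand-in saliency maps
loss = noise_prediction_loss(zero_denoiser, z0, eps, t, c_t, c_s, alpha_bar)
print(loss)
```

With the zero-predicting placeholder, the loss reduces to the mean squared norm of the sampled noise; a denoiser that recovers $\epsilon$ exactly would drive it to zero.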

Abstract

Diffusion models offer unprecedented image generation power given just a text prompt. While emerging approaches for controlling diffusion models have enabled users to specify the desired spatial layouts of the generated content, they cannot predict or control where viewers will pay more attention due to the complexity of human vision. Recognizing the significance of attention-controllable image generation in practical applications, we present a saliency-guided framework to incorporate the data priors of human visual attention mechanisms into the generation process. Given a user-specified viewer attention distribution, our control module conditions a diffusion model to generate images that attract viewers' attention toward the desired regions. To assess the efficacy of our approach, we performed an eye-tracked user study and a large-scale model-based saliency analysis. The results evidence that both the cross-user eye gaze distributions and the saliency models' predictions align with the desired attention distributions. Lastly, we outline several applications, including interactive design of saliency guidance, attention suppression in unwanted regions, and adaptive generation for varied display/viewing conditions.
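
To make "alignment" between a desired attention distribution and a predicted one concrete, the sketch below computes two scores commonly used for comparing saliency maps: the Pearson correlation coefficient (CC) and the KL divergence. Whether the paper reports these exact metrics is an assumption here, and the helper names (`gaussian_blob`, `pearson_cc`, `kl_divergence`) and the synthetic maps are purely illustrative.

```python
import numpy as np

def gaussian_blob(h, w, cy, cx, sigma):
    # Synthetic saliency map: an isotropic Gaussian centered at (cy, cx).
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def pearson_cc(pred, target):
    # Pearson correlation between two maps; 1.0 means perfectly correlated.
    p = pred.ravel() - pred.mean()
    q = target.ravel() - target.mean()
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def kl_divergence(pred, target, eps=1e-12):
    # KL divergence of the target distribution from the predicted one,
    # after normalizing both non-negative maps to sum to 1. Lower is better.
    p = pred / (pred.sum() + eps)
    q = target / (target.sum() + eps)
    return float(np.sum(q * np.log(eps + q / (p + eps))))

target = gaussian_blob(32, 32, 16, 16, 4.0)      # desired attention
aligned = gaussian_blob(32, 32, 16, 17, 4.0)     # prediction near the target
misaligned = gaussian_blob(32, 32, 4, 28, 4.0)   # prediction far from it

for name, pred in [("aligned", aligned), ("misaligned", misaligned)]:
    print(name, pearson_cc(pred, target), kl_divergence(pred, target))
```

A prediction concentrated near the target region yields a higher CC and a lower KL divergence than one concentrated elsewhere.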

Paper Structure (28 sections, 4 equations, 13 figures, 2 tables)


Figures (13)

  • Figure 1: Attention-controllable image generation with saliency guidance. Given a text prompt and saliency map pair, GazeFusion generates images that not only contain the content as described by the text prompt but also attract viewers' attention toward the desired image regions as emphasized by the saliency map. As illustrated in \ref{fig:teaser-all}, GazeFusion understands and exploits a diversity of factors inducing visual saliency, including low-level image features, such as color, contrast, frequency, orientation, and layout, and high-level semantic information, such as objects, texts, and faces; \ref{fig:teaser-apple} demonstrates that GazeFusion can flexibly manipulate viewers' visual attention within the generated images by purely adjusting the color and contrast of specific image content while precisely following the user-specified layout.
  • Figure 2: Saliency-guided image generation with GazeFusion. Given randomly sampled noisy images as inputs, GazeFusion conditions the denoising process on user-specified saliency maps and text prompts such that the image features and semantic content in the generated images can trigger similarly distributed viewer attention.
  • Figure 3: Saliency-guided image generation with GazeFusion. During the generation process, GazeFusion leverages a variety of factors that affect visual saliency, such as color (e.g., the tomato and the apple scenes), frequency (e.g., the maze and the lake scenes), contrast (e.g., the fur and the marble texture scenes), orientation (e.g., the wood texture and the macaron scenes), layout (e.g., the last 3 rows), and high-level semantic information (e.g., the snake and the bird scenes).
  • Figure 4: Saliency-guided video generation with GazeFusion. By leveraging an off-the-shelf zero-shot video generation pipeline, GazeFusion can be extended to generate temporally consistent video clips with spatial-temporal saliency guidance.
  • Figure 5: User study setup. We used a Tobii Pro Spark eye tracker to record study participants' eye-gaze directions while they browsed through a sequence of generated images.
  • ...and 8 more figures