Data Augmentation via Latent Diffusion for Saliency Prediction

Bahar Aydemir, Deblina Bhattacharjee, Tong Zhang, Mathieu Salzmann, Sabine Süsstrunk

TL;DR

This work introduces a saliency-guided cross-attention mechanism that enables targeted edits to the photometric properties of specific image regions, thereby enhancing their saliency, and shows that the resulting data augmentation method consistently improves the performance of various saliency models.

Abstract

Saliency prediction models are constrained by the limited diversity and quantity of labeled data. Standard data augmentation techniques such as rotating and cropping alter scene composition, affecting saliency. We propose a novel data augmentation method for deep saliency prediction that edits natural images while preserving the complexity and variability of real-world scenes. Since saliency depends on high-level and low-level features, our approach involves learning both by incorporating photometric and semantic attributes such as color, contrast, brightness, and class. To that end, we introduce a saliency-guided cross-attention mechanism that enables targeted edits on the photometric properties, thereby enhancing saliency within specific image regions. Experimental results show that our data augmentation method consistently improves the performance of various saliency models. Moreover, leveraging the augmentation features for saliency prediction yields superior performance on publicly available saliency benchmarks. Our predictions align closely with human visual attention patterns in the edited images, as validated by a user study.

Paper Structure (25 sections, 12 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of our data augmentation and saliency prediction method. We use photometric and semantic properties such as color, contrast, brightness, and class to construct low-level and high-level features from the intermediate U-Net features. We leverage these features to estimate saliency and generate image edits to augment training data for saliency prediction. We use a saliency-guided cross-attention mechanism between the input image and the text prompt to perform localized edits.
  • Figure 2: Overview of the proposed architecture in training. We use an encoder ($\mathcal{E}$), a decoder ($\mathcal{D}$), and a denoising U-Net from Stable Diffusion rombach2021ldm. We invert the input image to extract encoded representations from the U-Net to construct low-level and high-level features. We train the Low-Level Feature Readout (LLFR) and High-Level Feature Readout (HLFR) modules using the related photometric and semantic properties inside the area of interest (AoI), respectively. Finally, we concatenate the low-level and high-level features to predict the saliency map ($\mathbf{S}'$) using the Saliency Readout (SR) module. $\mathbf{Z}$ and $\mathbf{Z}'$ represent the encoded and denoised image latent vectors, respectively.
  • Figure 3: Overview of the proposed image editing architecture. We extract cross-attention maps between the input image and the prompt, and then multiply them with the saliency map to create spatial attention maps. These maps highlight the intersections of salient regions with elements from the prompt. We inject these maps into the denoising U-Net alongside the edited prompt. This integration modifies the latent features $\mathbf{Z}'$ and the extracted multi-level features, resulting in the generated edits. We use frozen readout layers to constrain the edits in terms of those features.
  • Figure 4: Original image, selected editing region, and edited images at different intensity levels for contrast, brightness, and color edits. In the first row, we increase the contrast in the bread. The second row shows an increase in the brightness of the red bus. The last row shows a progressive color edit of the chair to purple. These edits aim to increase the saliency of the target region shown in the second column.
  • Figure 5: (Top Row:) Original and edited images. (Middle Row:) Ground-truth saliency maps from SALICON salicon, shown in blue, and from our user study, shown in green. Our generated image edits can shift human attention toward the edited region. For instance, when we enhance the contrast of the horse in the image on the left, attention focuses increasingly on the horse as the intensity of the edit increases. Similarly, the cars gather more attention as their brightness increases. (Bottom Row:) Our saliency prediction model produces saliency estimates that align with the ground-truth maps.
  • ...and 1 more figure
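
The Figure 3 caption describes multiplying a cross-attention map by the saliency map to obtain a spatial attention map that highlights where salient regions and prompt elements intersect. The following is a minimal sketch of that element-wise step only, not the authors' implementation. The function name `saliency_weighted_attention`, the nested-list map representation, and the final rescaling to [0, 1] are assumptions made for illustration; the caption does not say whether or how the product is normalized.

```python
def saliency_weighted_attention(cross_attention, saliency, eps=1e-8):
    """Element-wise product of one token's cross-attention map and a saliency map.

    Both inputs are H x W nested lists of non-negative floats.
    The rescaling to [0, 1] is an illustrative assumption, not taken from the paper.
    """
    same_height = len(cross_attention) == len(saliency)
    same_width = all(
        len(row_a) == len(row_s) for row_a, row_s in zip(cross_attention, saliency)
    )
    if not (same_height and same_width):
        raise ValueError("cross_attention and saliency must have the same shape")

    # A cell stays high only if the prompt token attends to it AND it is salient.
    product = [
        [a * s for a, s in zip(row_a, row_s)]
        for row_a, row_s in zip(cross_attention, saliency)
    ]

    peak = max(max(row) for row in product)
    if peak <= eps:
        # No overlap between the attended region and the salient region.
        return product
    return [[value / peak for value in row] for row in product]


if __name__ == "__main__":
    # Hypothetical 2 x 2 maps: attention for a prompt token such as "bus",
    # and a saliency map over the same spatial grid.
    attention = [[0.1, 0.8], [0.2, 0.9]]
    saliency = [[0.0, 1.0], [0.5, 0.5]]
    for row in saliency_weighted_attention(attention, saliency):
        print(row)
```

In this toy example the bottom-right cell has the strongest raw attention but only moderate saliency, so after weighting the top-right cell, which is both attended and fully salient, becomes the peak of the spatial map.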