Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection

Federico Betti, Lorenzo Baraldi, Lorenzo Baraldi, Rita Cucchiara, Nicu Sebe

TL;DR

This work introduces HEaD (Hallucination Early Detection), a new paradigm that detects incorrect generations at the beginning of the diffusion process. HEaD can save up to 12% of the generation time in a two-object scenario, which underscores the importance of early detection mechanisms in generative models.

Abstract

Diffusion models have significantly advanced generative AI, but they struggle to generate complex combinations of multiple objects. Because the final result depends heavily on the initial seed, obtaining the desired output can require multiple runs of the generation process. This repetition wastes time and increases energy consumption. To tackle this issue, we introduce HEaD (Hallucination Early Detection), a new paradigm designed to swiftly detect incorrect generations at the beginning of the diffusion process. The HEaD pipeline combines cross-attention maps with a new indicator, the Predicted Final Image, to forecast the final outcome from the information available at early stages of the generation process. We demonstrate that HEaD saves computational resources and reduces the time needed to obtain a complete image, i.e., an image in which all requested objects are accurately depicted. Our findings reveal that HEaD can save up to 12% of the generation time in a two-object scenario and underscore the importance of early detection mechanisms in generative models.
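
The control flow described above (check early, then either continue or restart with a different seed) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `init_state`, `denoise`, and `will_be_complete` are hypothetical callables standing in for the sampler and for the early predictor, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Iterable, Optional, Tuple


def generate_with_early_check(
    init_state: Callable[[int], dict],
    denoise: Callable[[dict, int, int], dict],
    will_be_complete: Callable[[dict], bool],
    total_steps: int,
    critical_step: int,
    seeds: Iterable[int],
) -> Tuple[Optional[dict], Optional[int], int]:
    """Try seeds in order, aborting a seed early when the check fails.

    All callables are hypothetical placeholders:
      init_state(seed)           -> initial latent state for this seed
      denoise(state, start, end) -> state after running steps [start, end)
      will_be_complete(state)    -> early prediction that all objects appear
    Returns (final_state, seed_used, denoising_steps_spent).
    """
    steps_spent = 0
    for seed in seeds:
        state = init_state(seed)
        # Run only the first part of the diffusion process.
        state = denoise(state, 0, critical_step)
        steps_spent += critical_step
        if not will_be_complete(state):
            continue  # predicted hallucination: restart with the next seed
        # Prediction is positive: finish the remaining steps.
        state = denoise(state, critical_step, total_steps)
        steps_spent += total_steps - critical_step
        return state, seed, steps_spent
    return None, None, steps_spent


# Toy stand-ins (assumptions for illustration only).
def toy_init(seed: int) -> dict:
    return {"seed": seed, "steps": 0}


def toy_denoise(state: dict, start: int, end: int) -> dict:
    return {"seed": state["seed"], "steps": state["steps"] + (end - start)}


def toy_check(state: dict) -> bool:
    return state["seed"] % 3 == 0  # pretend only every third seed succeeds


if __name__ == "__main__":
    final, seed, spent = generate_with_early_check(
        toy_init, toy_denoise, toy_check,
        total_steps=50, critical_step=16, seeds=[1, 2, 3],
    )
    # Two aborted seeds (16 steps each) plus one full run: 16 + 16 + 50 = 82.
    print(seed, spent)
```

The sketch returns as soon as the early check accepts a seed. A real predictor can also accept a generation that turns out incomplete, so a final verification step may still be needed; that case is left out here for brevity.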

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the HEaD pipeline: during the generation process, HEaD assesses whether all designated objects will be accurately represented in the final image, determining if the generation process should continue or be restarted with a different seed.
  • Figure 2: Extraction of subjects, cross-attention maps, and the Predicted Final Image (PFI) at each critical timestep $t_{c} \in \mathcal{T}$. These elements serve as inputs for the HP network, which evaluates the presence of objects in the final image. For the depicted seed, the bench appears in the final image, whereas the dolphin does not.
  • Figure 3: Qualitative examples of the Predicted Final Image for each prompt at different critical timesteps. From as early as the 16th step, the final image is fully represented and the presence of objects can be predicted.
  • Figure 4: Recall and TN-rate values for HP-R across various $t_{c_k}$. Lower $t_{c_k}$ values, associated with lower-quality input, significantly impact the TN-rate but minimally affect Recall. Consequently, the overall time saved tends to be greater for smaller $t_{c_k}$ values.
  • Figure 5: Relative time saved by adopting the HEaD approach, compared with not adopting it, to reach a complete generation, using HP-R with different $t_{c_k}$, as a function of the probability of a correct image generation. The vertical red line marks the probability of correct generation in a two-object scenario, i.e., 59%.
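
To make the relationship in Figures 4 and 5 concrete, the sketch below estimates the relative time saved under a simple retry model. The model is an assumption made here for illustration and is not taken from the paper: attempts are independent, time is counted in denoising steps only (the cost of the HP network is ignored), an attempt is checked once at the critical step, and generation is repeated until a complete image is produced. The function and parameter names are hypothetical, and the step counts in the example are placeholders; only the 59% probability comes from the Figure 5 caption.

```python
def expected_steps_baseline(total_steps: int, p_correct: float) -> float:
    """Expected steps when every attempt runs to the end (no early check)."""
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    # Geometric number of attempts with success probability p_correct.
    return total_steps / p_correct


def expected_steps_early_check(
    total_steps: int,
    critical_step: int,
    p_correct: float,
    recall: float,
    tn_rate: float,
) -> float:
    """Expected steps when attempts can be aborted at critical_step.

    recall  : probability that a correct generation is allowed to continue
    tn_rate : probability that an incorrect generation is aborted
    """
    if not 0.0 < p_correct <= 1.0:
        raise ValueError("p_correct must be in (0, 1]")
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    if not 0.0 <= tn_rate <= 1.0:
        raise ValueError("tn_rate must be in [0, 1]")
    if not 0 <= critical_step <= total_steps:
        raise ValueError("critical_step must be in [0, total_steps]")
    # An attempt continues past the check if it is correct and recalled,
    # or incorrect and missed by the detector.
    p_continue = p_correct * recall + (1.0 - p_correct) * (1.0 - tn_rate)
    steps_per_attempt = critical_step + (total_steps - critical_step) * p_continue
    # Only correct attempts that pass the check end the retry loop.
    p_success = p_correct * recall
    return steps_per_attempt / p_success


def relative_saving(
    total_steps: int,
    critical_step: int,
    p_correct: float,
    recall: float,
    tn_rate: float,
) -> float:
    """Fraction of expected steps saved relative to the baseline."""
    baseline = expected_steps_baseline(total_steps, p_correct)
    with_check = expected_steps_early_check(
        total_steps, critical_step, p_correct, recall, tn_rate
    )
    return 1.0 - with_check / baseline


if __name__ == "__main__":
    # Placeholder values: 50 total steps, check at step 16, and assumed
    # detector quality; p_correct = 0.59 is the value cited in Figure 5.
    print(relative_saving(50, 16, p_correct=0.59, recall=0.95, tn_rate=0.80))
```

Under this model the saving simplifies to 1 - steps_per_attempt / (recall * total_steps), which shows the trade-off in Figure 4: an earlier check lowers the cost of an aborted attempt, while a lower TN-rate lets more failing attempts run to completion and a lower Recall discards attempts that would have succeeded.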