Diffusion-based Visual Anagram as Multi-task Learning

Zhiyuan Xu, Yinhe Chen, Huan-ang Gao, Weiyan Zhao, Guiyu Zhang, Hao Zhao

TL;DR

This work reframes diffusion-based visual anagram generation as a multi-task learning problem across multiple views to mitigate concept segregation and domination. It introduces Anti-Segregation Optimization to encourage shared attention across prompts, Noise Vector Balancing to adaptively weight per-view noise trajectories, and Noise Variance Rectification to preserve diffusion statistics. Empirical results on CIFAR-10 prompts demonstrate improved alignment and robustness over baselines in both 2-view and 3-view settings, with ablations confirming complementary contributions. The methods yield more faithful visual anagrams across diverse concepts and viewpoints, supporting applications in design and cognitive studies; the authors note limitations in latent-space consistency and in the range of supported transformation types.

Abstract

Visual anagrams are images that change appearance upon transformation, such as flipping or rotation. With the advent of diffusion models, such optical illusions can be generated by averaging noise across multiple views during the reverse denoising process. However, we observe two critical failure modes in this approach: (i) concept segregation, where the concepts in different views are generated independently, so the result cannot be considered a true anagram, and (ii) concept domination, where certain concepts overpower others. In this work, we cast visual anagram generation as a multi-task learning problem, where different viewpoint prompts are analogous to different tasks, and derive denoising trajectories that align well across tasks simultaneously. At the core of our framework are two newly introduced techniques: (i) an anti-segregation optimization strategy that promotes overlap in cross-attention maps between different concepts, and (ii) a noise vector balancing method that adaptively adjusts the influence of different tasks. Additionally, we observe that directly averaging noise predictions yields suboptimal performance because statistical properties may not be preserved, prompting us to derive a noise variance rectification method. Extensive qualitative and quantitative experiments demonstrate our method's superior ability to generate visual anagrams spanning diverse concepts.
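To make the baseline idea of "averaging noise across multiple views" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each view is an invertible image transformation paired with its own prompt, and that the noise predicted in the transformed frame is mapped back before averaging. The names `averaged_noise`, `flip_vertical`, `identity`, and `toy_noise_predictor` are hypothetical, and the predictor is a random stand-in for a real text-conditioned diffusion model.

```python
import random


def identity(img):
    """View that leaves the image (a list of rows) unchanged."""
    return img


def flip_vertical(img):
    """View that turns the image upside down; it is its own inverse."""
    return img[::-1]


def toy_noise_predictor(img, prompt):
    """Hypothetical stand-in for a diffusion model's noise prediction."""
    rng = random.Random(prompt)  # deterministic per prompt, illustration only
    return [[px + rng.gauss(0.0, 1.0) for px in row] for row in img]


def averaged_noise(x_t, views, prompts, predict=toy_noise_predictor):
    """Average per-view noise estimates in the original image frame.

    views   -- list of (view, inverse_view) function pairs
    prompts -- one text prompt per view
    """
    height, width = len(x_t), len(x_t[0])
    total = [[0.0] * width for _ in range(height)]
    for (view, inverse), prompt in zip(views, prompts):
        # Predict noise on the transformed image, then undo the transform.
        eps = inverse(predict(view(x_t), prompt))
        for r in range(height):
            for c in range(width):
                total[r][c] += eps[r][c] / len(views)
    return total


x_t = [[0.1, 0.2], [0.3, 0.4]]
views = [(identity, identity), (flip_vertical, flip_vertical)]
prompts = ["a garden gnome", "a hot air balloon"]
eps_hat = averaged_noise(x_t, views, prompts)
```

Every view receives the same weight `1 / len(views)` here. The abstract identifies this uniform weighting as one source of concept domination.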

Paper Structure

This paper contains 19 sections, 9 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Visual Anagrams. We show an example of visual anagrams, which can be perceived as a garden gnome or a hot air balloon depending on the orientation of the image.
  • Figure 2: Failure Cases. We show two common failure cases of the baseline method [geng2024visual]: concept segregation (left) and concept domination (right).
  • Figure 3: Method overview. During each denoising step, the intermediate image $x_t$ passes through the diffusion model together with the corresponding text prompt under each view, and also through a noise-aware CLIP model that measures the degree of task completion for each view. (1) Noise Vector Balancing: predicted noise vectors are reweighted based on the degree of task completion before being combined (see \ref{sec:noise-balancing}). (2) Noise Variance Rectification: combined noise vectors are rectified by applying a scale factor calculated from estimated correlation coefficients (see \ref{sec:noise-unitization}). (3) Anti-Segregation Optimization: the denoised image $x_{t-1}'$ is modulated with an inference-time loss term that encourages the attention maps of different views to intersect, before being passed to the next denoising step (see \ref{sec:anti-se}).
  • Figure 4: More qualitative results of our proposed method compared to the baseline method [geng2024visual].
  • Figure 5: Qualitative results of Anti-Segregation Optimization and visualization of attention maps.
  • ...and 6 more figures
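
The Figure 3 caption says the combined noise is rescaled by a factor computed from estimated correlation coefficients. The sketch below illustrates the statistical fact that motivates such a rescaling; it is not the paper's exact formula. It assumes each per-view noise prediction has unit variance and that the pairwise correlations `rho[i][j]` are already known, so a weighted sum has variance $\sum_{i,j} w_i w_j \rho_{ij}$ and dividing by its square root restores unit variance. The names `combined_variance`, `rectify`, and `sample_correlated_pair` are hypothetical.

```python
import math
import random


def combined_variance(weights, rho):
    """Variance of sum_i w_i * eps_i for unit-variance eps_i.

    rho[i][j] is the correlation between eps_i and eps_j (rho[i][i] == 1).
    """
    n = len(weights)
    return sum(weights[i] * weights[j] * rho[i][j]
               for i in range(n) for j in range(n))


def rectify(combined, weights, rho):
    """Rescale a combined noise vector back to unit variance."""
    scale = 1.0 / math.sqrt(combined_variance(weights, rho))
    return [scale * value for value in combined]


def sample_correlated_pair(rho_12, rng):
    """Draw two unit-variance Gaussian samples with correlation rho_12."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return z1, rho_12 * z1 + math.sqrt(1.0 - rho_12 ** 2) * z2


# Two views, equal weights, assumed correlation 0.2 between predictions.
weights = [0.5, 0.5]
rho = [[1.0, 0.2], [0.2, 1.0]]
rng = random.Random(0)
pairs = [sample_correlated_pair(0.2, rng) for _ in range(20000)]
combined = [weights[0] * e1 + weights[1] * e2 for e1, e2 in pairs]
rectified = rectify(combined, weights, rho)
```

With these assumed values the plain average has variance of about 0.6, below the unit variance the sampler expects, which is why averaging alone can distort diffusion statistics. In practice the correlations would have to be estimated from the predictions themselves.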