Align, Adapt and Inject: Sound-guided Unified Image Generation

Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

TL;DR

This work addresses the challenge of using sound as a guidance signal for diffusion-based image generation, editing, and stylization. It introduces Align, Adapt and Inject (AAI), a three-stage framework that first aligns audio with text and vision through a tri-modal encoder, then maps the aligned audio into a semantic token $A^*$ via an audio adapter, and finally injects $A^*$ into frozen T2I models to enable multi-task sound-guided generation. The approach yields state-of-the-art results in audio-visual retrieval and competitive audio-text alignment, while providing superior sound-guided image generation quality compared to text- or sound-based baselines, and it enables efficient plug-and-play use of large diffusion models. This work broadens the practical scope of AIGC by enabling multi-task, sound-guided manipulation with reduced data and computation needs. Overall, AAI demonstrates the viability of token-based audio guidance and sets a path for finer-grained audio representations in future diffusion-based systems.

Abstract

Text-guided image generation has witnessed unprecedented progress due to the development of diffusion models. Beyond text and image, sound is a vital element of human perception, offering vivid representations and naturally coinciding with corresponding scenes. Taking advantage of sound therefore presents a promising avenue for exploration within image generation research. However, the relationship between audio and image supervision remains significantly underexplored, and the scarcity of related, high-quality datasets brings further obstacles. In this paper, we propose a unified framework, 'Align, Adapt, and Inject' (AAI), for sound-guided image generation, editing, and stylization. In particular, our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing powerful diffusion-based Text-to-Image (T2I) models. Specifically, we first train a multi-modal encoder to align the audio representation with the pre-trained textual manifold and visual manifold, respectively. Then, we propose an audio adapter that adapts the audio representation into an audio token enriched with specific semantics, which can be injected flexibly into a frozen T2I model. In this way, we are able to extract the dynamic information of varied sounds while utilizing the formidable capability of existing T2I models to facilitate sound-guided image generation, editing, and stylization in a convenient and cost-effective manner. The experimental results confirm that our proposed AAI outperforms other text- and sound-guided state-of-the-art methods. Our aligned multi-modal encoder is also competitive with other approaches in the audio-visual retrieval and audio-text retrieval tasks.
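
The three stages named in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the symmetric contrastive loss, the linear adapter, the temperature value, the embedding sizes, and every function name below are assumptions chosen only to make the align, adapt, and inject steps concrete. In the paper the audio representation is aligned with both the textual and the visual manifold; the sketch shows a single pairing.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def symmetric_contrastive_loss(a, b, temperature=0.07):
    # Align (assumed loss): row i of `a` and row i of `b` form a matched pair;
    # every other row in the batch acts as a negative.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def diag_cross_entropy(m):
        m = m - m.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a->b and b->a directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def adapt_audio(audio_emb, weight, bias):
    # Adapt (hypothetical linear adapter): map an aligned audio embedding
    # into the token-embedding space of the text encoder.
    return audio_emb @ weight + bias

def inject_token(prompt_embs, placeholder_index, audio_token):
    # Inject: overwrite the placeholder word's embedding with the audio token.
    # The input array is copied, so the original prompt embeddings are untouched.
    out = prompt_embs.copy()
    out[placeholder_index] = audio_token
    return out

rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 16))  # toy batch of audio embeddings
text = rng.normal(size=(4, 16))   # toy batch of unrelated text embeddings
print("loss, unrelated pairs:", symmetric_contrastive_loss(audio, text))
print("loss, identical pairs:", symmetric_contrastive_loss(audio, audio))

weight = rng.normal(size=(16, 8))  # assumed adapter weights (untrained)
bias = np.zeros(8)
a_star = adapt_audio(audio[0], weight, bias)   # stands in for the audio token A*
prompt = rng.normal(size=(5, 8))               # toy embeddings for a 5-word prompt
conditioned = inject_token(prompt, 3, a_star)  # word 3 plays the role of the placeholder
print("conditioned prompt shape:", conditioned.shape)
```

In this sketch the text-to-image model never appears: it only receives the modified prompt embeddings, which is the sense in which a frozen model can be reused without retraining.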

Paper Structure (45 sections, 10 equations, 16 figures, 3 tables)

Figures (16)

  • Figure 1: Examples from our unified strategy AAI. $A^*$ is the audio token for each audio. Our method provides various capabilities based on sound inputs. The user can use the input sound for image generation (left), image editing (middle), and image stylization (right) flexibly.
  • Figure 2: (a) Overview of the Sound-guided Unified Generative Model: audio representation alignment, adaption, and injection. (b) Contrastive Multi-modal Alignment Stage. (c) Audio-Representation Adaption Stage. Each stage is described in detail in the Method section.
  • Figure 3: Results of the image generation given different audio inputs.
  • Figure 4: Results of the image editing (a) and stylization (b) given different audio inputs.
  • Figure 5: Qualitative comparisons to alternative personalized creation approaches. (a) Our model captures the semantics of the audio more accurately, enabling it to generate images that are typically more faithful to the input. (b) Human evaluation results comparing our model with other existing models.
  • ...and 11 more figures