Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models

Julian Perry, Frank Sanders, Carter Scott

TL;DR

This paper addresses the challenge of producing high-quality, text-aligned images efficiently by integrating Large Language Models (LLMs) with diffusion models. It introduces a dynamic KL-weighting scheme and a weak-to-strong LLM guidance strategy to progressively impose semantic structure during diffusion, implemented via time-conditioned text embeddings and cross-attention. Empirical results on COCO show superior FID and IS scores, as well as favorable human evaluations, compared to GAN-based and diffusion baselines, with ablations confirming the critical roles of LLM guidance and dynamic KL weighting. The approach demonstrates robustness to language variability and scalability to larger multimodal datasets, highlighting potential for broader applications beyond text-to-image generation.

Abstract

In this paper, we present a novel method for improving text-to-image generation by combining Large Language Models (LLMs) with diffusion models, a hybrid approach aimed at achieving both higher quality and greater efficiency in image synthesis from text descriptions. Our approach introduces a new dynamic KL-weighting strategy to optimize the diffusion process and incorporates semantic understanding from pre-trained LLMs to guide generation. The proposed method significantly improves both the visual quality of generated images and their alignment with text descriptions, addressing challenges such as computational inefficiency, training instability, and sensitivity to textual variability. We evaluate our method on the COCO dataset and demonstrate its superior performance over traditional GAN-based models, both quantitatively and qualitatively. Extensive experiments, including ablation studies and human evaluations, confirm that our method outperforms existing approaches in terms of image realism, relevance to the input text, and overall aesthetic quality. Our approach also shows promise in scaling to other multimodal tasks, making it a versatile solution for a wide range of generative applications.

Paper Structure

This paper contains 23 sections, 6 equations, 5 tables.