Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models
Emily Johnson, Noah Wilson
TL;DR
VLAD (Vision-Language Aligned Diffusion) addresses the challenge of aligning complex textual prompts with high-quality images by combining semantic alignment from Large Vision-Language Model (LVLM) embeddings with a hierarchical diffusion pipeline. A Contextual Composition Module (CCM) decomposes each prompt into global and local semantics, and a two-stage diffusion process, built from a Text Layout Generator and a Visual Feature Enhancer, synthesizes the image under guidance from that textual structure. Training uses LoRA for efficient, scalable fine-tuning on diverse prompts. In quantitative and human evaluations on MARIO-Eval and INNOVATOR-Eval, VLAD achieves strong image fidelity, accurate text rendering, and solid semantic alignment compared with state-of-the-art methods. The work presents a practical, adaptable framework for complex text-to-image generation and identifies the CCM, hierarchical guidance, and LoRA as critical to performance and efficiency.
Abstract
Text-to-image generation has advanced significantly with the integration of Large Vision-Language Models (LVLMs), yet aligning complex textual descriptions with high-quality, visually coherent images remains challenging. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses this challenge through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD uses a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, which supports precise alignment with visual features. It also incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments on the MARIO-Eval and INNOVATOR-Eval benchmarks show that VLAD significantly outperforms state-of-the-art methods in image quality, semantic alignment, and text rendering accuracy. Human evaluations corroborate these results, making VLAD a promising approach for text-to-image generation in complex scenarios.
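
The TL;DR names LoRA as the mechanism for efficient fine-tuning. As a rough illustration of the general LoRA idea only (not VLAD's actual training code), the sketch below adds a low-rank update to a frozen weight matrix in plain Python. The layer sizes, the rank `r`, the scale `alpha`, and the helper names `matmul`, `lora_forward`, and `trainable_params` are all assumptions chosen for illustration; the paper's own settings are not given in this section.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha):
    """Compute x @ W + (alpha / r) * (x @ A) @ B.

    W is the frozen base weight (d_in x d_out). A (d_in x r) and
    B (r x d_out) are the only trainable matrices; r is the adapter rank.
    """
    r = len(B)
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

def trainable_params(d_in, d_out, r):
    """Number of adapter parameters, versus d_in * d_out for full fine-tuning."""
    return r * (d_in + d_out)

# Hypothetical toy sizes, not taken from the paper.
random.seed(0)
d_in, d_out, r = 8, 6, 2
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d_in)]
B = [[0.0] * d_out for _ in range(r)]  # zero init: adapter starts as a no-op
x = [[random.gauss(0, 1) for _ in range(d_in)]]

y = lora_forward(x, W, A, B, alpha=4.0)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `r * (d_in + d_out)` values are trained instead of `d_in * d_out`. That reduction is the usual reason LoRA makes fine-tuning cheaper.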
