Generative AI for Vision: A Comprehensive Study of Frameworks and Applications
Fouad Bousetouane
TL;DR
This paper surveys generative AI for vision through an input-centric taxonomy, organizing image-generation techniques by input type: noisy vectors, latent representations, conditional inputs, and textual prompts. It covers GANs, VAEs, and diffusion models, along with prompt-to-image frameworks such as Stable Diffusion, DALL-E, and Janus-Pro and derivatives such as ControlNet and related adapters. It discusses practical applications in design, healthcare, and autonomous systems, addresses challenges around bias, computational cost, and user-intent alignment, and outlines future directions in multimodal alignment, scalability, ethics, and agentic vision. The work provides a structured resource that connects foundational models to real-world workflows, helping researchers and practitioners build robust, controllable, and ethically aligned vision-generation systems.
Abstract
Generative AI is transforming image synthesis, enabling the creation of high-quality, diverse, and photorealistic visuals across industries such as design, media, healthcare, and autonomous systems. Advances in image-to-image translation, text-to-image generation, domain transfer, and multimodal alignment have broadened the scope of automated visual content creation. These advances are driven by models such as Generative Adversarial Networks (GANs), conditional frameworks, and diffusion-based approaches such as Stable Diffusion. This work classifies image generation techniques by the nature of their input, organizing methods into input modalities such as noisy vectors, latent representations, and conditional inputs. We explore the principles behind these models, highlight key frameworks including DALL-E, ControlNet, and DeepSeek Janus-Pro, and address challenges such as computational cost, data bias, and alignment of outputs with user intent. By offering this input-centric perspective, the study pairs technical depth with practical insight, giving researchers and practitioners a comprehensive resource for applying generative AI to real-world problems.
