Generative AI for Vision: A Comprehensive Study of Frameworks and Applications

Fouad Bousetouane

TL;DR

This paper surveys generative AI for vision through an input-centric taxonomy that organizes image-generation techniques by input type: noisy vectors, latent representations, conditional inputs, and textual prompts. It covers GANs, VAEs, diffusion models, and prompt-to-image frameworks such as Stable Diffusion, DALL-E, and Janus-Pro, along with key derivatives such as ControlNet and related adapters. It discusses practical applications in design, healthcare, and autonomous systems, addresses challenges around bias, computational cost, and user-intent alignment, and outlines future directions in multimodal alignment, scalability, ethics, and agentic vision. The result is a structured resource that connects foundational models to real-world workflows, intended to help researchers and practitioners build robust, controllable, and ethically aligned vision-generation systems.

Abstract

Generative AI is transforming image synthesis, enabling the creation of high-quality, diverse, and photorealistic visuals across industries like design, media, healthcare, and autonomous systems. Advances in techniques such as image-to-image translation, text-to-image generation, domain transfer, and multimodal alignment have broadened the scope of automated visual content creation, supporting a wide spectrum of applications. These advancements are driven by models like Generative Adversarial Networks (GANs), conditional frameworks, and diffusion-based approaches such as Stable Diffusion. This work presents a structured classification of image generation techniques based on the nature of the input, organizing methods by input modalities like noisy vectors, latent representations, and conditional inputs. We explore the principles behind these models, highlight key frameworks including DALL-E, ControlNet, and DeepSeek Janus-Pro, and address challenges such as computational costs, data biases, and output alignment with user intent. By offering this input-centric perspective, this study bridges technical depth with practical insights, providing researchers and practitioners with a comprehensive resource to harness generative AI for real-world applications.

Paper Structure

This paper contains 69 sections, 2 equations, 16 figures.

Figures (16)

  • Figure 1: Key categories of input-driven image generation techniques. These categories illustrate the distinct workflows and methodologies involved in generating compelling visual outputs.
  • Figure 2: Adversarial training setup for a Vanilla GAN. The generator ($G$) maps noise vectors to generated images, while the discriminator ($D$) evaluates whether inputs are real or generated. The adversarial process ensures that both networks improve iteratively.
  • Figure 3: Pix2Pix supports diverse paired image-to-image translation tasks, such as labels-to-street scenes, aerial images to maps, black-and-white to color, and edges to photos. These examples highlight its versatility in addressing various vision challenges [isola2017image].
  • Figure 4: Training process where the generator ($G$) and discriminator ($D$) observe the input edge map to classify real and fake pairs [isola2017image].
  • Figure 5: CycleGAN enables unpaired image-to-image translation tasks, such as transforming horse images to zebras, and vice versa, or changing the style of landscapes between summer and winter [CycleGAN2017].
  • ...and 11 more figures
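
The adversarial setup described in the Figure 2 caption can be illustrated with a minimal sketch of the two losses involved. This is not taken from the paper: the function names `discriminator_loss` and `generator_loss` are hypothetical, the inputs are assumed to be the discriminator's output probabilities (values in (0, 1)), and the generator uses the non-saturating loss that is common in practice rather than the original minimax form.

```python
import math


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy that the discriminator D minimizes.

    d_real: D's probabilities on real images (ideally close to 1).
    d_fake: D's probabilities on generated images G(z) (ideally close to 0).
    eps is a small constant (an assumption here) that avoids log(0).
    """
    real_term = sum(-math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: G tries to push D(G(z)) toward 1."""
    return sum(-math.log(p + eps) for p in d_fake) / len(d_fake)


# An undecided discriminator outputs 0.5 everywhere.
undecided_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
undecided_g = generator_loss([0.5, 0.5])

# A confident discriminator separates real from generated samples.
confident_d = discriminator_loss([0.9, 0.9], [0.1, 0.1])
confident_g = generator_loss([0.1, 0.1])

print(f"undecided D: d_loss={undecided_d:.4f}, g_loss={undecided_g:.4f}")
print(f"confident D: d_loss={confident_d:.4f}, g_loss={confident_g:.4f}")
```

When the discriminator becomes more confident, its own loss drops while the generator's loss rises, which is the tension that drives both networks to improve iteratively. In a real training loop these losses would be computed on network outputs and minimized in alternating gradient steps.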