Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances

Yuanzhi Liang, Yijie Fang, Rui Li, Ziqi Ni, Ruijie Su, Chi Zhang

TL;DR

This survey addresses the misalignment between common surrogate objectives and perceptual, semantic, and physical realism in visual generation. It positions reinforcement learning as a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives, and organizes contemporary advances across image, video, and 3D generation. Key contributions include a structured account of RL’s evolution, a taxonomy of RL-enhanced generation methods (PPO-based, DPO-based, GRPO-based), and insights into mechanisms, human-alignment strategies, and world-model integration. The work highlights the practical impact of RL in improving controllability, temporal consistency, and human-aligned quality, while outlining open challenges and promising directions for future research at the intersection of RL and visual generative modeling.

Abstract

Generative models have made significant progress in synthesizing visual content, including images, videos, and 3D/4D structures. However, they are typically trained with surrogate objectives such as likelihood or reconstruction loss, which often misalign with perceptual quality, semantic accuracy, or physical realism. Reinforcement learning (RL) offers a principled framework for optimizing non-differentiable, preference-driven, and temporally structured objectives. Recent advances demonstrate its effectiveness in enhancing controllability, consistency, and human alignment across generative tasks. This survey provides a systematic overview of RL-based methods for visual content generation. We review the evolution of RL from classical control to its role as a general-purpose optimization tool, and examine its integration into image, video, and 3D/4D generation. Across these domains, RL serves not only as a fine-tuning mechanism but also as a structural component for aligning generation with complex, high-level goals. We conclude with open challenges and future research directions at the intersection of RL and generative modeling.

Paper Structure

This paper contains 24 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Growth of publications at the intersection of reinforcement learning and visual content generation (2019–2025). Publications grew exponentially, from 13 papers in 2019–2020 to 91 in 2024–2025 (as of July 30). With 77 papers already published in the first half of 2025, the year is projected to exceed 140 publications. This trend reflects the field’s transition from exploration to consolidation and its growing strategic importance in visual generation research.
  • Figure 2: Proportional breakdown of recent research topics applying RL to 3D generation: Text-to-NeRF/3D Gaussian Splatting, 3D diffusion models, multi-view consistency optimization, human motion synthesis, point cloud modeling, and others. The balanced distribution across these areas reflects a rapidly emerging and diversifying field in which no single paradigm dominates, suggesting that RL is being broadly explored as a general-purpose optimization tool across the full spectrum of 3D generation tasks.