Controlling Neural Style Transfer with Deep Reinforcement Learning

Chengming Feng, Jing Hu, Xin Wang, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Siwei Lyu

TL;DR

This paper addresses the challenge of controllable stylization in Neural Style Transfer by introducing RL-NST, a reinforcement-learning-based framework that decomposes NST into step-wise, progressive decisions. In an actor-critic architecture, the actor samples 2D latent actions that steer a stylizer, enabling flexible, user-controlled stylization while maintaining content fidelity and reducing computation relative to one-step DL models. The method combines perceptual content and style losses with temporal regularization to support both image and video NST. Extensive experiments and ablations validate its effectiveness, showing improvements in stability, quality, and efficiency. The work advances NST by integrating progressive RL control, offering practical benefits for real-time or resource-constrained applications in art-style transfer and video stylization.
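The TL;DR mentions perceptual content and style losses combined with temporal regularization. The sketch below shows the common Gatys-style form of these terms in NumPy. It assumes that feature maps have already been extracted (for example, from a fixed pre-trained VGG) as lists of arrays of shape (C, H, W), one per layer. The layer choice, the weights `alpha`, `beta`, `gamma`, and the plain frame-difference temporal term are illustrative assumptions rather than the paper's exact losses. A video method would typically warp the previous frame with optical flow before comparing.

```python
import numpy as np

def gram_matrix(feat: np.ndarray) -> np.ndarray:
    """Gram matrix of a (C, H, W) feature map, normalized by the number of spatial positions."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (h * w)

def content_loss(feat_out: np.ndarray, feat_content: np.ndarray) -> float:
    # Mean squared difference between output and content feature maps at one layer.
    return float(np.mean((feat_out - feat_content) ** 2))

def style_loss(feats_out, feats_style) -> float:
    # Sum over layers of the mean squared difference between Gram matrices.
    return float(sum(np.mean((gram_matrix(o) - gram_matrix(s)) ** 2)
                     for o, s in zip(feats_out, feats_style)))

def temporal_loss(frame: np.ndarray, prev_frame: np.ndarray) -> float:
    # Hypothetical simplification: penalize change between consecutive stylized
    # frames directly (no optical-flow warping or occlusion mask).
    return float(np.mean((frame - prev_frame) ** 2))

def total_loss(out_feats, content_feats, style_feats, content_layer=2,
               frame=None, prev_frame=None, alpha=1.0, beta=10.0, gamma=1.0) -> float:
    # alpha, beta, gamma and content_layer are illustrative, not values from the paper.
    loss = alpha * content_loss(out_feats[content_layer], content_feats[content_layer])
    loss += beta * style_loss(out_feats, style_feats)
    if frame is not None and prev_frame is not None:
        loss += gamma * temporal_loss(frame, prev_frame)
    return loss

# Usage with random stand-in "features" for three layers.
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]
content = [rng.standard_normal(s) for s in shapes]
style = [rng.standard_normal(s) for s in shapes]
print(total_loss(content, content, style))  # only the style term is non-zero here
```

Because the Gram matrix is C x C, the style image's feature maps do not need to match the output's spatial size.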

Abstract

Controlling the degree of stylization in Neural Style Transfer (NST) is tricky because it usually requires hand-tuning of hyperparameters. In this paper, we propose the first deep Reinforcement Learning (RL) based architecture that splits one-step style transfer into a step-wise process for the NST task. Our RL-based method tends to preserve more details and structures of the content image in early steps and to synthesize more style patterns in later steps, which makes the degree of stylization easy for users to control. Additionally, because our RL-based model performs the stylization progressively, it is lightweight and has lower computational complexity than existing one-step Deep Learning (DL) based models. Experimental results demonstrate the effectiveness and robustness of our method.

Paper Structure

This paper contains 28 sections, 9 equations, 7 figures, 4 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our step-wise style transfer process. Content images become smoothly and progressively more stylized with each prediction step. Our step-wise method makes the degree of stylization easy to control: the model tends to preserve more details and structures of the content image in early steps and to synthesize more style patterns in later steps, so users can easily choose the desired result.
  • Figure 2: Our RL-NST framework. Left: The state is initialized with the content image (or video frame); after the first iteration, only the moving image is used as the state. The latent action ${\bf a}_t$ is sampled from a 2D Gaussian distribution estimated by the policy $\pi_\phi$, i.e., ${\bf a}_t \sim \pi_\phi({\bf a}_t|{\bf s}_t)$, and is concatenated with the critic's output. The predicted moving image is generated by the stylizer $\eta_\psi$. Note that the VGG networks are pre-trained and kept fixed for feature extraction during training. Right: The structure of the actor and stylizer for image and video NST, respectively. More details of the network structure can be found in the Appendix.
  • Figure 3: Qualitative comparison. The first two columns show the content and style images, respectively. The remaining columns show the stylization results of different style transfer methods; the last two columns show our step-wise results at steps 1 and 10, respectively.
  • Figure 4: Comparison of Ours with the Actor-Stylizer (AS) model at step 1 and Ours without the RL model at step 10.
  • Figure 5: Comparison of Ours with AdaIN and StyTR2 in various hyperparameter settings.
  • ...and 2 more figures
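The Figure 2 caption describes a loop. The state starts as the content image, and the policy $\pi_\phi$ yields a 2D Gaussian from which the latent action ${\bf a}_t$ is sampled. The stylizer $\eta_\psi$ then produces the next moving image, which becomes the next state. The sketch below illustrates only that control flow, under heavy assumptions. `toy_actor` and `toy_stylizer` are hypothetical stand-ins (simple image statistics and a blend toward a style image) rather than the paper's networks. The Gaussian is sampled with the reparameterization trick as an assumed design choice. The critic and all training are omitted. Stopping the rollout after fewer steps leaves the result closer to the content image, which mirrors how the step count controls the degree of stylization.

```python
import numpy as np

def toy_actor(state: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical actor: maps the state to the mean and std of a 2D Gaussian,
    then samples a latent action via the reparameterization trick."""
    mean = np.array([state.mean(), state.std()])
    std = np.array([0.1, 0.1])
    eps = rng.standard_normal(2)
    return mean + std * eps  # a_t ~ N(mean, diag(std**2))

def toy_stylizer(state: np.ndarray, action: np.ndarray, style: np.ndarray,
                 step_size: float = 0.2) -> np.ndarray:
    """Hypothetical stylizer: moves the image a small step toward the style image.
    Here the action only modulates the step size (an illustrative assumption)."""
    gate = 1.0 / (1.0 + np.exp(-action[0]))  # sigmoid, always in (0, 1)
    return state + step_size * gate * (style - state)

def rollout(content: np.ndarray, style: np.ndarray, num_steps: int, seed: int = 0):
    """Runs the step-wise loop and returns the moving image after every step."""
    rng = np.random.default_rng(seed)
    state = content.copy()  # s_0 is the content image
    frames = []
    for _ in range(num_steps):
        action = toy_actor(state, rng)               # sample latent action from the policy
        state = toy_stylizer(state, action, style)   # moving image becomes the next state
        frames.append(state)
    return frames

content = np.zeros((3, 8, 8))
style = np.ones((3, 8, 8))
frames = rollout(content, style, num_steps=10)
# Mean distance from the content image for each step; it grows as steps accumulate.
dist = [float(np.abs(f - content).mean()) for f in frames]
print(dist)
```

A user who prefers a milder result would simply pick an earlier entry of `frames` instead of retuning loss weights.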