RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning

Jing Hu, Ziwei Luo, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Xin Li, Hongtu Zhu, Siwei Lyu, Xin Wang

TL;DR

This work reframes image-to-image translation (I2IT) as a stepwise decision process and introduces RL-I2IT, a lightweight Planner-Actor-Critic framework guided by a latent, low-dimensional Plan. A stochastic meta-policy maps states to plans and plans to actions, which addresses the challenge of high-dimensional continuous actions; the critic evaluates plans rather than actions to stabilize learning. Across face inpainting, realistic photo translation, and neural style transfer, RL-I2IT achieves strong performance while remaining computationally efficient, outperforms several baselines, and provides robust intermediate outputs at each step. The study also introduces task-specific auxiliary learning and a flexible environment design, and it points to extensions to broader I2IT tasks and to future improvements such as adaptive stopping and temporal consistency for video tasks.

Abstract

Most existing Image-to-Image Translation (I2IT) methods generate images in a single run of a deep learning (DL) model. However, designing such a single-step model is challenging: it requires a huge number of parameters and easily falls into bad minima and overfitting. In this work, we reformulate I2IT as a step-wise decision-making problem via deep reinforcement learning (DRL) and propose a novel framework that performs RL-based I2IT (RL-I2IT). The key feature of the RL-I2IT framework is that it decomposes a monolithic learning process into small steps, using a lightweight model to progressively transform a source image into a target image. Because it is challenging to handle high-dimensional continuous state and action spaces in the conventional RL framework, we add a meta policy with a new concept, the Plan, to the standard Actor-Critic model. The plan is of lower dimension than the original image and helps the actor generate a tractable high-dimensional action. In the RL-I2IT framework, we also employ a task-specific auxiliary learning strategy to stabilize the training process and improve the performance of the corresponding task. Experiments on several I2IT tasks demonstrate the effectiveness and robustness of the proposed method when facing high-dimensional continuous action space problems. Our implementation of the RL-I2IT framework is available at https://github.com/Algolzw/SPAC-Deformable-Registration.
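
The stepwise Planner-Actor-Critic loop described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class names, the toy dimensions (`IMG_DIM`, `PLAN_DIM`), the linear layers, the residual form of the action, and the negative mean-squared-error reward are all assumptions chosen for illustration, and the weights are random and untrained. The sketch only shows the data flow: the planner samples a low-dimensional plan from the state, the actor expands the plan into an image-sized action, and the critic scores the state-plan pair rather than the action.

```python
import numpy as np

IMG_DIM, PLAN_DIM = 64, 8  # toy sizes (assumptions): flattened image, latent plan


class Planner:
    """Stochastic state-to-plan mapping: p_t ~ N(mean(s_t), std^2)."""

    def __init__(self, rng):
        self.rng = rng
        self.W = rng.normal(scale=0.1, size=(PLAN_DIM, IMG_DIM))
        self.std = 0.1

    def sample(self, state):
        mean = np.tanh(self.W @ state)
        return mean + self.std * self.rng.normal(size=PLAN_DIM)


class Actor:
    """Plan-to-action mapping: expands the plan into an image-sized action."""

    def __init__(self, rng):
        self.W = rng.normal(scale=0.1, size=(IMG_DIM, PLAN_DIM))

    def act(self, state, plan):
        # Assumption: the action is the next image, built as state + residual.
        return state + self.W @ plan


class Critic:
    """Scores the (state, plan) pair, not the high-dimensional action."""

    def __init__(self, rng):
        self.w = rng.normal(scale=0.1, size=IMG_DIM + PLAN_DIM)

    def value(self, state, plan):
        return float(self.w @ np.concatenate([state, plan]))


def rollout(source, target, steps=5, seed=0):
    """Run `steps` translation steps and return every intermediate result."""
    rng = np.random.default_rng(seed)
    planner, actor, critic = Planner(rng), Actor(rng), Critic(rng)
    state, trajectory = source, []
    for _ in range(steps):
        plan = planner.sample(state)
        action = actor.act(state, plan)
        value = critic.value(state, plan)
        # Toy reward (assumption): negative mean squared error to the target.
        reward = -float(np.mean((action - target) ** 2))
        trajectory.append(
            {"plan": plan, "image": action, "value": value, "reward": reward}
        )
        state = action  # the toy environment returns the new image as next state
    return trajectory


source = np.zeros(IMG_DIM)
target = np.ones(IMG_DIM)
trajectory = rollout(source, target)
```

Each entry of `trajectory` holds one intermediate image, so the translation can be inspected or stopped at any step. A real system would replace the linear maps with convolutional networks and train them with an RL objective plus the task-specific auxiliary losses.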

Paper Structure (27 sections, 21 equations, 10 figures, 6 tables, 1 algorithm)

Figures (10)

  • Figure 1: Top: The I2IT problem translates an image from a source domain to a target domain. Mid: Example of a one-step method, CGAN (Isola et al., 2017). Bottom: Our RL-based stepwise I2IT progressively transforms the source image, so each step of the process is visible.
  • Figure 2: Our RL-I2IT framework with a Planner-Actor-Critic structure. Left: At time step $t$, the environment receives the executable action ${\bf a}_t$ and outputs the state and reward (${\bf s}_t, r_t$). In our meta policy, the latent plan ${\bf p}_t$ is sampled from the planner to guide the actor, which generates the executable action ${\bf a}_t$ that interacts with the environment. The plan is also evaluated by the critic. The nature of ${\bf a}_t$ is task-dependent: for tasks aiming at realistic image generation, such as face inpainting or neural style transfer, ${\bf a}_t$ could directly be the target image. Right: Task-specific auxiliary learning objectives vary by task and serve purposes such as stabilizing the training process or improving performance.
  • Figure 3: The network architecture of RL-I2IT for face inpainting. Each rectangle represents a 2D image (or feature map), the number of channels is shown inside the rectangle, and the corresponding resolution is printed underneath (or on the left for the discriminator).
  • Figure 4: Visual comparison of different face inpainting methods. GT means ground truth. RL-I2IT uses SNGAN for auxiliary learning. $\#$ indicates which reward is used for RL training. Our results have good visual quality even for faces with large poses.
  • Figure 5: Visual comparison of our RL-I2IT with pix2pix, PAN, pix2pixHD, and DRPAN over photo translation tasks.
  • ...and 5 more figures