InstructRestore: Region-Customized Image Restoration with Human Instructions

Shuaizheng Liu, Jianqi Ma, Lingchen Sun, Xiangtao Kong, Lei Zhang

TL;DR

Existing diffusion-based image restoration methods apply uniform processing across the image and cannot honor region-specific human instructions. InstructRestore introduces a region-aware restoration framework that uses a ControlNet-like conditioning along with a region mask decoder, trained on a large 536,945-triplet dataset of HQ images, region masks, and region captions. Key contributions include a scalable data generation pipeline, a region-customized diffusion model, and demonstrations of localized enhancement and bokeh-preserving restoration under natural-language instructions. This work enables interactive, fine-grained image restoration with practical applications in photography and scene editing.

Abstract

Despite the significant progress in diffusion prior-based image restoration, most existing methods apply uniform processing to the entire image, lacking the capability to perform region-customized image restoration according to user instructions. In this work, we propose a new framework, namely InstructRestore, to perform region-adjustable image restoration following human instructions. To achieve this, we first develop a data generation engine to produce training triplets, each consisting of a high-quality image, the target region description, and the corresponding region mask. With this engine and careful data screening, we construct a comprehensive dataset comprising 536,945 triplets to support the training and evaluation of this task. We then examine how to integrate the low-quality image features under the ControlNet architecture to adjust the degree of image details enhancement. Consequently, we develop a ControlNet-like model to identify the target region and allocate different integration scales to the target and surrounding regions, enabling region-customized image restoration that aligns with user instructions. Experimental results demonstrate that our proposed InstructRestore approach enables effective human-instructed image restoration, such as images with bokeh effects and user-instructed local enhancement. Our work advances the investigation of interactive image restoration and enhancement techniques. Data, code, and models will be found at https://github.com/shuaizhengliu/InstructRestore.git.
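The idea of allocating different integration scales to the target and surrounding regions can be illustrated with a minimal NumPy sketch. It assumes the ControlNet-like branch yields a feature map that is added to the backbone features, and that a soft mask in [0, 1] marks the target region. The function name `modulate_control`, the parameters `scale_in` and `scale_out`, and the additive blending form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def modulate_control(backbone_feat, control_feat, mask, scale_in=1.0, scale_out=0.5):
    """Add control features to backbone features with a per-pixel scale.

    backbone_feat, control_feat: arrays of shape (C, H, W).
    mask: array of shape (H, W) with values in [0, 1]; 1 marks the target region.
    scale_in / scale_out: hypothetical integration scales inside / outside the mask.
    """
    mask = np.clip(np.asarray(mask, dtype=np.float32), 0.0, 1.0)
    # Per-pixel scale map: scale_in where mask == 1, scale_out where mask == 0.
    scale_map = scale_in * mask + scale_out * (1.0 - mask)
    # Broadcast the (H, W) scale map over the channel dimension.
    return backbone_feat + scale_map[None, :, :] * control_feat

# Toy example: zero backbone features, unit control features, diagonal target region.
backbone = np.zeros((2, 2, 2), dtype=np.float32)
control = np.ones((2, 2, 2), dtype=np.float32)
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
out = modulate_control(backbone, control, mask, scale_in=1.2, scale_out=0.4)
```

In this toy setup, pixels inside the mask receive the control features at scale 1.2 and pixels outside at scale 0.4, which is one simple way to express "stronger enhancement here, weaker there" in a single feature map.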

Paper Structure

This paper contains 18 sections, 3 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Our proposed InstructRestore framework enables region-customized restoration following human instructions. As shown in (a), current methods wu2024seesr, yu2024scaling tend to incorrectly restore the bokeh blur, while our method allows adjustable control over the degree of blur based on user instructions. In (b), existing methods fail to achieve region-specific enhancement intensities, while our approach can simultaneously suppress over-enhancement in the building areas and improve the visual quality in the leaf areas.
  • Figure 2: Illustration of the annotation pipeline. For selected high-quality images, Semantic-SAM li2024segment generates initial masks, followed by Osprey yuan2024osprey for region-level descriptions. Qwen yang2024qwen2 reformats descriptions into noun phrases and extracts semantic subjects. Identical semantics are merged to produce final masks and region captions.
  • Figure 3: Subject distribution word cloud generated from region captions in our dataset. Word size corresponds to the relative frequency of extracted subjects.
  • Figure 4: Framework of InstructRestore. Red and green arrows denote the training and inference processes, respectively. During testing, user instructions are parsed to generate target-region semantic masks, and different modulation coefficients are applied to the conditional features inside and outside the mask regions, enabling instruction-guided, region-adaptive restoration.
  • Figure 5: Localized enhancement following instructions on real-world test data. The details in the flowers are enhanced gradually while the other regions remain almost unchanged.
  • ...and 13 more figures
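
The last step of the annotation pipeline in the Figure 2 caption, merging regions with identical semantics, can be sketched as follows. This is a minimal illustration that assumes each region arrives as a (subject, binary mask) pair and that "identical semantics" means the same subject string after trimming and lowercasing. The function name `merge_masks_by_subject` and that normalization rule are assumptions; the paper's actual matching criterion is not given in this excerpt.

```python
import numpy as np

def merge_masks_by_subject(regions):
    """Union the masks of regions that share the same (normalized) subject.

    regions: iterable of (subject, mask) pairs, where mask is an (H, W) binary array.
    Returns a dict mapping each normalized subject to its merged boolean mask.
    """
    merged = {}
    for subject, mask in regions:
        key = subject.strip().lower()  # hypothetical normalization rule
        mask = np.asarray(mask, dtype=bool)
        if key in merged:
            # Same subject seen before: take the pixel-wise union.
            merged[key] = np.logical_or(merged[key], mask)
        else:
            merged[key] = mask.copy()
    return merged

# Toy example: two "flower" regions and one "leaf" region on a 2x2 image.
regions = [
    ("flower", [[1, 0], [0, 0]]),
    ("Flower ", [[0, 0], [0, 1]]),
    ("leaf", [[0, 1], [0, 0]]),
]
merged = merge_masks_by_subject(regions)
```

Here the two flower regions collapse into a single mask covering both pixels, while the leaf region keeps its own mask.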