Multi-task Image Restoration Guided By Robust DINO Features

Xin Lin, Jingtong Yue, Kelvin C. K. Chan, Lu Qi, Chao Ren, Jinshan Pan, Ming-Hsuan Yang

TL;DR

This work tackles multi-task image restoration by leveraging degradation-agnostic semantic representations from the pre-trained DINOv2 model. It introduces DINO-IR, a framework that fuses shallow and deep DINOv2 features through a pixel-semantic fusion (PSF) module, adapts and merges them with a restoration backbone via a self-attention-based DINO-Restore (D-R) adaption and fusion module, and enforces robust guidance with a DINO perception contrastive (DPC) loss. Across four degradations, DINO-IR outperforms state-of-the-art multi-task methods, especially on unseen data, while single-task performance remains competitive. The results suggest that robust features from large pre-trained vision models can improve both efficiency and performance in multi-task restoration, providing a practical route toward degradation-agnostic restoration systems.

Abstract

Multi-task image restoration has gained significant interest because it is more versatile and efficient than its single-task counterpart. However, performance declines as the number of tasks increases, primarily because a single restoration model struggles to handle tasks of distinct natures at the same time. This has motivated a line of work that explores the degradation-insensitive semantic commonalities among different degradation tasks. In this paper, we observe that the features of DINOv2 effectively model semantic information and are independent of degradation factors. Motivated by this observation, we propose DINO-IR, a multi-task image restoration approach that leverages robust features extracted from DINOv2 to solve multiple image restoration tasks simultaneously. We first propose a pixel-semantic fusion (PSF) module to dynamically fuse DINOv2's shallow features, which contain pixel-level information, and its deep features, which contain degradation-independent semantic information. To guide the restoration model with the features of DINOv2, we develop a DINO-Restore adaption and fusion module that adjusts the channel dimension of the fused features from PSF and then integrates them with the features from the restoration model. Having combined these modules into a unified deep model, we propose a DINO perception contrastive loss to constrain model training. Extensive experimental results demonstrate that DINO-IR performs favorably against existing multi-task image restoration approaches on various tasks by a large margin. The source code and trained models will be made available.
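The abstract names a DINO perception contrastive loss but this excerpt does not give its equation. As a rough illustration of the general idea only, the sketch below uses a ratio-style contrastive form that is an assumption here: it pulls the features of the restored image toward those of the clean image (positive) and pushes them away from those of the degraded input (negative). The function names, the L1 distance, and the random stand-in features are all hypothetical; in practice a frozen feature extractor such as DINOv2 would supply the features.

```python
import numpy as np


def l1_distance(a, b):
    """Mean absolute difference between two feature arrays of equal shape."""
    return float(np.mean(np.abs(a - b)))


def contrastive_feature_loss(restored_feat, clean_feat, degraded_feat, eps=1e-7):
    """Ratio-style contrastive loss (assumed form, not the paper's equation).

    The value is small when the restored features are close to the clean
    (positive) features and far from the degraded (negative) features.
    """
    pull = l1_distance(restored_feat, clean_feat)     # distance to the positive
    push = l1_distance(restored_feat, degraded_feat)  # distance to the negative
    return pull / (push + eps)


# Random arrays stand in for features from a frozen extractor (hypothetical shapes).
rng = np.random.default_rng(0)
clean = rng.normal(size=(16, 384))
degraded = clean + rng.normal(scale=1.0, size=clean.shape)

# Two hypothetical restoration outputs: one near the clean features, one near the degraded ones.
good = clean + 0.1 * (degraded - clean)
poor = clean + 0.9 * (degraded - clean)

loss_good = contrastive_feature_loss(good, clean, degraded)
loss_poor = contrastive_feature_loss(poor, clean, degraded)
print(f"good restoration: {loss_good:.3f}, poor restoration: {loss_poor:.3f}")
```

Under this assumed form, a better restoration yields a lower loss, so the term can be added to a pixel-level reconstruction loss as a regularizer.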

Paper Structure (13 sections, 4 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: (a) Results of single-task and multi-task image restoration for Restormer [restormer] and our DINO-IR; ours performs better on these tasks. (b) Deblurring performance of our method and Restormer as the number of restoration tasks increases. Both DINO-IR and Restormer decline in performance. However, thanks to the stable features provided by DINOv2, our method is more stable: its decline is smaller, and it ultimately surpasses the baseline.
  • Figure 2: The DINO-IR framework comprises the following components: 1. Pixel-semantic fusion (PSF) module and DINO-Restore (D-R) adaption and fusion module: these fuse shallow, medium, and deep DINOv2 features to guide restoration. 2. Restoration: this component takes low-quality images as input and uses a restoration network to restore them to clean versions. 3. DINO perception contrastive (DPC) loss: this loss enhances performance through contrastive learning on DINOv2 features.
  • Figure 3: (a) We obtain rainy, noisy, and blurry images by degrading a clean image in three ways. We then extract features from these images with the deep layer of DINOv2 and visualize them using the feature projection of [dino1]. (b) We compare how the PSNR deviation of Image, $f_{\mathrm{IMAGE}}$, and $f_{\mathrm{DINO}}$ varies as the noise level increases.
  • Figure 4: (a) The architecture of the pixel-semantic fusion (PSF) module, which fuses the shallow pixel-level and deep semantic features of DINOv2. (b) The architecture of the DINO-Restore (D-R) adaption and fusion module, which adjusts the size of the DINOv2 features to fit the restoration network and then fuses them with the features from the restoration network using a self-attention-based method.
  • Figure 5: Visualization of the features from the shallow layers of DINOv2 for heavy-rain, light-rain, and rain-free images.
  • ...and 2 more figures
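
The captions for Figures 2 and 4 describe two steps: fusing shallow and deep DINOv2 features (PSF), then resizing the result and merging it with restoration features through an attention-based method (D-R). The sketch below is one minimal way such a pipeline could look, not the authors' architecture. The sigmoid gate, the single linear projection, the attention layout (queries from the restoration features, keys and values from the guidance), and all names and shapes are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixel_semantic_fuse(shallow, deep, w_gate):
    """Gate-weighted blend of shallow and deep token features, each of shape (N, C).

    The gate is predicted from both inputs, so the mix varies per token and
    per channel (an assumed reading of "dynamically fuse").
    """
    gate = sigmoid(np.concatenate([shallow, deep], axis=-1) @ w_gate)  # (N, C)
    return gate * shallow + (1.0 - gate) * deep


def adapt_and_fuse(restore_feat, guide_feat, w_adapt):
    """Project guidance features to the restoration width, then attend over them."""
    guide = guide_feat @ w_adapt                    # channel adaption: (N, C) -> (N, C_r)
    scale = np.sqrt(restore_feat.shape[-1])
    attn = softmax(restore_feat @ guide.T / scale)  # (M, N); each row sums to 1
    return restore_feat + attn @ guide              # residual fusion: (M, C_r)


# Hypothetical sizes: 16 guidance tokens of width 8, 32 restoration tokens of width 4.
rng = np.random.default_rng(0)
shallow = rng.normal(size=(16, 8))
deep = rng.normal(size=(16, 8))
restore = rng.normal(size=(32, 4))
w_gate = 0.1 * rng.normal(size=(16, 8))   # stand-in for learned gate weights
w_adapt = 0.1 * rng.normal(size=(8, 4))   # stand-in for learned projection weights

fused = pixel_semantic_fuse(shallow, deep, w_gate)
out = adapt_and_fuse(restore, fused, w_adapt)
print(fused.shape, out.shape)
```

The residual connection keeps the restoration features intact when the guidance contributes little, which is a common design choice rather than something stated in the captions.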