
Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks

Jiawei Wu, Zhi Jin

TL;DR

The paper tackles bridging image restoration with high-level vision tasks in settings where paired data and retraining are impractical. It reframes the problem as learning a joint distribution between restoration outputs and high-level vision inputs via a variational objective that combines content-preserving reconstruction with maximizing high-level task likelihood, while using self-training to avoid labels. The proposed Unsupervised Variational Translator (VaT) introduces a lightweight translator built from a gated fusion module and a U-shaped transformation module, guided by cycle-consistency and uncertainty-aware self-training to connect pre-trained restoration and vision models without retraining. Across dehazing and low-light scenarios, VaT significantly improves high-level vision performance compared to unsupervised baselines and even surpasses some supervised methods, demonstrating practical potential for real-world degraded environments. The work establishes a principled, unsupervised pathway to align restoration with machine perception and hints at future extensions to multimodal large-language models for broader usefulness.

Abstract

Recent research tries to extend image restoration capabilities from human perception to machine perception, thereby enhancing the performance of high-level vision tasks in degraded environments. These methods, primarily based on supervised learning, typically involve retraining restoration networks or high-level vision networks. However, collecting paired data in real-world scenarios and retraining large-scale models are both challenging. To this end, we propose an unsupervised learning method called Variational Translator (VaT), which does not require retraining existing restoration and high-level vision networks. Instead, it establishes a lightweight network that serves as an intermediate bridge between them. Through variational inference, VaT approximates the joint distribution of the restoration output and the high-level vision input, dividing the optimization objective into preserving content and maximizing the marginal likelihood associated with high-level vision tasks. By leveraging self-training paradigms, VaT achieves this optimization objective without requiring labels. As a result, the translated images closely resemble their original content while also performing well on high-level vision tasks. Extensive experiments in dehazing and low-light enhancement for detection and classification show the superiority of our method over other state-of-the-art unsupervised counterparts, even significantly surpassing supervised methods in some complex real-world scenarios. Code is available at https://github.com/Fire-friend/VaT.

Paper Structure (16 sections, 8 equations, 8 figures, 4 tables, 1 algorithm)

This paper contains 16 sections, 8 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: The fundamental insight that restoration cannot directly enhance high-level vision tasks. While pixel-to-pixel restoration effectively enhances human perception, it may not improve machine perception. This discrepancy arises from the distribution gap between the restoration output and the high-level vision input.
  • Figure 2: Overview of our variational translator (VaT). Uncertainty-guided pseudo-label generation is used to obtain pseudo-labels for degraded images. The Gated Fusion Module (GFM) combines degraded and restored images, which are then fed into the Transformation Module (TM). The acquired pseudo-labels are then employed to supervise the detection prediction of translated images augmented by mixup.
  • Figure 3: Our model enforces cycle-consistency between the fusion output $I_F$ and the VaT output $I_{HQ}$ through three mapping functions: $\mathcal{G}_A: I_R, I_{LQ} \rightarrow I_F$, $\mathcal{T}_A: I_F \rightarrow I_{HQ}$, and $\mathcal{T}_B: I_{HQ} \rightarrow I_F$. (a) Forward cycle-consistency loss: $\mathcal{G}_A(I_R, I_{LQ}) \rightarrow \mathcal{T}_A(\mathcal{G}_A(I_R, I_{LQ})) \rightarrow \mathcal{T}_B(\mathcal{T}_A(\mathcal{G}_A(I_R, I_{LQ}))) \approx \mathcal{G}_A(I_R, I_{LQ})$. (b) Backward cycle-consistency loss: $I_{HQ} \rightarrow \mathcal{T}_B(I_{HQ}) \rightarrow \mathcal{T}_A(\mathcal{T}_B(I_{HQ})) \approx I_{HQ}$.
  • Figure 4: Qualitative comparison of detection results on RTTS [li2018benchmarking]. From top to bottom: detection results on the degraded images, on the images restored by AOD-Net [li2017aod], and on the images translated by VaT.
  • Figure 5: Visual comparison of object detection results on the ExDark dataset.
  • ...and 3 more figures
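
The two cycle-consistency constraints in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the L1 distance is an assumption borrowed from common cycle-consistency practice, images are flattened lists of floats instead of tensors, and `gated_fusion`, `t_a`, and `t_b` are hypothetical stand-ins for $\mathcal{G}_A$, $\mathcal{T}_A$, and $\mathcal{T}_B$.

```python
from typing import Callable, List, Tuple

Image = List[float]  # flattened image; a real model would use tensors
Mapping = Callable[[Image], Image]


def l1_distance(a: Image, b: Image) -> float:
    """Mean absolute difference between two equally sized images."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def gated_fusion(i_r: Image, i_lq: Image, gate: Image) -> Image:
    """Hypothetical stand-in for G_A: per-pixel blend of the restored
    image I_R and the degraded image I_LQ, weighted by a gate in [0, 1]."""
    return [g * r + (1.0 - g) * q for g, r, q in zip(gate, i_r, i_lq)]


def cycle_consistency_losses(
    i_f: Image, i_hq: Image, t_a: Mapping, t_b: Mapping
) -> Tuple[float, float]:
    """Return (forward, backward) cycle-consistency losses.

    forward:  T_B(T_A(I_F))  should reproduce I_F
    backward: T_A(T_B(I_HQ)) should reproduce I_HQ
    The L1 distance is an assumed choice, not taken from the paper.
    """
    forward = l1_distance(t_b(t_a(i_f)), i_f)
    backward = l1_distance(t_a(t_b(i_hq)), i_hq)
    return forward, backward


# Toy mappings (assumptions for illustration): T_A doubles every pixel
# and T_B halves it, so the two mappings are exact inverses.
def t_a(img: Image) -> Image:
    return [2.0 * x for x in img]


def t_b(img: Image) -> Image:
    return [0.5 * x for x in img]


i_f = gated_fusion(i_r=[0.8, 0.4], i_lq=[0.2, 0.0], gate=[0.5, 0.5])
i_hq = t_a(i_f)
forward, backward = cycle_consistency_losses(i_f, i_hq, t_a, t_b)
```

Because the toy mappings invert each other, both losses vanish here. Replacing `t_b` with a mapping that is not the inverse of `t_a` makes the forward loss positive, which is the signal a training loop would minimize.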
