Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks
Jiawei Wu, Zhi Jin
TL;DR
The paper tackles bridging image restoration with high-level vision tasks in settings where paired data and retraining are impractical. It reframes the problem as learning a joint distribution between restoration outputs and high-level vision inputs via a variational objective that combines content-preserving reconstruction with maximizing high-level task likelihood, while using self-training to avoid labels. The proposed Unsupervised Variational Translator (VaT) introduces a lightweight translator built from a gated fusion module and a U-shaped transformation module, guided by cycle-consistency and uncertainty-aware self-training to connect pre-trained restoration and vision models without retraining. Across dehazing and low-light scenarios, VaT significantly improves high-level vision performance compared to unsupervised baselines and even surpasses some supervised methods, demonstrating practical potential for real-world degraded environments. The work establishes a principled, unsupervised pathway to align restoration with machine perception and hints at future extensions to multimodal large language models.
Abstract
Recent research seeks to extend image restoration capabilities from human perception to machine perception, thereby enhancing the performance of high-level vision tasks in degraded environments. These methods, primarily based on supervised learning, typically involve retraining restoration networks or high-level vision networks. However, collecting paired data in real-world scenarios and retraining large-scale models are both challenging. To this end, we propose an unsupervised learning method called Variational Translator (VaT), which does not require retraining existing restoration and high-level vision networks. Instead, it establishes a lightweight network that serves as an intermediate bridge between them. Through variational inference, VaT approximates the joint distribution of the restoration output and the high-level vision input, dividing the optimization objective into preserving content and maximizing the marginal likelihood associated with high-level vision tasks. By leveraging self-training paradigms, VaT achieves this optimization objective without requiring labels. As a result, the translated images maintain a close resemblance to their original content while also demonstrating exceptional performance on high-level vision tasks. Extensive experiments in dehazing and low-light enhancement for detection and classification show the superiority of our method over other state-of-the-art unsupervised counterparts, even significantly surpassing supervised methods in some complex real-world scenarios. Code is available at https://github.com/Fire-friend/VaT.
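
To make the two-part objective concrete, the following is a minimal sketch of a loss that adds a content-preservation term to a label-free task term driven by the vision model's own prediction. It is not the loss used in VaT: the function names, the L1 content term, the confidence threshold, and the weight `lam` are all illustrative assumptions, and the fixed threshold is only a simple stand-in for uncertainty-aware self-training.

```python
import math


def content_loss(restored, translated):
    """Mean absolute difference between two equally sized flat pixel lists.

    Assumption: an L1 distance stands in for "preserving content".
    """
    if len(restored) != len(translated):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(r - t) for r, t in zip(restored, translated)) / len(restored)


def pseudo_label_loss(probs, conf_threshold=0.5):
    """Negative log-likelihood of the vision model's own top prediction.

    The top class acts as a pseudo-label, so no ground-truth label is needed.
    Predictions below conf_threshold are treated as too uncertain and
    contribute nothing (hypothetical stand-in for uncertainty weighting).
    """
    confidence = max(probs)
    if confidence < conf_threshold:
        return 0.0
    return -math.log(confidence)


def translator_loss(restored, translated, probs, lam=1.0, conf_threshold=0.5):
    """Content term plus lam times the pseudo-label task term."""
    return content_loss(restored, translated) + lam * pseudo_label_loss(
        probs, conf_threshold
    )
```

Here `restored` would be the restoration network's output, `translated` the translator's output, and `probs` the class probabilities the frozen high-level vision model assigns to the translated image. Only the translator would be updated against such a loss, which is what keeps both pre-trained networks untouched.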
