Tell Me What You See: Text-Guided Real-World Image Denoising
Erez Yosef, Raja Giryes
TL;DR
This work tackles denoising in real-world, low-light imaging by using a text caption as an extra prior. It introduces a text-guided diffusion model operating in the raw sensor domain, conditioned on CLIP embeddings of scene captions and fused with the noisy input. A realistic sensor-noise model and a LoRA-based fine-tuning pipeline bridge simulated and real-world statistics, with evaluations on synthetic and real captures from two cameras. The results show enhanced perceptual quality and texture fidelity when captions guide reconstruction, suggesting a practical pathway for caption-assisted photography.
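To make the notion of a sensor-noise model concrete, the sketch below simulates signal-dependent noise on linear raw intensities. It uses the common heteroscedastic Gaussian approximation of Poisson-Gaussian noise, in which the variance grows linearly with the signal (shot noise) on top of a constant read-noise floor. This is an illustrative assumption, not the noise model used in this work; the function name `add_sensor_noise` and the parameters `gain` and `read_std` are hypothetical.

```python
import random


def add_sensor_noise(raw, gain=0.01, read_std=0.02, rng=None):
    """Add approximate shot + read noise to linear raw intensities.

    Assumption (not from the paper): noise at each pixel is Gaussian with
    variance = gain * signal + read_std**2, a standard approximation of
    Poisson-Gaussian sensor noise.
    """
    rng = rng or random.Random()
    noisy = []
    for x in raw:
        # Shot-noise variance scales with the (non-negative) signal level;
        # read noise contributes a signal-independent floor.
        variance = gain * max(x, 0.0) + read_std ** 2
        noisy.append(x + rng.gauss(0.0, variance ** 0.5))
    return noisy


def sample_variance(values):
    """Unbiased sample variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)
```

Under this approximation, bright pixels receive more absolute noise than dark ones, while in low light the read-noise floor dominates the small signal. That is the regime in which an additional prior such as a caption becomes useful.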
Abstract
Image reconstruction from noisy sensor measurements is a challenging problem for which many methods have been proposed. Most approaches focus on learning robust natural image priors while modeling the scene's noise statistics. In extremely low-light conditions, however, these methods often remain insufficient, and additional information is needed, such as multiple captures. As an alternative, we propose using a text-based description of the scene as an additional prior, something the photographer can easily provide. Inspired by the remarkable success of text-guided diffusion models in image generation, we show that adding image caption information significantly improves image denoising and reconstruction for both synthetic and real-world images.
