Tell Me What You See: Text-Guided Real-World Image Denoising

Erez Yosef, Raja Giryes

TL;DR

This work tackles denoising in real-world, low-light imaging by using a text caption as an extra prior. It introduces a text-guided diffusion model operating in the raw sensor domain, conditioned on CLIP embeddings of scene captions and fused with the noisy input. A realistic sensor-noise model and a LoRA-based fine-tuning pipeline bridge simulated and real-world statistics, with evaluations on synthetic and real captures from two cameras. The results show enhanced perceptual quality and texture fidelity when captions guide reconstruction, suggesting a practical pathway for caption-assisted photography.
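
The LoRA-based fine-tuning mentioned above can be illustrated with a minimal sketch. The summary does not say which layers are adapted or what rank is used, so the `LoRALinear` class, its `rank` and `alpha` parameters, and the initialization below are illustrative assumptions following the generic low-rank adaptation recipe, not the authors' implementation.

```python
import numpy as np


class LoRALinear:
    """Hypothetical linear layer with a frozen weight and a low-rank update.

    Output is x @ W^T + (alpha / rank) * x @ A^T @ B^T, where only A and B
    would be trained during fine-tuning.
    """

    def __init__(self, weight, rank=4, alpha=4.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # Pre-trained weight of shape (out_dim, in_dim); kept fixed.
        self.weight = np.asarray(weight, dtype=np.float64)
        out_dim, in_dim = self.weight.shape
        # Trainable factors (assumed init): A is small random, B is zero,
        # so the adapted layer starts out identical to the frozen one.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update
```

Because `B` starts at zero, fine-tuning begins from the pre-trained behavior, and only `rank * (in_dim + out_dim)` parameters per adapted layer need to be learned from the real-noise captures.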

Abstract

Image reconstruction from noisy sensor measurements is challenging, and many methods have been proposed for it. Most approaches focus on learning robust natural image priors while modeling the scene's noise statistics. In extremely low-light conditions, these methods often remain insufficient, and additional information is needed, such as multiple captures. As an alternative, we propose using a text-based description of the scene as an additional prior, something the photographer can easily provide. Inspired by the remarkable success of text-guided diffusion models in image generation, we show that adding image caption information significantly improves image denoising and reconstruction for both synthetic and real-world images.

Paper Structure (10 sections, 7 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Raw noisy images captured with a smartphone camera (left) were reconstructed using diffusion models, both without (center) and with (right) a text caption. The contribution of the text description to the reconstruction and perceptual quality is significant.
  • Figure 2: Proposed training framework: Initially (left), we train a diffusion model on the COCO-captions dataset (Chen et al., 2015) with simulated noise. Then (right), we fine-tune the model for real-world noise on screen-captured images. Each sample is captured twice with different camera settings, resulting in noisy and clean image pairs for training. In this scheme, the weights of the blue blocks are pre-trained and fixed, while the weights of the yellow blocks are trained in each stage.
  • Figure 3: Our dataset consists of raw noisy images containing objects and their corresponding captions. Existing raw noisy image datasets (e.g., SIDD (Abdelhamed et al., 2018) and DND (Plötz and Roth, 2017)) include patches from large images that lack captionable content. In contrast, caption datasets (e.g., COCO-captions (Chen et al., 2015)) contain RGB images with captions but do not include raw noisy sensor images. Therefore, our proposed dataset is novel and contributes to text-guided real-world image denoising tasks.
  • Figure 4: Low simulated noise results. Comparison of various methods for raw image denoising at a noise level of 0.1 ($\log\lambda_{shot}=0.1$ and $\log\lambda_{read}=0.2$). Our models achieve superior performance, with text guidance significantly enhancing perceptual quality, details, and textures.
  • Figure 5: Real-world denoising comparison of various methods applied to Samsung S21 camera captures. Samples from the COCO dataset were displayed on a screen and captured twice under different settings to generate noisy and GT pairs. Our models were fine-tuned on real data using the proposed approach. Our text-guided model achieves superior results compared to competing methods, including a non-text-guided diffusion model. The tag raw/RGB indicates whether denoising was applied to raw or RGB images.
  • ...and 10 more figures
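
The caption of Figure 4 parameterizes the simulated noise by a shot term $\lambda_{shot}$ and a read term $\lambda_{read}$. As a rough sketch of how such noise is often simulated on raw images, the snippet below uses the heteroscedastic Gaussian approximation common in the raw-denoising literature, with variance $\lambda_{shot} \cdot x + \lambda_{read}$. The exact model used in the paper is not given in this summary, so the function name `add_shot_read_noise` and the variance formula are assumptions.

```python
import numpy as np


def add_shot_read_noise(clean, lambda_shot, lambda_read, rng=None):
    """Add signal-dependent (shot) and signal-independent (read) noise.

    Assumed model: noisy = clean + n, with n ~ N(0, lambda_shot * clean + lambda_read).
    `clean` is a linear raw-domain image with non-negative intensities.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=np.float64)
    # Clip so negative pixel values cannot produce a negative variance.
    variance = lambda_shot * np.clip(clean, 0.0, None) + lambda_read
    return clean + rng.standard_normal(clean.shape) * np.sqrt(variance)
```

Under this assumed model, brighter pixels receive more noise in absolute terms, while dark regions are dominated by the read term, which is why low-light captures are the hard case.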