Conditional Consistency Guided Image Translation and Enhancement

Amil Bhagat, Milind Jain, A. V. Subramanyam

TL;DR

This paper tackles multi-domain image translation and low-light image enhancement by extending consistency models with conditional inputs. It introduces Conditional Consistency Models (CCMs), which use a conditional image to guide the denoising process, enabling conditional translation and enhancement with single-step inference and without adversarial training. The authors formulate a conditional consistency function, propose Conditional Consistency Training (CCT), and validate the approach across ten datasets, showing competitive or superior performance on several benchmarks while demonstrating strong generalization. The work offers a fast, robust alternative to diffusion and GAN-based approaches, with potential practical impact in cross-modal vision and medical imaging workflows.

Abstract

Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement, remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets, demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at https://github.com/amilbhagat/Conditional-Consistency-Models.

Paper Structure (17 sections, 11 equations, 5 figures, 4 tables, 1 algorithm)

Figures (5)

  • Figure 1: Model architecture. The model takes a pair of visible-infrared, HE-IHC, or low-light and well-exposed images. The visible, HE, or low-light image acts as the conditional input. Noise corresponding to time step $t$ is added to the input infrared, IHC, or well-exposed image. This noisy image is then concatenated with the conditional input and fed to the U-Net. The model can then be sampled to obtain the infrared, IHC, or enhanced image.
  • Figure 2: Comparison of (a) Visible, (b) Ground Truth Infrared, and (c) Generated Infrared images.
  • Figure 3: Comparison of (a) HE, (b) Ground Truth IHC, and (c) Generated IHC images.
  • Figure 4: Comparison of results on different datasets. First row: LoL-v1, Second row: LoLv2-real, Last row: LoLv2-synthetic. Columns: (a) Low-Light Input, (b) RetinexFormer, (c) Ours, and (d) Ground Truth.
  • Figure 5: First row: LIME, Second row: NPE, Third row: MEF, Fourth row: DICM, Last row: VV. Columns: (a) Input Image, (b) RetinexFormer, (c) Ours.
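
The input construction described in the Figure 1 caption (noise the target image at time step $t$, then concatenate it with the conditional image before the U-Net) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `build_unet_input`, the channel-first array layout, the image sizes, and the additive schedule `target + t * noise` (the form commonly used for consistency models) are all assumptions made here.

```python
import numpy as np

def build_unet_input(target, condition, t, rng):
    """Noise the target image at level t and stack it with the condition image.

    Hypothetical helper: arrays are assumed to be channel-first (C, H, W).
    """
    if target.shape[1:] != condition.shape[1:]:
        raise ValueError("target and condition must share height and width")
    noise = rng.standard_normal(target.shape)
    # Assumed schedule: Gaussian noise scaled by the time step t.
    noisy_target = target + t * noise
    # Concatenate along the channel axis so the network sees both images.
    return np.concatenate([noisy_target, condition], axis=0)

rng = np.random.default_rng(0)
condition = rng.random((3, 64, 64))  # e.g. a visible, HE, or low-light image
target = rng.random((1, 64, 64))     # e.g. the paired infrared image
unet_input = build_unet_input(target, condition, t=0.5, rng=rng)
print(unet_input.shape)  # (4, 64, 64)
```

Under these assumptions the conditional image passes through unchanged and only the target channels carry noise, which matches the caption's description of the visible, HE, or low-light image acting as guidance.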