TC-PDM: Temporally Consistent Patch Diffusion Models for Infrared-to-Visible Video Translation

Anh-Dzung Doan, Vu Minh Hieu Phan, Surabhi Gupta, Markus Wagner, Tat-Jun Chin, Ian Reid

TL;DR

TC-PDM tackles infrared-to-visible video translation by enforcing semantic structure and temporal coherence through semantic conditioning and a flow-guided temporal blending module. It introduces semantic conditioning via segmentation logits from a foundational model and uses dense optical-flow correspondences to steer denoising trajectories within a patch-based diffusion framework. Empirical results on KAIST and M3FD show substantial improvements in $FVD$ for I2V translation and in $AP_{50}$ for day-to-night detection, outperforming strong baselines including PDIR and T2V-DDPM. The approach yields more realistic, temporally consistent videos and improved downstream detection performance, with a public code release enabling reproducibility.

Abstract

Infrared imaging offers resilience against changing lighting conditions by capturing object temperatures. Yet, in a few scenarios, its lack of visual detail compared to daytime visible images poses a significant challenge for human and machine interpretation. This paper proposes a novel diffusion method, dubbed Temporally Consistent Patch Diffusion Models (TC-PDM), for infrared-to-visible video translation. Our method, extending the Patch Diffusion Model, consists of two key components. First, we propose semantic-guided denoising, which leverages the strong representations of foundational models. As such, our method faithfully preserves the semantic structure of the generated visible images. Second, we propose a novel temporal blending module to guide the denoising trajectory, ensuring temporal consistency between consecutive frames. Experiments show that TC-PDM outperforms state-of-the-art methods by 35.3% in FVD for infrared-to-visible video translation and by 6.1% in AP50 for day-to-night object detection. Our code is publicly available at https://github.com/dzungdoan6/tc-pdm

Paper Structure (33 sections, 7 equations, 9 figures, 3 tables, 1 algorithm)

Figures (9)

  • Figure 1: Day $\rightarrow$ night object detection: We address the challenge of adapting object detectors, pretrained on daytime visible images, to nighttime scenarios. Unlike the state-of-the-art patch diffusion method PDIR, our method effectively preserves object structure, yielding a significant performance improvement in nighttime object detection.
  • Figure 2: Benefits of semantic conditioning. (a) Actual nighttime visible image. (b) Existing patch diffusion methods introduce semantic deformation in generated visible images. (c) Our method effectively addresses this semantic deformation through a novel semantic conditioning strategy, enabling the generation of a visible image that faithfully preserves the scene's structure.
  • Figure 3: Overview of TC-PDM at timestep $t$: Our contribution consists of two components. (a) Semantic-guided denoising: By leveraging a foundational model, we estimate the segmentation logits $S_i$ for the infrared image $Y_i$, which inject additional semantic knowledge into the denoising process. This additional semantic condition ensures that the generated visible image faithfully reproduces the scene's structural information. (b) Temporal blending: We leverage a pretrained optical flow network to estimate the flow $F_{i-1 \, \rightarrow \, i}$ from consecutive infrared images, which serves as guidance for the direction of the denoising trajectory. This ensures that the generated frame remains temporally consistent with the preceding frame.
  • Figure 4: The neglect of temporal consistency in the patch diffusion method PDIR leads to structurally inconsistent objects across frames (highlighted). By contrast, our method uses a novel temporal blending module to maintain consistent object structures across frames, achieving temporal continuity.
  • Figure 5: Qualitative comparison of different translation methods on the KAIST (rows 1-2) and M3FD (rows 3-4) datasets. On KAIST, our TC-PDM generates visible images that are perceptually and structurally more similar to the actual visible images. Similarly, on M3FD, our generated "fake" visible images exhibit more details and preserve the structure of dynamic objects like cars. More qualitative results can be found in the supplementary material.
  • ...and 4 more figures
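
The temporal blending idea described in the Figure 3(b) caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names `forward_warp` and `temporal_blend`, the nearest-neighbor splatting, and the fixed blend weight `alpha` are all assumptions made for illustration. The paper applies its blending inside the denoising trajectory and obtains $F_{i-1 \, \rightarrow \, i}$ from a pretrained optical flow network, whereas the sketch below takes the flow as a given array and blends two image-like arrays once.

```python
import numpy as np


def forward_warp(prev, flow):
    """Warp `prev` (H, W, C) into the next frame using a forward flow (H, W, 2).

    flow[..., 0] is the horizontal displacement (dx) and flow[..., 1] the
    vertical displacement (dy), both in pixels. Hypothetical helper: uses
    nearest-neighbor splatting, so when several source pixels land on the
    same target, the last one written wins.
    """
    h, w = prev.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Target coordinates of every source pixel, rounded to the pixel grid.
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    inside = (tx >= 0) & (tx < w) & (ty >= 0) & (ty < h)

    warped = np.zeros_like(prev)
    valid = np.zeros((h, w), dtype=bool)
    warped[ty[inside], tx[inside]] = prev[ys[inside], xs[inside]]
    valid[ty[inside], tx[inside]] = True  # pixels that received a correspondence
    return warped, valid


def temporal_blend(current, prev, flow, alpha=0.5):
    """Blend the current estimate with the flow-warped previous frame.

    Assumed rule (not taken from the paper): a linear mix with weight `alpha`
    where a correspondence exists; elsewhere the current estimate is kept.
    """
    warped, valid = forward_warp(prev, flow)
    blended = current.astype(np.float64)  # astype returns a copy
    blended[valid] = alpha * warped[valid] + (1.0 - alpha) * current[valid]
    return blended


if __name__ == "__main__":
    prev = np.zeros((4, 4, 1))
    prev[1, 1, 0] = 1.0                 # one bright pixel in frame i-1
    flow = np.zeros((4, 4, 2))
    flow[..., 0] = 1.0                  # everything moves one pixel to the right
    current = np.full((4, 4, 1), 2.0)   # stand-in for the current frame estimate
    print(temporal_blend(current, prev, flow)[..., 0])
```

In this toy case the bright pixel moves from column 1 to column 2, and the leftmost column receives no correspondence, so it keeps the current estimate unchanged. Keeping unmatched pixels untouched is one simple way to handle regions that become newly visible between frames; other choices are possible.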