Table of Contents
Fetching ...

CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain

Jingchao Peng, Thomas Bashford-Rogers, Zhuang Shao, Haitao Zhao, Aru Ranjan Singh, Abhishek Goswami, Kurt Debattista

TL;DR

CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images, achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.

Abstract

Infrared (IR) imaging offers advantages in several fields due to its unique ability of capturing content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images but this causes a loss of fidelity in image details and introduces inconsistencies due to lack of contextual awareness of the scene. This stems from a combination of using visible light with a standard dynamic range, especially under extreme lighting, and a lack of contextual awareness can result in pseudo-thermal-crossover artifacts. This occurs when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.

CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain

TL;DR

CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images, achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.

Abstract

Infrared (IR) imaging offers advantages in several fields due to its unique ability of capturing content in extreme light conditions. However, the demanding hardware requirements of high-resolution IR sensors limit its widespread application. As an alternative, visible light can be used to synthesize IR images but this causes a loss of fidelity in image details and introduces inconsistencies due to lack of contextual awareness of the scene. This stems from a combination of using visible light with a standard dynamic range, especially under extreme lighting, and a lack of contextual awareness can result in pseudo-thermal-crossover artifacts. This occurs when multiple objects with similar temperatures appear indistinguishable in the training data, further exacerbating the loss of fidelity. To solve this challenge, this paper proposes CapHDR2IR, a novel framework incorporating vision-language models using high dynamic range (HDR) images as inputs to generate IR images. HDR images capture a wider range of luminance variations, ensuring reliable IR image generation in different light conditions. Additionally, a dense caption branch integrates semantic understanding, resulting in more meaningful and discernible IR outputs. Extensive experiments on the HDRT dataset show that the proposed CapHDR2IR achieves state-of-the-art performance compared with existing general domain transfer methods and those tailored for visible-to-infrared image translation.

Paper Structure

This paper contains 15 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Motivation of the proposed CapHDR2IR model.
  • Figure 2: The overall structure of the proposed CapHDR2IR. The process comprises two branches: a caption branch (upper part) and an image generation branch (lower part). The caption branch adopts a ResNet-101 backbone pre-trained on dense captioning to comprehend the image context. In the IR image generation branch, the encoder stage integrates features with contextual information through caption fusion modules (CF), while the decoder stage refines and upscales these features by up-sampling fusion modules (USF) to generate infrared images that retain both high dynamic range content and scene context.
  • Figure 3: Visualization results on the HDRT dataset.
  • Figure 4: Comparison of generation performance using SDR vs. HDR inputs.
  • Figure 5: Comparison of generation performance with or without the caption branch.