TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation

Dong-Guw Lee, Tai Hyoung Rhee, Hyunsoo Jang, Young-Sik Shin, Ukcheol Shin, Ayoung Kim

TL;DR

TherA is a controllable RGB-to-TIR translation framework that produces diverse, thermally plausible images at both the scene and object level. It achieves state-of-the-art translation performance, improving zero-shot translation by up to 33%, averaged across all metrics.

Abstract

Despite the inherent advantages of thermal infrared (TIR) imaging, large-scale data collection and annotation remain a major bottleneck for TIR-based perception. A practical alternative is to synthesize pseudo-TIR data via image translation; however, most RGB-to-TIR approaches rely heavily on RGB-centric priors that overlook thermal physics, yielding implausible heat distributions. In this paper, we introduce TherA, a controllable RGB-to-TIR translation framework that produces diverse and thermally plausible images at both the scene and object level. TherA couples TherA-VLM with a latent-diffusion-based translator. Given a single RGB image and a user-prompted condition pair, TherA-VLM yields a thermal-aware embedding that encodes scene, object, material, and heat-emission context reflecting the input scene-condition pair. Conditioning the diffusion model on this embedding enables realistic TIR synthesis and fine-grained control across time of day, weather, and object state. Compared to other baselines, TherA achieves state-of-the-art translation performance, improving zero-shot translation by up to 33%, averaged across all metrics.

Paper Structure (59 sections, 8 equations, 16 figures, 7 tables)

Figures (16)

  • Figure 1: Overview of TherA. (a) Compared to InstructPix2Pix [brooks2023instructpix2pix], our thermal-aware VLM distinguishes between active (heat-emitting car) and passive (parked car) objects, as shown in the translated image from the Waymo Open Dataset [sun2020waymo]. (b) TherA is the first RGB-to-TIR translation model offering both text-guided and reference-image-guided translation. RGB images from M3FD [m3fd] are used for reference.
  • Figure 2: Overview of the TherA framework. Our model consists of two stages. (a) TherA-VLM analyzes the input/reference RGB and the input prompt to produce a physically grounded thermal embedding ($\mathbf{h}_N$). (b) This thermal embedding is injected via the TE Adapter ($\phi$) into the cross-attention layers of a UNet. The UNet, also conditioned on the RGB latent ($z_{\mathrm{rgb}}$), guides the denoising of the noisy TIR latent ($z_t$) to generate a physically plausible TIR image. The input prompt and reference RGB support controllability.
  • Figure 3: Comparison of textual conditioning. Standard LLaVA produces appearance-based, RGB-centric descriptions (c). Our TherA-VLM outputs concise, structured schemas encoding scene type, materials, and heat-emission states (d).
  • Figure 4: Qualitative results on zero-shot image translation. No TIR ground truth exists for RGB-only datasets.
  • Figure 5: Qualitative results for scene-level and object-level controllability on M3FD and FLIR. Yellow boxes show zoomed views.
  • ...and 11 more figures
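
The conditioning path described in the Figure 2 caption (a thermal embedding passed through an adapter and injected into UNet cross-attention) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the single linear `te_adapter`, the treatment of $\mathbf{h}_N$ as a short sequence of tokens, all dimensions, and the random weights are assumptions chosen only to show the data flow. The RGB-latent conditioning ($z_{\mathrm{rgb}}$) is omitted because the summary does not say how it is combined with $z_t$.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def te_adapter(h, w_adapter):
    """Hypothetical adapter: one linear map from VLM width to context width."""
    return matmul(h, w_adapter)


def cross_attention(latent_tokens, context, w_q, w_k, w_v):
    """Latent tokens (queries) attend over the adapted embedding (keys/values)."""
    q = matmul(latent_tokens, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
n_latent, d_latent = 6, 8        # assumed: flattened positions of the noisy TIR latent
n_ctx, d_vlm, d_ctx = 4, 12, 8   # assumed: embedding tokens, VLM width, context width
d_attn = 8                       # assumed attention width

z_t = random_matrix(n_latent, d_latent, rng)   # stand-in for the noisy TIR latent
h_n = random_matrix(n_ctx, d_vlm, rng)         # stand-in for the thermal embedding
context = te_adapter(h_n, random_matrix(d_vlm, d_ctx, rng))

out, weights = cross_attention(
    z_t,
    context,
    random_matrix(d_latent, d_attn, rng),
    random_matrix(d_ctx, d_attn, rng),
    random_matrix(d_ctx, d_attn, rng),
)
print(len(out), len(out[0]), len(weights[0]))
```

Each latent position produces one attention distribution over the embedding tokens, so a different prompt or reference image changes the context tokens and thereby shifts what every spatial location attends to.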