Inference-Time Scaling of Diffusion Models for Infrared Data Generation

Kai A. Horstmann, Maxim Clouser, Kia Khezeli

TL;DR

This work tackles the scarcity of annotated infrared data for diffusion-based image generation by shifting the burden from training to inference. It trains a domain-adapted CLIP verifier on infrared data and uses inference-time sampling guided by this verifier to produce higher-quality infrared images from a pretrained diffusion model. With FLUX.1-dev finetuned via LoRA on 1,000 infrared samples, the approach lowers FID on the KAIST dataset from 74.58 to 66.74 (about a 10% relative reduction) and demonstrates that random-search-based sampling can effectively navigate the noise latent space under limited compute. The study highlights the practicality of inference-time guidance for bridging the domain gap in low-data infrared settings and points to future directions including physics-informed verifiers and broader modality evaluation.
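The relative gain quoted above follows directly from the two FID values. A short check of the arithmetic, using only the numbers reported in the TL;DR (the variable names are illustrative):

```python
# FID values as reported in the TL;DR (lower is better).
baseline_fid = 74.58  # unguided baseline samples
guided_fid = 66.74    # verifier-guided samples

# Absolute and relative reduction with respect to the baseline.
absolute_reduction = baseline_fid - guided_fid
relative_reduction = absolute_reduction / baseline_fid

print(f"absolute: {absolute_reduction:.2f}, relative: {relative_reduction:.1%}")
# prints "absolute: 7.84, relative: 10.5%"
```

The reduction of 7.84 FID points is roughly 10.5% of the baseline, which matches the "about 10%" figure stated in the summary.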

Abstract

Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples. Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.
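The verifier-guided sampling described above can be pictured as a best-of-N random search over initial noise, the search strategy named in the TL;DR. The following is a minimal sketch under stated assumptions, not the authors' implementation: `generate` and `verify` are hypothetical stand-ins for the finetuned diffusion sampler and the CLIP-based verifier, and the toy functions at the bottom exist only to make the example runnable.

```python
import random
from typing import Callable, List, Tuple


def random_search(
    generate: Callable[[List[float]], object],
    verify: Callable[[object], float],
    num_candidates: int,
    noise_dim: int,
    seed: int = 0,
) -> Tuple[object, float]:
    """Draw `num_candidates` initial noises and keep the best-scoring sample.

    `generate` maps an initial noise vector to a sample (hypothetical stand-in
    for a diffusion sampler); `verify` maps a sample to a scalar score where
    higher is better (hypothetical stand-in for the verifier).
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best_sample, best_score = None, float("-inf")
    for _ in range(num_candidates):
        # Sample a fresh Gaussian starting noise for this candidate.
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        sample = generate(noise)
        score = verify(sample)
        # Keep only the highest-scoring candidate seen so far.
        if score > best_score:
            best_sample, best_score = sample, score
    return best_sample, best_score


# Toy stand-ins (assumptions for illustration only).
def toy_generate(noise: List[float]) -> List[float]:
    return noise  # identity "sampler"


def toy_verify(sample: List[float]) -> float:
    return -sum(x * x for x in sample)  # prefers samples with small norm


best, score = random_search(toy_generate, toy_verify, num_candidates=16, noise_dim=4)
single, single_score = random_search(toy_generate, toy_verify, num_candidates=1, noise_dim=4)
```

With a fixed seed, the first candidate is the same in both calls, so the 16-candidate search can never score lower than the single-sample run. In this sketch the extra inference compute goes entirely into more candidates, and the verifier only ranks finished samples rather than altering the denoising steps.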

Paper Structure

This paper contains 14 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Example verifier scores for a ground truth infrared image and two synthetic images (generated from different random seeds) corresponding to the caption, "A city street at dusk features tall buildings with illuminated signs, a marked road with directional arrows, and vehicles including a white SUV driving away from the camera." IRScore (ours), IR Similarity, and Grayscale Similarity are computed using our finetuned CLIP model, while IRScore (pretrained) relies on a pretrained CLIP model. IRScore, IR Similarity, and Grayscale Similarity correspond to Equation \ref{eq:ir-score}, and its unscaled first and second terms, respectively.
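
The caption describes IRScore as a combination of two scaled terms whose unscaled values are IR Similarity and Grayscale Similarity. The equation itself is not reproduced in this section, so the sketch below only illustrates that structure: the function name, the weights `w_ir` and `w_gray`, and their signs are placeholders assumed for illustration, not values from the paper.

```python
def combine_terms(
    ir_similarity: float,
    grayscale_similarity: float,
    w_ir: float,
    w_gray: float,
) -> float:
    """Combine two unscaled similarity terms into a single score.

    The weights are hypothetical scaling factors; a negative weight would
    turn the corresponding term into a penalty.
    """
    return w_ir * ir_similarity + w_gray * grayscale_similarity


# Illustrative inputs only; these are not scores from Figure 1.
example_score = combine_terms(0.5, 0.25, w_ir=2.0, w_gray=4.0)
```

Reporting the two unscaled terms alongside the combined score, as the figure does, makes it possible to see which term drives a difference between two candidate images.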