Inference-Time Scaling of Diffusion Models for Infrared Data Generation
Kai A. Horstmann, Maxim Clouser, Kia Khezeli
TL;DR
This work tackles the scarcity of annotated infrared data for diffusion-based image generation by shifting the burden from training to inference. It trains a domain-adapted CLIP verifier on infrared data and uses inference-time sampling guided by this verifier to produce higher-quality infrared images from a pretrained diffusion model. With FLUX.1-dev finetuned via LoRA on 1,000 infrared samples, the approach lowers FID on the KAIST dataset from 74.58 to 66.74 (about a 10% relative reduction) and demonstrates that random-search-based sampling can effectively navigate the noise latent space under limited compute. The study highlights the practicality of inference-time guidance for bridging the domain gap in low-data infrared settings and points to future directions including physics-informed verifiers and broader modality evaluation.
Abstract
Infrared imagery enables temperature-based scene understanding using passive sensors, particularly under conditions of low visibility where traditional RGB imaging fails. Yet, developing downstream vision models for infrared applications is hindered by the scarcity of high-quality annotated data, due to the specialized expertise required for infrared annotation. While synthetic infrared image generation has the potential to accelerate model development by providing large-scale, diverse training data, training foundation-level generative diffusion models in the infrared domain has remained elusive due to limited datasets. In light of such data constraints, we explore an inference-time scaling approach using a domain-adapted CLIP-based verifier for enhanced infrared image generation quality. We adapt FLUX.1-dev, a state-of-the-art text-to-image diffusion model, to the infrared domain by finetuning it on a small sample of infrared images using parameter-efficient techniques. The trained verifier is then employed during inference to guide the diffusion sampling process toward higher quality infrared generations that better align with input text prompts. Empirically, we find that our approach leads to consistent improvements in generation quality, reducing FID scores on the KAIST Multispectral Pedestrian Detection Benchmark dataset by 10% compared to unguided baseline samples. Our results suggest that inference-time guidance offers a promising direction for bridging the domain gap in low-data infrared settings.
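The verifier-guided sampling described above can be illustrated with a minimal best-of-N random search over initial noise latents. This is a sketch under stated assumptions, not the authors' implementation: `random_search`, `toy_generate`, and `toy_score` are hypothetical names, the toy generator stands in for the finetuned diffusion sampler, and the toy scorer stands in for the CLIP-based verifier. The actual search procedure and scoring function used in the paper may differ.

```python
import random
from typing import Callable, List, Sequence, Tuple

Latent = List[float]


def random_search(
    generate: Callable[[Latent], Sequence[float]],
    score: Callable[[Sequence[float]], float],
    num_candidates: int,
    latent_dim: int,
    seed: int = 0,
) -> Tuple[Latent, float]:
    """Draw Gaussian noise latents, generate from each, keep the best-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    best_latent: Latent = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        # Sample a fresh initial noise latent from a standard normal.
        latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # Hypothetical stand-in for running the diffusion sampler from this latent.
        image = generate(latent)
        # Hypothetical stand-in for the verifier's image-text alignment score.
        candidate_score = score(image)
        if candidate_score > best_score:
            best_latent, best_score = latent, candidate_score
    return best_latent, best_score


def toy_generate(latent: Latent) -> List[float]:
    """Toy 'sampler': a deterministic map from latent to 'image' (assumption)."""
    return [0.5 * x for x in latent]


def toy_score(image: Sequence[float]) -> float:
    """Toy 'verifier': higher is better, peaks at the all-zero image (assumption)."""
    return -sum(v * v for v in image)


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, s = random_search(toy_generate, toy_score, n, latent_dim=8, seed=0)
        print(f"candidates={n:2d}  best verifier score={s:.4f}")
```

Because the search only calls the generator and the verifier, the compute budget is controlled by `num_candidates` alone: with a fixed seed, a larger budget can never return a lower best score, since the additional candidates extend the same sequence of latents.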
