See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

Kebin Contreras, Luis Toscano-Palomino, Mauro Dalla Mura, Jorge Bacca

TL;DR

This paper tackles the challenge of reconstructing recent past scenes from current observations by leveraging fading thermal traces as passive temporal codes alongside RGB context. It introduces a framework that couples Visual Language Models with a constrained diffusion model, where a frozen VLM generates semantic scene descriptions conditioned on RGB and thermal inputs, guiding a pretrained diffusion backbone to synthesize plausible past frames without retraining. The approach is validated in controlled scenarios, showing that semantic priors and thermal cues improve both low-level fidelity and high-level semantics, with reconstructions credible up to about 1–2 minutes in the past. This work suggests a promising direction for time-reversed imaging with potential applications in forensics, scene analysis, and security, while outlining future work to handle real-world variability and multi-subject dynamics.

Abstract

Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (about 37 °C, or 98.6 °F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.
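
To make the idea of a fading trace as a "passive temporal code" concrete, the sketch below inverts Newton's law of cooling to estimate how long ago a contact ended. This is an illustrative assumption, not the paper's method: the function name elapsed_seconds, the exponential decay model, and every numeric value (ambient temperature, initial surface temperature, time constant tau) are hypothetical and would need calibration for a real material.

```python
import math


def elapsed_seconds(t_now, t_ambient, t_initial, tau):
    """Estimate the time since contact ended, assuming Newton's law of cooling.

    Assumed model (not from the paper):
        T(t) = t_ambient + (t_initial - t_ambient) * exp(-t / tau)
    Solving for t gives
        t = tau * ln((t_initial - t_ambient) / (T(t) - t_ambient)).
    Temperatures are in degrees Celsius and tau is in seconds.
    """
    excess_now = t_now - t_ambient
    excess_initial = t_initial - t_ambient
    if excess_now <= 0 or excess_initial <= 0 or excess_now > excess_initial:
        raise ValueError("trace must be warmer than ambient and cooler than at contact")
    return tau * math.log(excess_initial / excess_now)


# Hypothetical values: a seat at 30 C right after contact, a 22 C room,
# and a 90 s time constant for the seat material.
T_AMBIENT, T_INITIAL, TAU = 22.0, 30.0, 90.0
observed = T_AMBIENT + (T_INITIAL - T_AMBIENT) * math.exp(-60.0 / TAU)
estimate = elapsed_seconds(observed, T_AMBIENT, T_INITIAL, TAU)
print(round(estimate, 3))
```

Under this model the excess temperature shrinks exponentially, so the trace carries less timing information the longer it has faded, which is consistent with reconstructions being limited to a short window into the past.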

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Illustrative example of residual thermal traces. (Left) RGB and thermal images captured at the present frame. While the RGB view shows only the current visual appearance, the thermal image reveals heat imprints on the chair, acting as passive temporal codes and suggesting a prior human action. (Right) The ground-truth frame of the interaction that occurred approximately two minutes earlier.
  • Figure 2: Ablation study comparing RGB only, RGB+Thermal, and RGB+Thermal+Descriptor against ground truth when reconstructing the scene from 30 seconds earlier, evaluated using low-level metrics (PSNR, SSIM) and high-level metrics (semantic segmentation with OA, pose estimation with MPJPE).
  • Figure 3: Input and ground-truth data for the supplementary material experiments. Each test scene includes the thermal image (left) capturing residual heat traces, the corresponding RGB image (middle), and the ground-truth action (right). Results are illustrated for Scene 1 (Test 1) and Scene 2 (Test 2).
  • Figure 4: Examples of past-action descriptions generated under different prompt descriptors $p_{\text{desc}}^{1}$--$p_{\text{desc}}^{4}$. The prompts progressively add reasoning constraints, from a simple direct description to structured analyses that incorporate thermal traces, RGB context, and completeness checks. More structured prompts yield descriptions that are more specific and consistent with the evidence.
  • Figure 5: Examples of past-action image predictions generated under different prompt descriptors $p_{\text{edit}}^{1}$--$p_{\text{edit}}^{4}$. Each prompt introduces progressively stricter editing constraints, from preserving the original environment with minimal changes to enforcing temporal consistency and position replacement. In all cases, the underlying action description used in the prompts was the same: "the person was sitting and holding the book". The structured prompts guide the editing model toward outputs that are more coherent with the thermal trace, scene context, and past actions, demonstrating the impact of prompt design on editing quality and consistency.
  • ...and 3 more figures
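
Three of the metrics named in the Figure 2 caption (PSNR, OA, and MPJPE) have short standard definitions, sketched below in plain Python. This is a minimal illustration of the textbook formulas, not the paper's evaluation code: the function names are hypothetical, images and label maps are assumed to be flattened into equal-length sequences, and the paper's exact protocol (for example, any pose alignment before MPJPE) is not stated here. SSIM is omitted because it needs windowed local statistics.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of pixels whose predicted segmentation label is correct."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def mpjpe(true_joints, predicted_joints):
    """Mean per-joint position error: average Euclidean distance per joint."""
    if len(true_joints) != len(predicted_joints) or not true_joints:
        raise ValueError("inputs must be non-empty and the same length")
    distances = [math.dist(t, p) for t, p in zip(true_joints, predicted_joints)]
    return sum(distances) / len(distances)


# Toy inputs chosen only to exercise the functions.
print(psnr([0, 0], [255, 255]))
print(overall_accuracy([0, 1, 1, 2], [0, 1, 2, 2]))
print(mpjpe([(0, 0), (0, 0)], [(3, 4), (0, 0)]))
```

Higher PSNR and OA indicate a closer match to the ground-truth past frame, while lower MPJPE indicates a more accurate reconstructed pose.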