AnyThermal: Towards Learning Universal Representations for Thermal Perception

Parv Maheshwari; Jay Karhade; Yogesh Chawla; Isaiah Adu; Florian Heisen; Andrew Porco; Andrew Jong; Yifei Liu; Santosh Pitla; Sebastian Scherer; Wenshan Wang

TL;DR

This work tackles the scarcity and limited diversity of thermal data by introducing AnyThermal, a task-agnostic thermal encoder distilled from RGB foundation models. By applying RGB-to-thermal knowledge distillation to data from multiple environments, AnyThermal achieves state-of-the-art results on thermal segmentation, cross-modal place recognition, and monocular thermal depth estimation, without task-specific fine-tuning. The authors also present the TartanRGBT platform and dataset, which systematically expand multi-domain RGB-T data and are open to community contribution. The results indicate that diverse training data is key to robust generalization, with improvements of up to 36% across diverse environments and tasks.

Abstract

We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset - a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.
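The distillation idea in the abstract can be illustrated with a minimal sketch. It assumes paired RGB and thermal images have already been encoded into feature vectors, and it uses mean cosine distance as the feature-matching loss. The loss choice and the function names `cosine_distance` and `distillation_loss` are illustrative assumptions, not the objective or code used by the authors.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distillation_loss(teacher_rgb_feats, student_thermal_feats):
    """Mean cosine distance over paired teacher (RGB) and student (thermal) features.

    Assumption: the teacher features come from a frozen RGB encoder and are
    treated as constants; only the student would receive gradients.
    """
    pairs = list(zip(teacher_rgb_feats, student_thermal_feats))
    return sum(cosine_distance(t, s) for t, s in pairs) / len(pairs)

# Toy features: two paired samples with 2-D features (hypothetical values).
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned_student = [[2.0, 0.0], [0.0, 1.0]]      # same directions as the teacher
misaligned_student = [[0.0, 1.0], [1.0, 0.0]]   # orthogonal to the teacher

loss_aligned = distillation_loss(teacher, aligned_student)
loss_misaligned = distillation_loss(teacher, misaligned_student)
```

In this toy setting the loss depends only on feature direction, so a student whose thermal features point the same way as the teacher's RGB features scores lower than one whose features are orthogonal. No labels are involved, which is what lets such an objective scale with additional paired RGB-T data.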

Paper Structure (33 sections, 7 figures, 4 tables)

Introduction
Related Works
Thermal Images for Robot Perception
Multi-modal Foundation Models
RGB-T Datasets
AnyThermal: Thermal Feature-Extraction Backbone
Overview
Knowledge Distillation
Task-Specific Head and Training
Cross-Modal Place Recognition
Thermal Segmentation
Mono-Thermal Depth Estimation
TartanRGBT Platform
CAD design and 3D printing
Time Syncing
...and 18 more sections

Figures (7)

Figure 1: We perform knowledge distillation between a frozen DINOv2 and a trainable DINOv2 network (AnyThermal), both initialized with pre-trained DINOv2 weights. The frozen network serves as the teacher, while the trainable AnyThermal backbone learns from it. Pre-trained initialization enables AnyThermal to generalize across environments, and distillation on thermal images allows it to extract meaningful thermal features. Training is task-agnostic, using self-supervised losses between thermal features from AnyThermal and RGB features from the frozen teacher. This approach requires no labels and scales naturally with increasing RGB-T datasets.
Figure 2: Left: CAD model of the TartanRGBT system with half of the camera's and payload's casing hidden. Numbered components: (1) ZED X stereo camera; (2) Teledyne FLIR Boson 640 × 512, 4.9 mm, 95° HFoV, short-lens Shutterless LWIR thermal camera; (3) 5 V, 30 mm blower fan; (4) Wi-Fi antennae; (5) copper heat sinks (surrounding the thermal camera body); (6) NVIDIA Jetson AGX Orin Developer Kit, 64 GB; (7) Makita 18 V LXT® lithium-ion 4.0 Ah battery with adapter; (8) power switch; (9) recording button. Right: Overview of the connections between components, showing power (orange), sensor data transfer (green), and signal transfer (pink), which covers time synchronization and the recording button trigger.
Figure 3: Thermal checkerboard calibration image before (left) and after (right) fisheye rectification.
Figure 4: RGB–Thermal Registration in the TartanRGBT dataset: alpha-blended overlays for indoor, off-road, and urban domains with blending factors $\alpha \in \{0.00, 0.50, 1.00\}$. Due to sensor geometry (thermal mounted below RGB), the thermal view includes more of the lower scene, resulting in additional pixels at the bottom of the thermal images that are not present in the RGB images, producing black regions where RGB pixels are absent.
Figure 5: Cross-Modal VPR on OdomBeyondVision. Top: PaCMAP representations show that SALAD aligns RGB–Thermal embeddings poorly (the two modalities lie far apart), while AnyThermal-VPR aligns them well in a shared representation space. Bottom: Example queries where SALAD fails to retrieve the correct RGB match, but AnyThermal-VPR succeeds, with key clues circled.
...and 2 more figures
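The alpha-blended overlays described in the Figure 4 caption can be sketched per pixel as follows. This is an illustrative sketch only: it assumes that $\alpha$ weights the thermal image (so $\alpha = 0$ shows only RGB and $\alpha = 1$ shows only thermal), that both pixels are 3-channel 8-bit values, and that a missing RGB pixel is represented by `None` and rendered as black. The function name `alpha_blend` is hypothetical.

```python
def alpha_blend(rgb_px, thermal_px, alpha):
    """Blend one registered RGB pixel with one thermal pixel.

    Assumptions: alpha weights the thermal pixel; rgb_px is None where the
    RGB image has no coverage, and is then treated as black.
    """
    if rgb_px is None:
        rgb_px = (0, 0, 0)
    return tuple(
        round(alpha * t + (1.0 - alpha) * r)
        for r, t in zip(rgb_px, thermal_px)
    )

rgb = (100, 50, 0)          # hypothetical RGB pixel
thermal = (200, 150, 100)   # hypothetical thermal pixel rendered as 3 channels

rgb_only = alpha_blend(rgb, thermal, 0.0)
half = alpha_blend(rgb, thermal, 0.5)
thermal_only = alpha_blend(rgb, thermal, 1.0)
no_rgb = alpha_blend(None, thermal, 0.0)   # region visible only to the thermal camera
```

Under these assumptions, a pixel that only the thermal camera sees comes out black at $\alpha = 0$, matching the black regions the caption describes at the bottom of the overlays.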
