Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

Dian Yu; Qingchuan Zhou; Bingkun Huang; Majid Khadiv; Zewen Yang

TL;DR

Safe-Night VLA is a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. The work also provides empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.

Abstract

Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
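
To illustrate the kind of safety filter the abstract describes, the following is a minimal sketch of a control barrier function (CBF) filter that keeps an end-effector inside a box-shaped workspace. It is not the authors' formulation, which is not given in this summary. It assumes single-integrator dynamics (the commanded velocity is the state derivative), an axis-aligned box workspace, and a linear class-K function with gain `alpha`; the function name, parameters, and numeric values are all hypothetical. Under these assumptions the usual CBF quadratic program separates per axis, so its solution reduces to clipping the nominal policy velocity.

```python
import numpy as np


def cbf_workspace_filter(x, u_nom, x_min, x_max, alpha=5.0):
    """Minimally modify u_nom so the position x stays inside [x_min, x_max].

    Assumed model (not from the paper): x_dot = u, one barrier per box face,
    h_lo = x - x_min and h_hi = x_max - x, with the condition h_dot >= -alpha * h.
    """
    x = np.asarray(x, dtype=float)
    u_nom = np.asarray(u_nom, dtype=float)
    x_min = np.asarray(x_min, dtype=float)
    x_max = np.asarray(x_max, dtype=float)

    # h_lo_dot = u  >= -alpha * (x - x_min)  gives a lower bound on u.
    lower = -alpha * (x - x_min)
    # h_hi_dot = -u >= -alpha * (x_max - x)  gives an upper bound on u.
    upper = alpha * (x_max - x)

    # min ||u - u_nom||^2 subject to lower <= u <= upper is solved by clipping.
    return np.clip(u_nom, lower, upper)


# Hypothetical workspace limits in metres and a policy command in m/s.
x_min = [0.3, -0.3, 0.05]
x_max = [0.7, 0.3, 0.30]
x = np.array([0.5, 0.0, 0.29])       # 1 cm below the upper z limit
u_nom = np.array([0.1, 0.0, 0.5])    # the policy asks to move up quickly

u_safe = cbf_workspace_filter(x, u_nom, x_min, x_max)

# Forward-Euler rollout of a policy that keeps pushing upward.
dt = 0.01
x_sim = x.copy()
for _ in range(200):
    push_up = np.array([0.0, 0.0, 1.0])
    x_sim = x_sim + dt * cbf_workspace_filter(x_sim, push_up, x_min, x_max)
```

In this sketch the filter leaves the x and y commands unchanged and slows only the z command as the upper limit approaches, which is the intended behaviour of a minimally invasive filter: the policy output is altered only when it would violate a constraint. In the discrete rollout the bound is preserved as long as `alpha * dt <= 1`.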

Paper Structure (23 sections, 1 equation, 6 figures, 2 tables)

INTRODUCTION
RELATED WORK
METHODOLOGY
System Architecture and Adaptation Strategy
Multimodal Input Processing
Safety Guarantee
EXPERIMENTS
Experimental Setup
Dual-Arm Teleoperation Platform
Sensor Setup and Input Processing
Data Collection
Training Configuration
Inference and Hardware Deployment
Scenario 1: Temperature-Conditioned Manipulation
Scenario 2: Subsurface Localization
...and 8 more sections

Figures (6)

Figure 1: Multimodal perception comparison in downstream tasks. LWIR thermal observations (top row) are shown alongside RGB (middle row) and depth images (bottom row). Temperature-aware recognition: distinguishing thermally distinct yet visually indistinguishable objects for handling of hot vs. cold items; Subsurface localization: detecting targets occluded beneath granular media; Illusion rejection: suppressing mirror-reflection artifacts by leveraging the LWIR attenuation of common glass.
Figure 2: System Architecture of Safe-Night VLA.
Figure 3: Temperature-Conditioned manipulation.
Figure 4: Subsurface Localization.
Figure 5: Cross-modal disambiguation under mirror-induced ambiguity.
...and 1 more figure
