Table of Contents
Fetching ...

Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

Dian Yu, Qingchuan Zhou, Bingkun Huang, Majid Khadiv, Zewen Yang

TL;DR

Safe-Night VLA is proposed, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments and provides empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.

Abstract

Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.

Safe-Night VLA: Seeing the Unseen via Thermal-Perceptive Vision-Language-Action Models for Safety-Critical Manipulation

TL;DR

Safe-Night VLA is proposed, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments and provides empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.

Abstract

Current Vision-Language-Action (VLA) models rely primarily on RGB perception, preventing them from capturing modalities such as thermal signals that are imperceptible to conventional visual sensors. Moreover, end-to-end generative policies lack explicit safety constraints, making them fragile when encountering obstacles and novel scenarios outside the training distribution. To address these limitations, we propose Safe-Night VLA, a multimodal manipulation framework that enables robots to see the unseen while enforcing rigorous safety constraints for thermal-aware manipulation in unstructured environments. Specifically, Safe-Night VLA integrates long-wave infrared thermal perception into a pre-trained vision-language backbone, enabling semantic reasoning grounded in thermodynamic properties. To ensure safe execution under out-of-distribution conditions, we incorporate a safety filter via control barrier functions, which provide deterministic workspace constraint enforcement during policy execution. We validate our framework through real-world experiments on a Franka manipulator, introducing a novel evaluation paradigm featuring temperature-conditioned manipulation, subsurface target localization, and reflection disambiguation, while maintaining constrained execution at inference time. Results demonstrate that Safe-Night VLA outperforms RGB-only baselines and provide empirical evidence that foundation models can effectively leverage non-visible physical modalities for robust manipulation.
Paper Structure (23 sections, 1 equation, 6 figures, 2 tables)

This paper contains 23 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Multimodal perception comparison in downstream tasks. LWIR thermal observations (top row) are shown alongside RGB (middle row) and depth images (bottom row). Temperature-aware recognition: distinguishing thermally distinct yet visually indistinguishable objects for handling of hot vs. cold items; Subsurface localization: detecting targets occluded beneath granular media; Illusion rejection: suppressing mirror-reflection artifacts by leveraging the LWIR attenuation of common glass.
  • Figure 2: System Architecture of Safe-Night VLA.
  • Figure 3: Temperature-Conditioned manipulation.
  • Figure 4: Subsurface Localization.
  • Figure 5: Cross-modal disambiguation under mirror-induced ambiguity.
  • ...and 1 more figures