DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving

Sven Kirchner, Nils Purschke, Ross Greer, Alois C. Knoll

TL;DR

DepthVision addresses the vulnerability of vision–language models to degraded RGB input in autonomous driving by converting sparse LiDAR data into dense, RGB-like imagery via a conditional GAN with a lightweight refiner. It then adaptively fuses synthesized and real RGB frames using LAMA, guided by scene luminance, and feeds the result to frozen VLMs, enabling robust multimodal reasoning without retraining. Across CARLA and nuScenes, including vehicle-in-the-loop tests, DepthVision yields notable gains in night-time perception and visual question answering while maintaining compatibility with existing VLM architectures. The approach demonstrates a practical pathway for integrating range sensing into vision–language systems, extending the operational envelope of autonomous perception under challenging illumination and degradation conditions.

Abstract

Ensuring reliable autonomous operation when visual input is degraded remains a key challenge in intelligent vehicles and robotics. We present DepthVision, a multimodal framework that enables Vision–Language Models (VLMs) to exploit LiDAR data without any architectural changes or retraining. DepthVision synthesizes dense, RGB-like images from sparse LiDAR point clouds using a conditional GAN with an integrated refiner, and feeds these into off-the-shelf VLMs through their standard visual interface. A Luminance-Aware Modality Adaptation (LAMA) module fuses synthesized and real camera images by dynamically weighting each modality based on ambient lighting, compensating for degradation such as darkness or motion blur. This design turns LiDAR into a drop-in visual surrogate when RGB becomes unreliable, effectively extending the operational envelope of existing VLMs. We evaluate DepthVision on real and simulated datasets across multiple VLMs and safety-critical tasks, including vehicle-in-the-loop experiments. The results show substantial improvements in low-light scene understanding over RGB-only baselines while preserving full compatibility with frozen VLM architectures. These findings demonstrate that LiDAR-guided RGB synthesis is a practical pathway for integrating range sensing into modern vision–language systems for autonomous driving.
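
To make the luminance-based weighting behind LAMA concrete, the sketch below blends a real RGB frame with a synthesized one, once with a single global weight and once with a per-pixel weight map (the two strategies named in the Figure 4 caption). It is a minimal illustration, not the authors' implementation: the Rec. 601 luminance coefficients, the linear ramp in `weight_from_luminance`, and the thresholds `low` and `high` are assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

# Rec. 601 luma coefficients (an assumed choice of luminance measure).
REC601 = np.array([0.299, 0.587, 0.114])


def luminance(rgb):
    """Per-pixel luminance of an HxWx3 float image with values in [0, 1]."""
    return rgb @ REC601


def weight_from_luminance(lum, low=0.2, high=0.5):
    """Hypothetical ramp: weight 0 at or below `low`, 1 at or above `high`.

    The weight is the share given to the real RGB input; the synthesized
    image receives the remaining share.
    """
    return np.clip((lum - low) / (high - low), 0.0, 1.0)


def fuse_full(rgb, synth, low=0.2, high=0.5):
    """Global fusion: one weight derived from the mean scene luminance."""
    w = weight_from_luminance(luminance(rgb).mean(), low, high)
    return w * rgb + (1.0 - w) * synth


def fuse_pixelwise(rgb, synth, low=0.2, high=0.5):
    """Pixelwise fusion: a spatially varying weight map from local luminance."""
    w = weight_from_luminance(luminance(rgb), low, high)[..., None]
    return w * rgb + (1.0 - w) * synth
```

With these assumed thresholds, a fully dark frame yields the synthesized image unchanged, a well-exposed frame yields the camera image unchanged, and the pixelwise variant leans on the synthesized image only in the dark regions of a partially lit frame.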

Paper Structure

This paper contains 20 sections, 12 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: DepthVision architecture with four components: LiDAR processing (blue), RGB preprocessing (orange), luminance-aware fusion via LAMA (yellow), and the Vision–Language Model (red). LiDAR inputs are converted into RGB-like images through a GAN–refiner pipeline, and the VLM performs action or scene understanding. The system adaptively selects or blends modalities based on scene brightness to provide a robust visual input to the VLM.
  • Figure 2: LiDAR-to-camera projection geometry. A 3D point measured in the LiDAR frame $(x_L,y_L,z_L)$ is transformed into the camera frame $(x_C,y_C,z_C)$ using the extrinsics $\mathbf{T}_{C\leftarrow L}$ and then projected to pixel coordinates $p(u,v)$ by the intrinsics $\mathbf{K}$.
  • Figure 3: LiDAR-to-RGB synthesis pipeline in DepthVision. The 3D point cloud is projected and interpolated into a 2D depth map, which is translated to an RGB image by a U-Net GAN and iteratively refined by a lightweight refiner network.
  • Figure 4: Overview of the two luminance-aware fusion strategies used for combining RGB and LiDAR modalities. (a) full fusion applies a single global weighting factor to both inputs, determined by the mean scene luminance of the RGB input. (b) pixelwise fusion computes a spatially varying weight map based on per-pixel luminance, allowing the model to blend information locally and adaptively. This enables fine-grained emphasis on LiDAR in poorly lit regions while still leveraging RGB information in well-exposed areas.
  • Figure 5: Vehicle-in-the-loop setup for integration testing. The vehicle computer receives synthetic sensor data from a simulation hosting a digital twin in a virtual environment and returns vehicle commands via the AD (autonomous driving) layer, middleware, and a redundant CAN interface. These commands actuate the vehicle, which operates on a physical testbench through a mechanical interface and motion controller. The measured vehicle movement is fed back to the simulation, thus closing the loop.
  • ...and 6 more figures
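
The projection geometry described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under the standard pinhole camera model, not the authors' implementation: the function name `project_lidar_to_pixels`, the 4x4 homogeneous form of $\mathbf{T}_{C\leftarrow L}$, and the example intrinsics are assumptions made for the example.

```python
import numpy as np


def project_lidar_to_pixels(points_lidar, T_cam_from_lidar, K):
    """Project Nx3 LiDAR points into the image plane.

    points_lidar:     (N, 3) array of (x_L, y_L, z_L).
    T_cam_from_lidar: (4, 4) homogeneous extrinsics (assumed form of T_{C<-L}).
    K:                (3, 3) camera intrinsics.
    Returns (uv, depth): pixel coordinates (M, 2) and camera-frame depth (M,)
    for the M points that lie in front of the camera.
    """
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])   # (N, 4)
    pts_cam = (T_cam_from_lidar @ homog.T).T[:, :3]      # (x_C, y_C, z_C)

    # Points behind the camera cannot be projected meaningfully.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    pix = (K @ pts_cam.T).T                               # homogeneous pixels
    uv = pix[:, :2] / pix[:, 2:3]                         # divide by depth
    return uv, pts_cam[:, 2]


# Example with assumed values: identity extrinsics, so the LiDAR frame is
# treated as already aligned with the camera frame (z pointing forward).
K_example = np.array([[500.0, 0.0, 320.0],
                      [0.0, 500.0, 240.0],
                      [0.0, 0.0, 1.0]])
points_example = np.array([[0.0, 0.0, 10.0],    # on the optical axis
                           [1.0, 0.0, 10.0],    # 1 m to the right
                           [0.0, 0.0, -5.0]])   # behind the camera, dropped
uv_example, depth_example = project_lidar_to_pixels(
    points_example, np.eye(4), K_example)
```

The returned depths at the projected pixel locations form the sparse depth image that, per the Figure 3 caption, is interpolated into a dense 2D depth map before the GAN translates it to RGB.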