Wild-Drive: Off-Road Scene Captioning and Path Planning via Robust Multi-modal Routing and Efficient Large Language Model

Zihang Wang; Xu Li; Benwu Wang; Wenkai Zhu; Xieyuanli Chen; Dong Kong; Kailin Lyu; Yinan Du; Yiming Peng; Haoyang Che

TL;DR

Wild-Drive is a unified framework for off-road scene captioning and path planning that outperforms prior LLM-based methods and remains more stable under degraded sensing. The paper also introduces the OR-C2P Benchmark, which covers structured off-road scene captioning and path planning under diverse sensor corruption conditions.

Abstract

Explainability and transparent decision-making are essential for the safe deployment of autonomous driving systems. Scene captioning summarizes environmental conditions and risk factors in natural language, improving transparency, safety, and human-robot interaction. However, most existing approaches target structured urban scenarios; in off-road environments, they are vulnerable to single-modality degradations caused by rain, fog, snow, and darkness, and they lack a unified framework that jointly models structured scene captioning and path planning. To bridge this gap, we propose Wild-Drive, an efficient framework for off-road scene captioning and path planning. Wild-Drive adopts modern multimodal encoders and introduces a task-conditioned modality-routing bridge, MoRo-Former, to adaptively aggregate reliable information under degraded sensing. It then integrates an efficient large language model (LLM), together with a planning token and a gated recurrent unit (GRU) decoder, to generate structured captions and predict future trajectories. We also build the OR-C2P Benchmark, which covers structured off-road scene captioning and path planning under diverse sensor corruption conditions. Experiments on the OR-C2P dataset and a self-collected dataset show that Wild-Drive outperforms prior LLM-based methods and remains more stable under degraded sensing. The code and benchmark will be publicly available at https://github.com/wangzihanggg/Wild-Drive.

Paper Structure (21 sections, 10 equations, 5 figures, 4 tables)

INTRODUCTION
RELATED WORK
Off-Road Autonomous Driving
Scene Caption for Autonomous Driving
PROPOSED METHOD
Overall Pipeline
Modality Routing Transformer with Grouped Queries
Large Language Model and Path Planning
Loss Functions
The OR-C2P Benchmark
LLM Q&A Generation
Path Planning
Experiments
Datasets and Metrics
Implementation Details
...and 6 more sections

Figures (5)

Figure 1: Wild-Drive takes camera and LiDAR data as a unified input for structured off-road scene captioning and path planning, and adapts to sensor corruption.
Figure 2: The overview of our proposed Wild-Drive. It fuses camera–LiDAR features into multimodal tokens and uses MoRo-Former for query-conditioned routing and token compression. An LLM generates scene captions and planning tokens, which a GRU-based planner decodes into multimodal trajectories.
Figure 3: The MoRo-Former module. It performs locality-aware masked attention to predict modality probabilities and hard-route task queries to LiDAR, camera, or fusion experts. The routed features are aggregated and compressed into compact task tokens for the LLM.
Figure 4: Experimental setup and data collection route map of the self-collected dataset.
Figure 5: Quantitative analysis of Wild-Drive for scene captioning and path planning on the OR-C2P dataset. In the path planning visualizations, red indicates the model-predicted trajectory and green indicates the ground truth trajectory.
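
The hard-routing step described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each task query already has three modality logits (LiDAR, camera, fusion) and that each expert has already produced one feature vector per query. The locality-aware masked attention and the token compression are not reproduced, and every name below (`hard_route`, `routed_features`, the example values) is hypothetical.

```python
import math

# Assumed expert order; the paper names LiDAR, camera, and fusion experts.
MODALITIES = ("lidar", "camera", "fusion")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_route(query_logits):
    """For each task query, return (selected expert name, modality probabilities)."""
    routes = []
    for logits in query_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax = hard routing
        routes.append((MODALITIES[best], probs))
    return routes


def routed_features(query_logits, expert_outputs):
    """Pick, per query, the feature vector produced by its selected expert."""
    routes = hard_route(query_logits)
    return [expert_outputs[name][i] for i, (name, _) in enumerate(routes)]


# Hypothetical inputs: two task queries, three experts, 2-D features.
query_logits = [[0.2, 2.5, 0.1], [3.0, -1.0, 0.5]]
expert_outputs = {
    "lidar": [[1.0, 1.0], [2.0, 2.0]],
    "camera": [[3.0, 3.0], [4.0, 4.0]],
    "fusion": [[5.0, 5.0], [6.0, 6.0]],
}
selected = routed_features(query_logits, expert_outputs)
```

In this toy setup the first query is routed to the camera expert and the second to the LiDAR expert. In a degraded-sensing case, lower logits for the corrupted modality would steer queries toward the remaining experts, which is the behavior the caption attributes to MoRo-Former.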
