LeGo-Drive: Language-enhanced Goal-oriented Closed-Loop End-to-End Autonomous Driving

Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, K. Madhava Krishna

TL;DR

LeGo-Drive tackles language-conditioned autonomous driving by predicting a language-guided goal and jointly optimizing a differentiable trajectory planner within an end-to-end framework. The architecture combines a Visual-Language Network with a Frenet-frame differentiable planner, enabling iterative refinement of both the goal and the trajectory under vehicle and scene constraints. Through the LeGo-Drive dataset and end-to-end training (with loss terms $\mathcal{L}_{goal}$ and $\mathcal{L}_{planner}$), the approach achieves higher goal reachability and smoother, collision-free trajectories across diverse commands, outperforming baselines, including open-loop and decoupled variants. The results suggest strong potential for practical deployment in vision-language guided autonomous driving and intelligent transportation systems.

Abstract

Existing Vision-Language models (VLMs) estimate either long-term trajectory waypoints or a set of control actions as a reactive solution for closed-loop planning based on their rich scene comprehension. However, these estimates are coarse and depend on the models' "world understanding", so perception errors can lead to sub-optimal decisions. In this paper, we introduce LeGo-Drive, which addresses this issue by estimating a goal location from the given language command as an intermediate representation in an end-to-end setting. The estimated goal might fall in an undesirable region, such as on top of a car for a parking-like command, leading to inadequate planning. Hence, we propose to train the architecture end-to-end, so that the goal and the trajectory are refined iteratively and jointly. We validate the effectiveness of our method through comprehensive experiments conducted in diverse simulated environments. We report significant improvements in standard autonomous driving metrics, with a goal-reaching Success Rate of 81%. We further showcase the versatility of LeGo-Drive across different driving scenarios and linguistic inputs, underscoring its potential for practical deployment in autonomous vehicles and intelligent transportation systems.
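
The joint refinement of goal and trajectory can be illustrated with a toy sketch. This is not the paper's implementation: the Visual-Language Network is replaced by a fixed predicted goal, the Frenet-frame planner by a straight-line trajectory from the origin, and the two loss terms by hypothetical stand-ins (`goal_cost` keeps the goal near the language-grounded prediction, `planner_cost` penalizes trajectory points inside a circular obstacle). All function names, weights, and step sizes are assumptions chosen for illustration.

```python
import math


def goal_cost(goal, predicted_goal):
    """Stand-in goal term: squared distance to the predicted goal, plus gradient."""
    dx, dy = goal[0] - predicted_goal[0], goal[1] - predicted_goal[1]
    return dx * dx + dy * dy, [2.0 * dx, 2.0 * dy]


def planner_cost(goal, obstacle, radius, n_points=20):
    """Stand-in planner term: collision penalty along a straight line from the
    origin to `goal`, plus its gradient with respect to the goal."""
    cost, grad = 0.0, [0.0, 0.0]
    for i in range(1, n_points + 1):
        t = i / n_points
        # Trajectory point p = t * goal, so dp/dgoal = t.
        dx, dy = t * goal[0] - obstacle[0], t * goal[1] - obstacle[1]
        dist = math.hypot(dx, dy)
        violation = radius - dist
        if violation > 0.0:
            cost += violation ** 2
            scale = -2.0 * violation * t / max(dist, 1e-9)
            grad[0] += scale * dx
            grad[1] += scale * dy
    return cost, grad


def refine_goal(predicted_goal, obstacle, radius, weight=1.0, lr=0.05, steps=200):
    """Gradient descent on goal_cost + weight * planner_cost.

    Returns the refined goal and the total cost recorded before each step
    and once more after the final step."""
    goal = list(predicted_goal)
    history = []
    for _ in range(steps):
        g_cost, g_grad = goal_cost(goal, predicted_goal)
        p_cost, p_grad = planner_cost(goal, obstacle, radius)
        history.append(g_cost + weight * p_cost)
        goal[0] -= lr * (g_grad[0] + weight * p_grad[0])
        goal[1] -= lr * (g_grad[1] + weight * p_grad[1])
    g_cost, _ = goal_cost(goal, predicted_goal)
    p_cost, _ = planner_cost(goal, obstacle, radius)
    history.append(g_cost + weight * p_cost)
    return tuple(goal), history


# Hypothetical scene: the predicted goal lies inside the footprint of a parked car.
car = (10.0, 2.0)
predicted = (10.0, 1.5)
refined, costs = refine_goal(predicted, car, radius=2.0)
print("refined goal:", refined)
```

In this sketch the gradient updates the goal coordinates directly. In the end-to-end setting the abstract describes, the same planner gradient would instead flow back into the goal-prediction network, so that its predictions move away from unreachable regions during training.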

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The proposed method, LeGo-Drive, estimates a goal location from a navigation instruction ("Park near the bus stop on the front-left") queried on a single front-facing camera image, and couples it with a differentiable optimizer-based planner that jointly optimizes the trajectory and the goal location. (Left) The proposed architecture, shown with the gradient flow for joint end-to-end training. (Right-Top) Goal improvement from the initial estimate in Green to the improved location in Red. (Right-Bottom) Trajectory output in Green leading to the improved goal location, compared with the trajectory generated by the baseline in Red.
  • Figure 2: LeGo-Drive Architecture: Our architecture comprises two modules: (A) Goal Prediction module and (B) Differentiable Trajectory Planner. End-to-end training improves the goal and the trajectory jointly; the gradient flow used for this is shown. (Refer to Section IV-B for the trajectory variable definitions.)
  • Figure 3: Goal improvement for different object-centric parking commands. (Left) Front-view image on which the command is queried. (Right) Top-down view of the scene. The goal location improves from an undesirable location in Green (on top of the car in (a) and at the curb edge in (b)) to a reachable location in Red.
  • Figure 4: Results for turning commands. In both images (top, bottom), the initial goal in Green has a large offset from the lane centre. The model moves the improved goal, shown in Red, closer to the lane centre.
  • Figure 5: Qualitative results of trajectory improvement for different navigation instructions, each leading to an improved goal. The baseline ST-P3 trajectory, shown in Red, is consistently less smooth than ours, shown in Green. The third image in each row shows our planning in the Frenet frame, with the ego-vehicle as a Red rectangle, surrounding vehicles in Blue, the goal location as a Red cross, and lane bounds as solid Black lines.