TLCFuse: Temporal Multi-Modality Fusion Towards Occlusion-Aware Semantic Segmentation-Aided Motion Planning

Gustavo Salazar-Gomez, Wenqian Liu, Manuel Diaz-Zapata, David Sierra-Gonzalez, Christian Laugier

TL;DR

This paper proposes a novel architecture that enables the processing of temporal multi-step inputs, where the input at each time step comprises the spatial information encoded from fusing LiDAR and camera sensor readings.

Abstract

In autonomous driving, addressing occlusion scenarios is crucial yet challenging. Robust perception of the surroundings is essential for handling occlusions and aiding motion planning. State-of-the-art models fuse LiDAR and camera data to produce impressive perception results, but detecting occluded objects remains challenging. In this paper, we emphasize the crucial role of temporal cues by integrating them alongside these modalities to address this challenge. We propose a novel approach for bird's eye view (BEV) semantic grid segmentation that leverages sequential sensor data to achieve robustness against occlusions. Our model extracts information from the sensor readings using attention operations and aggregates this information into a lower-dimensional latent representation, thus enabling the processing of multi-step inputs at each prediction step. Moreover, we show how it can be directly applied to forecast the development of traffic scenes and be seamlessly integrated into a motion planner for trajectory planning. On the semantic segmentation tasks, we evaluate our model on the nuScenes dataset and show that it outperforms other baselines, with particularly large differences when evaluating on occluded and partially occluded vehicles. Additionally, on the motion planning task, we are among the early teams to train and evaluate on nuPlan, a cutting-edge large-scale dataset for motion planning.

Paper Structure (18 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 2: TLCFuse takes temporal LiDAR and camera data as input and accurately predicts vehicles' locations in the BEV at the reference time, even in highly occluded scenarios. It can also forecast the surrounding traffic scene up to 2 seconds ahead. As illustrated in the example, the white car visible at $T-1$ becomes occluded by the black SUV at $T$. TLCFuse captures the presence of the white car in its BEV segmentation at time $T$ (circled) and predicts the future trajectory of the moving black SUV (light blue trace within the circle). Leveraging these output BEV maps, TLCFuse forecasts the ego-vehicle's trajectory for the next 5 seconds.
  • Figure 3: Overview of our proposed approach. At the encoding stage, a sequence of consecutive feature tensors at times $T-2$, $T-1$ and $T$ is input, where each feature tensor comprises concatenated domain-specific features from LiDAR, Camera and egomotion. A low-dimensional latent representation $L$ is utilized to extract spatio-temporal information from the input through a cross-attention (CA) layer and a few self-attention (SA) layers sequentially. At the decoding stage, a BEV query is employed to extract information from $L$, which is then stored in the BEV feature vector through a cross-attention operation. Subsequently, a CNN module is applied to refine the BEV feature into the target semantic BEV grid.
  • Figure 4: We introduce a trajectory forecasting network designed to anticipate the future trajectory of the ego-vehicle. 5 BEV grid predictions of the Vehicle and 1 BEV grid of the Drivable Area serve as inputs to the predictor. The network analyzes these inputs to discern the most probable route for the ego-vehicle over the next 5 seconds.
  • Figure 5: Qualitative example of occlusion-aware BEV segmentation. The back-left camera of the ego-vehicle records two black cars being occluded by a silver car. Prior works LSS [philion-lss], FIERY [hu2021fiery] and LaRa [bartoccioni2022lara] fail to generate the locations of the occluded cars in their semantic maps (indicated by a red box), while TLCFuse correctly locates the occluded vehicles (indicated by a green box). Additionally, TLCFuse generates a clearer and sharper semantic map than the others (see the areas indicated by green circles).
  • Figure 6: (a) Qualitative results of one-shot multi-step future BEV grid prediction on the nuScenes dataset. We compare TLCFuse's results against the state-of-the-art model FIERY [hu2021fiery] and the ground truth. The predictions are color-coded for 2 seconds into the future, with green boxes and circles indicating accurate predictions compared to the ground truth, and red ones indicating mistakes. TLCFuse produces qualitatively satisfying predictions, showcasing its effective performance despite its naive prediction mechanism. (b) Qualitative results of trajectory forecasting on the nuScenes dataset. The predicted 5-second trajectory of the ego-vehicle is shown as red dots overlaid onto the semantic maps of the drivable area and surrounding vehicles at time 0s. The ground-truth future trajectory is drawn as a green line. We see that our model predicts a reasonable and accurate trajectory for the ego-vehicle.
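
The encode/decode flow described in the Figure 3 caption (a latent array cross-attending to the stacked inputs from $T-2$, $T-1$ and $T$, a few self-attention layers on the latent, then a BEV query cross-attending to the latent) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sizes, the random weights, the single attention head, the residual updates, and the helper names `attention` and `init` are all assumptions made for illustration, and the layer normalization, feed-forward blocks, and CNN refinement stage are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, w_q, w_k, w_v):
    """Single-head scaled dot-product attention (hypothetical helper).

    `queries` attend to `keys_values`; returns the attended values and
    the attention weights (one row per query, rows sum to 1).
    """
    q = queries @ w_q
    k = keys_values @ w_k
    v = keys_values @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def init(rng, d_in, d_out):
    """Random projection matrix standing in for learned weights."""
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
T_STEPS, N_TOKENS, D_IN = 3, 200, 32   # steps T-2, T-1, T; tokens per step
N_LATENT, D_LATENT = 16, 64            # low-dimensional latent array L
BEV_H, BEV_W = 8, 8                    # tiny BEV grid for the demo

# Stand-in for the concatenated LiDAR + camera + egomotion features,
# flattened so all time steps form one token sequence.
inputs = rng.normal(size=(T_STEPS, N_TOKENS, D_IN))
tokens = inputs.reshape(T_STEPS * N_TOKENS, D_IN)

# Encoding: the latent queries the full multi-step input (cross-attention).
latent = rng.normal(size=(N_LATENT, D_LATENT))
update, enc_weights = attention(
    latent, tokens,
    init(rng, D_LATENT, D_LATENT), init(rng, D_IN, D_LATENT), init(rng, D_IN, D_LATENT),
)
latent = latent + update  # residual update (assumed)

# A few self-attention layers operate on the latent alone.
for _ in range(2):
    update, _ = attention(
        latent, latent,
        init(rng, D_LATENT, D_LATENT), init(rng, D_LATENT, D_LATENT), init(rng, D_LATENT, D_LATENT),
    )
    latent = latent + update

# Decoding: one BEV query per grid cell reads from the latent (cross-attention).
bev_query = rng.normal(size=(BEV_H * BEV_W, D_LATENT))
bev_flat, dec_weights = attention(
    bev_query, latent,
    init(rng, D_LATENT, D_LATENT), init(rng, D_LATENT, D_LATENT), init(rng, D_LATENT, D_LATENT),
)
bev_features = bev_flat.reshape(BEV_H, BEV_W, D_LATENT)  # a CNN would refine this

print(enc_weights.shape, dec_weights.shape, bev_features.shape)
```

In this sketch the encoder attention matrix has shape `(N_LATENT, T_STEPS * N_TOKENS)`, so adding time steps grows the cost linearly in the number of input tokens rather than quadratically, which is one plausible reading of why a small latent makes multi-step inputs tractable.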