FIN: Fast Inference Network for Map Segmentation

Ruan Bispo, Tim Brophy, Reenu Mohandas, Anthony Scanlan, Ciarán Eising

TL;DR

This work tackles real-time map segmentation for autonomous driving by introducing FIN, a camera–radar fusion network that operates in BEV space. FIN combines a ResNet-50 image backbone, a PAN radar backbone, a radar‑assisted BEV projection (RVT), cross-modal MDCA fusion, and a lightweight U‑Net–based head, trained with a six-term loss set to balance accuracy and boundary precision. It achieves a mean IoU of 53.5 on nuScenes while running at ~26 FPS on an NVIDIA A100, representing a 260% speedup over strong baselines and demonstrating robust performance across challenging weather and lighting conditions. The results indicate FIN can deliver high-fidelity, real-time map segmentation with balanced per-class results, supporting safer planning and trajectory prediction in dynamic driving environments, while also highlighting remaining challenges in occluded and distant regions for future work.

Abstract

Multi-sensor fusion is increasingly common in autonomous vehicles because it offers a more robust alternative for several perception tasks. This need arises from the unique contribution of each sensor in collecting data: camera-radar fusion offers a cost-effective solution by combining rich semantic information from cameras with accurate distance measurements from radar, without incurring excessive financial costs or overwhelming data processing requirements. Map segmentation is a critical task for enabling effective vehicle behaviour in its environment, yet it continues to face significant challenges in achieving high accuracy and meeting real-time performance requirements. Therefore, this work presents a novel and efficient map segmentation architecture, using cameras and radars, in the bird's-eye view (BEV) space. The architecture is designed for real-time operation while accounting for accuracy, per-class balance, and inference time. To accomplish this, we use an advanced loss set together with a new lightweight head to improve the perception results. Our results show that, with these modifications, our approach achieves results comparable to large models, reaching 53.5 mIoU, while also setting a new benchmark for inference time, improving it by 260% over the strongest baseline models.
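For readers unfamiliar with the headline metric, the following is a minimal sketch of how a mean IoU (mIoU) score can be computed for BEV map segmentation, assuming each map class is predicted as an independent binary grid. The class names, the `Mask` alias, and the function names are hypothetical and chosen for illustration; the paper's actual evaluation protocol (thresholds, class list, grid size) is not given in this section.

```python
from typing import Dict, List, Sequence

# Assumption: each class is a 2D binary BEV grid (1 = class present in the cell).
Mask = Sequence[Sequence[int]]


def class_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union between two binary grids of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    # A class absent from both grids has no defined IoU.
    return inter / union if union else float("nan")


def mean_iou(preds: Dict[str, Mask], targets: Dict[str, Mask]) -> float:
    """Average the per-class IoU values and report the mean as a percentage."""
    ious: List[float] = [class_iou(preds[name], targets[name]) for name in targets]
    valid = [v for v in ious if v == v]  # NaN != NaN, so this drops undefined classes
    return 100.0 * sum(valid) / len(valid) if valid else float("nan")


# Toy 2x2 example with two hypothetical classes.
example_preds = {
    "drivable_area": [[1, 1], [0, 0]],
    "lane_divider": [[0, 1], [0, 1]],
}
example_targets = {
    "drivable_area": [[1, 1], [1, 0]],
    "lane_divider": [[0, 1], [0, 0]],
}
example_score = mean_iou(example_preds, example_targets)
```

Because every class contributes equally to the average, a model that does well on large classes but poorly on thin ones (such as dividers) is penalised, which is why per-class balance matters for this metric.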

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The overall architecture of FIN. First, radar and camera features are extracted in parallel using their respective backbones. Second, the RVT module converts these features into a BEV representation. Third, the Multi-Modal Feature Aggregation module merges and refines the radar and camera BEV features, which are then fed into the Segmentation Head.
  • Figure 2: Map Segmentation Head showing how the BEV features are transformed into the map segmentation, where H, W, and C are the BEV height, width, and channels, respectively. The skip connections are concatenated to the UpConv features before the convolutional block, except at the bottleneck stage, where no upsampled features exist; in that case, the features are simply forwarded (white box) without concatenation.
  • Figure 3: mIoU degradation per class at different distances from the ego vehicle. In this experiment, the full validation set was used. The blue area [0,25] represents the safe driving distance considering the speed limit of 50 km/h and 1 s of reaction time.
  • Figure 4: mIoU degradation per weather condition at different distances from the ego vehicle. The blue area [0,25] represents the safe driving distance considering the speed limit of 50 km/h and 1 s of reaction time.
  • Figure 5: Visualisation of the qualitative results. Panels (a) and (b) show FIN's predictions for two different scenes.
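The distance-based analysis in Figures 3 and 4 can be approximated by scoring IoU separately inside distance bands around the ego vehicle. The sketch below is one possible way to do this, not the authors' evaluation code: it assumes the ego vehicle sits at the centre of the BEV grid, that bands are defined by Euclidean distance, and that `cell_size_m` and `bin_edges_m` (both hypothetical parameters) describe the grid resolution and band limits in metres.

```python
import math
from typing import List, Sequence

# Assumption: a 2D binary BEV grid for one class (1 = class present in the cell).
Mask = Sequence[Sequence[int]]


def iou_by_distance(
    pred: Mask,
    target: Mask,
    cell_size_m: float,
    bin_edges_m: Sequence[float],
) -> List[float]:
    """Return one IoU value per distance band [edge_i, edge_{i+1})."""
    rows, cols = len(target), len(target[0])
    # Assumption: the ego vehicle is at the geometric centre of the grid.
    centre_r, centre_c = (rows - 1) / 2.0, (cols - 1) / 2.0
    n_bins = len(bin_edges_m) - 1
    inter = [0] * n_bins
    union = [0] * n_bins
    for r in range(rows):
        for c in range(cols):
            dist_m = math.hypot(r - centre_r, c - centre_c) * cell_size_m
            for b in range(n_bins):
                if bin_edges_m[b] <= dist_m < bin_edges_m[b + 1]:
                    p, t = bool(pred[r][c]), bool(target[r][c])
                    inter[b] += int(p and t)
                    union[b] += int(p or t)
                    break  # each cell belongs to at most one band
    # Bands where the class is absent from both grids are reported as NaN.
    return [i / u if u else float("nan") for i, u in zip(inter, union)]
```

Calling this once per class and averaging the per-band values across classes would give a distance curve comparable in spirit to the ones plotted; cells farther than the last edge are ignored.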