HSDA: High-frequency Shuffle Data Augmentation for Bird's-Eye-View Map Segmentation

Calvin Glisson, Qiuxiao Chen

TL;DR

This work shows that high-frequency content in camera images is essential for accurate BEV map segmentation. It introduces HSDA, a lightweight FFT-based augmentation that randomly shuffles dominant high-frequency magnitudes in a randomly chosen color channel, preserving ground truth while challenging the model to relate high-frequency cues to the BEV map. The method attains state-of-the-art camera-only BEV performance on nuScenes (mIoU $=61.3\%$) when combined with a strong baseline (RGC), and its benefits generalize to other perception tasks like monocular 3D detection on KITTI. Overall, HSDA demonstrates that frequency-domain augmentation can effectively enhance edge- and detail-focused segmentation without architectural changes.

Abstract

Autonomous driving has garnered significant attention in recent research, and Bird's-Eye-View (BEV) map segmentation plays a vital role in the field, providing the basis for safe and reliable operation. While data augmentation is a commonly used technique for improving BEV map segmentation networks, existing approaches predominantly focus on manipulating spatial domain representations. In this work, we investigate the potential of frequency domain data augmentation for camera-based BEV map segmentation. We observe that high-frequency information in camera images is particularly crucial for accurate segmentation. Based on this insight, we propose High-frequency Shuffle Data Augmentation (HSDA), a novel data augmentation strategy that enhances a network's ability to interpret high-frequency image content. This approach encourages the network to distinguish relevant high-frequency information from noise, leading to improved segmentation results for small and intricate image regions, as well as sharper edge and detail perception. Evaluated on the nuScenes dataset, our method demonstrates broad applicability across various BEV map segmentation networks, achieving a new state-of-the-art mean Intersection over Union (mIoU) of 61.3% for camera-only systems. This significant improvement underscores the potential of frequency domain data augmentation for advancing the field of autonomous driving perception. Code has been released: https://github.com/Zarhult/HSDA

Paper Structure

This paper contains 16 sections, 6 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of high-frequency spectrum, low-frequency spectrum and the corresponding images. (a) Original image. (b) High-frequency spectrum and corresponding image. The image becomes primarily dark but retains key edges and outlines. (c) Low-frequency spectrum and corresponding image. All sharp edges and rapid visual changes are removed, effectively blurring the image.
  • Figure 2: Overview of our baseline network architecture. It begins by processing multi-view camera images using an image encoder to extract features. These features are then transformed into the BEV space using a view transformation module that leverages camera intrinsics and extrinsics. Subsequently, a BEV encoder processes the transformed features, which are then passed to a segmentation head to generate the final BEV map segmentation predictions.
  • Figure 3: The proposed High-frequency Shuffle Data Augmentation (HSDA) method introduces perturbations in the high-frequency domain. HSDA operates on a randomly selected color channel, applying the Fast Fourier Transform (FFT) and filtering to obtain high-frequency and low-frequency components. The most salient $K$ frequencies within the high-frequency spectrum are shuffled to introduce controlled noise, which we emphasize in $\hat{A}^C$ for ease of visualization. Recombining with the original low-frequency spectrum and applying the inverse Fast Fourier Transform (iFFT) yields the augmented single-channel image. This replaces the original channel to generate the final augmented image. In this example, the green channel is randomly chosen from RGB channels for shuffling, causing the green color information in the final image to be perturbed. This creates a grid-like pattern of regions with excess or insufficient green intensity.
  • Figure 4: Illustration of one sample rainy scene. Red circles highlight differences between the predicted and ground truth segmentation.
  • Figure 5: Qualitative results are presented for daytime, rainy, and nighttime scenarios. The left panels display multi-view input images, while the right panels compare ground truth annotations (denoted as "GT") with the output of our proposed method, RGC+HSDA (denoted as "ours"). Six categories are annotated in the right panels: drivable area, pedestrian crossing, walkway, stop line, carpark area, and divider.
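The pipeline described in the Figure 3 caption can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' released implementation: the function name `hsda_sketch`, the circular high-pass mask with cutoff `radius`, the default values of `radius` and `k`, keeping the phase fixed while shuffling magnitudes, and taking the real part after the inverse FFT are all assumptions made for this example.

```python
import numpy as np


def hsda_sketch(image, radius=16, k=32, rng=None):
    """Hypothetical HSDA-style augmentation for an H x W x 3 image.

    Returns the augmented float image and the index of the perturbed channel.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = image.astype(np.float64)  # astype returns a copy, input is untouched
    h, w, c = out.shape
    ch = int(rng.integers(c))  # randomly selected color channel

    # Centered 2-D spectrum of the chosen channel, split into magnitude/phase.
    spec = np.fft.fftshift(np.fft.fft2(out[:, :, ch]))
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumed filter: everything outside a disc around DC is "high frequency".
    yy, xx = np.ogrid[:h, :w]
    high = np.hypot(yy - h // 2, xx - w // 2) > radius
    hi_idx = np.flatnonzero(high)
    k = min(k, hi_idx.size)
    if k == 0:
        return out, ch  # nothing to shuffle

    # Flat indices of the K largest high-frequency magnitudes.
    flat_mag = mag.ravel().copy()
    top = hi_idx[np.argpartition(flat_mag[hi_idx], -k)[-k:]]

    # Shuffle those magnitudes among themselves; low frequencies stay as-is.
    flat_mag[top] = rng.permutation(flat_mag[top])
    new_spec = flat_mag.reshape(h, w) * np.exp(1j * phase)

    # Inverse FFT; the shuffle breaks conjugate symmetry, so keep the real part.
    out[:, :, ch] = np.fft.ifft2(np.fft.ifftshift(new_spec)).real
    return out, ch


if __name__ == "__main__":
    demo = np.random.default_rng(0).random((64, 96, 3))
    augmented, channel = hsda_sketch(demo, radius=8, k=16)
    print(augmented.shape, channel)
```

Because only the image is modified, the BEV ground truth can be reused unchanged, which matches the label-preserving property described in the TL;DR. The sketch does not clip the result to a valid pixel range; a real training pipeline would likely need to do so before normalization.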