Lightweight Road Environment Segmentation using Vector Quantization

Jiyong Kwag, Alper Yilmaz, Charles Toth

TL;DR

This work tackles efficient semantic segmentation of road environments for autonomous driving by replacing continuous encoder features with discrete latent representations obtained through vector quantization. It couples a vector quantization layer to a lightweight MobileUNETR encoder–decoder and trains with a combination of a segmentation loss and VQ losses, so that the encoder and the codebook are learned together. On Cityscapes, the approach achieves 77.0% mIoU, a 2.9-point improvement over the MobileUNETR baseline, while preserving the original model size and compute; it also outperforms SegFormer B0. The results indicate that discrete latent representations can improve segmentation precision and edge detail without sacrificing efficiency, making the approach suitable for real-time autonomous driving systems.
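The summary above mentions a segmentation loss combined with VQ losses without giving their form. As a hedged sketch, the block below writes out the standard VQ-VAE style objective, which is an assumption here and may differ from the paper's own equations. The symbols are illustrative: $z_e$ is one continuous encoder feature, $e_j$ is the $j$-th of $K$ codebook vectors, $\operatorname{sg}[\cdot]$ is the stop-gradient operator, $\beta$ is a commitment weight, and $\mathcal{L}_{\text{seg}}$ is whatever segmentation loss is used.

```latex
\begin{aligned}
k^{*} &= \arg\min_{j \in \{1,\dots,K\}} \lVert z_e - e_j \rVert_2, \qquad z_q = e_{k^{*}}, \\
\mathcal{L} &= \mathcal{L}_{\text{seg}}
  + \lVert \operatorname{sg}[z_e] - e_{k^{*}} \rVert_2^2
  + \beta \, \lVert z_e - \operatorname{sg}[e_{k^{*}}] \rVert_2^2 .
\end{aligned}
```

Under this formulation, the second term moves the selected codebook vector toward the encoder output, and the third term keeps the encoder output close to its assigned code, which is one common way to learn both the encoder and the codebook jointly.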

Abstract

Road environment segmentation plays a significant role in autonomous driving. Numerous works based on Fully Convolutional Networks (FCNs) and Transformer architectures have been proposed to leverage local and global contextual learning for efficient and accurate semantic segmentation. In both architectures, the encoder often relies heavily on extracting continuous representations from the image, which limits its ability to represent meaningful discrete information. To address this limitation, we propose segmenting the autonomous driving environment using vector quantization. Vector quantization offers three primary advantages for road environment segmentation. (1) Each continuous feature from the encoder is mapped to a discrete vector from the codebook, helping the model discover distinct features more easily than with complex continuous features. (2) Because the discrete features act as compressed versions of the encoder's continuous features, they also compress away noise and outliers, which benefits the segmentation task. (3) Vector quantization encourages the latent space to form coarse clusters of continuous features, forcing the model to group similar features and making the learned representations more structured for the decoding process. In this work, we combine vector quantization with the lightweight image segmentation model MobileUNETR, and we use the unmodified MobileUNETR as the baseline for comparison to demonstrate the efficiency of our approach. In experiments, we achieve 77.0% mIoU on Cityscapes, outperforming the baseline by 2.9 percentage points without increasing the model's initial size or complexity.

Paper Structure

This paper contains 12 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Per-category segmentation IoU results on the Cityscapes validation set. The graph displays the IoU evaluation for each category among SegFormer B0, MobileUNETR, and our proposed model. The results indicate that our architecture achieves better performance than the baseline models across 14 categories.
  • Figure 2: The proposed architecture combines MobileUNETR with vector quantization to transform the continuous representations extracted by the encoder into discrete representations. After the vector quantization layer, these discrete representations are processed by the MobileUNETR decoder to complete the segmentation task.
  • Figure 3: (Top) The MobileUNETR encoder utilizes a pretrained MobileViT encoder, composed of MobileNet V2 blocks and Transformer encoder blocks, for efficient feature extraction. (Bottom) The MobileUNETR decoder mirrors the encoder structure, but instead upsamples the features from the vector quantization layer for semantic segmentation.
  • Figure 4: Simplified representation of the vector quantization layer: each continuous representation from the encoder is mapped to the nearest predefined discrete feature from the codebook, which is then used as input to the decoder.
  • Figure 5: Visualization results on Cityscapes. Compared to the baseline model MobileUNETR (Right), our model (Center) predicts segmentation with more precise object edges. We also provide a comparison with SegFormer B0 (Left), showing that, despite a smaller size and lower FLOPs, vector quantization enhances performance over the baseline and outperforms SegFormer B0.
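The nearest-codebook mapping described in the caption of Figure 4 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `quantize`, the array shapes, the use of squared L2 distance, and the toy codebook values are all assumptions chosen for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    features: array of shape (..., D), e.g. an (H, W, D) encoder feature map.
    codebook: array of shape (K, D) holding K discrete code vectors.
    Returns (quantized, indices): quantized has the same shape as features,
    indices has the leading shape of features (one code index per vector).
    """
    lead_shape = features.shape[:-1]
    flat = features.reshape(-1, features.shape[-1])       # (N, D)

    # Squared L2 distance from every feature to every code: shape (N, K).
    diffs = flat[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)

    indices = np.argmin(dists, axis=1)                    # nearest code per feature
    quantized = codebook[indices]                         # look up the code vectors

    return quantized.reshape(features.shape), indices.reshape(lead_shape)


# Toy example (hypothetical values): three 2-D codes, four encoder features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [0.6, 0.6]])

quantized, indices = quantize(features, codebook)
print(indices)    # index of the selected code for each feature
print(quantized)  # the discrete vectors that would be passed to the decoder
```

In a trainable model, the `argmin` lookup is not differentiable, so gradients are typically passed from the quantized output back to the encoder with a straight-through estimator; that step is omitted here because the sketch only covers the forward mapping.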