LeMoRe: Learn More Details for Lightweight Semantic Segmentation

Mian Muhammad Naeem Abid, Nancy Mehta, Zongwei Wu, Radu Timofte

TL;DR

LeMoRe tackles the efficiency–accuracy trade-off in semantic segmentation by integrating explicit Cartesian views with implicitly learned views via Nested Attention. It introduces three components—Cartesian Encoder, Nested Attention, and a Gated Fusion Module—to enable multiview feature modeling with reduced computation and memory. Across ADE20K, CityScapes, PASCAL Context, and COCO-Stuff, LeMoRe delivers competitive accuracy while achieving substantial GFLOPs and parameter reductions, outperforming many lightweight baselines. This explicit–implicit multiview approach offers a practical path toward real-time segmentation on resource-constrained devices.

Abstract

Lightweight semantic segmentation is essential for many downstream vision tasks. Unfortunately, existing methods often struggle to balance efficiency and performance due to the complexity of feature modeling. Many of these existing approaches are constrained by rigid architectures and implicit representation learning, often characterized by parameter-heavy designs and a reliance on computationally intensive Vision Transformer-based frameworks. In this work, we introduce an efficient paradigm by synergizing explicit and implicit modeling to balance computational efficiency with representational fidelity. Our method combines well-defined Cartesian directions with explicitly modeled views and implicitly inferred intermediate representations, efficiently capturing global dependencies through a nested attention mechanism. Extensive experiments on challenging datasets, including ADE20K, CityScapes, Pascal Context, and COCO-Stuff, demonstrate that LeMoRe strikes an effective balance between performance and efficiency.

Paper Structure

This paper contains 12 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) We propose replacing costly feature modeling with interpretable projections into lower-dimensional spaces for improved efficiency. These projections, or views, capture various feature aspects. Explicit views align with predefined directions like Cartesian coordinates, while implicit views represent intermediate positions learned through nested attention for global dependencies. (b) Our method performs better than counterpart solutions at lower computational cost (see the ADE20K results section).
  • Figure 2: Architecture Overview: (a) shows the proposed architecture of LeMoRe, highlighting how our method balances efficiency and performance through multiview modeling. (b) depicts the Cartesian Encoder, which enhances contextual understanding and feature richness by extracting explicit views in three dimensions using well-defined Cartesian directions. (c) illustrates the Nested Attention mechanism, which enriches the feature modeling by learning complex relationships within the data through implicit views, maintaining low computational cost via efficient attention over each query-key pair. (d) presents the Gated Fusion Module (GFM), which dynamically fuses global and local features to improve segmentation performance.
  • Figure 3: Visualization of Image, Ground Truth, SegFormer, and LeMoRe results on the ADE20K validation set. The results highlight the proposed model's effectiveness in producing high-quality segmentation maps with improved spatial consistency.
  • Figure 4: Design of Feed-Forward Network.
  • Figure 5: Visualization of Image, Ground Truth, SegFormer, and LeMoRe results on the ADE20K validation set highlights the proposed model's effectiveness in producing high-quality segmentation maps with enhanced spatial consistency.
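
The captions for Figures 1 and 2 describe two ideas that a small example can make concrete: projecting a feature tensor onto axis-aligned lower-dimensional views, and fusing local and global features through a gate. The sketch below is illustrative only and is not the authors' implementation. It assumes mean pooling as the projection and an elementwise sigmoid gate as the fusion rule, neither of which is stated in this summary. The names `axis_views` and `gated_fusion` are hypothetical.

```python
import math

def axis_views(feat):
    """Project a C x H x W nested-list tensor onto three axis-aligned 2D views.

    Assumption: each view is obtained by mean pooling along one axis.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # (H, W) view: average across channels.
    hw = [[sum(feat[c][h][w] for c in range(C)) / C for w in range(W)]
          for h in range(H)]
    # (C, W) view: average across height.
    cw = [[sum(feat[c][h][w] for h in range(H)) / H for w in range(W)]
          for c in range(C)]
    # (C, H) view: average across width.
    ch = [[sum(feat[c][h][w] for w in range(W)) / W for h in range(H)]
          for c in range(C)]
    return hw, cw, ch

def gated_fusion(local, global_, gate_logits):
    """Blend two flat feature vectors elementwise.

    Assumption: fused = a * local + (1 - a) * global, with a = sigmoid(logit).
    """
    fused = []
    for l, g, z in zip(local, global_, gate_logits):
        a = 1.0 / (1.0 + math.exp(-z))  # gate value in (0, 1)
        fused.append(a * l + (1.0 - a) * g)
    return fused

# Toy tensor with C=2, H=2, W=2.
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]
hw, cw, ch = axis_views(feat)
print(hw)  # view over (H, W)
print(cw)  # view over (C, W)
print(ch)  # view over (C, H)
print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))
```

Under these assumptions, the three views together hold HW + CW + CH values instead of the CHW values of the full tensor, which is the kind of saving the Figure 1 caption attributes to lower-dimensional projections. A zero gate logit gives equal weight to the local and global inputs.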