MixerCSeg: An Efficient Mixer Architecture for Crack Segmentation via Decoupled Mamba Attention

Zilong Zhao, Zhengming Ding, Pei Niu, Wenhao Sun, Feng Guo

TL;DR

This work presents MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder.

Abstract

Feature encoders play a key role in pixel-level crack segmentation by shaping the representation of fine textures and thin structures. Existing CNN-, Transformer-, and Mamba-based models each capture only part of the required spatial or structural information, leaving clear gaps in modeling complex crack patterns. To address this, we present MixerCSeg, a mixer architecture designed like a coordinated team of specialists, where CNN-like pathways focus on local textures, Transformer-style paths capture global dependencies, and Mamba-inspired flows model sequential context within a single encoder. At the core of MixerCSeg is the TransMixer, which explores Mamba's latent attention behavior while establishing dedicated pathways that naturally express both locality and global awareness. To further enhance structural fidelity, we introduce a spatial block processing strategy and a Direction-guided Edge Gated Convolution (DEGConv) that strengthens edge sensitivity under irregular crack geometries with minimal computational overhead. A Spatial Refinement Multi-Level Fusion (SRF) module is then employed to refine multi-scale details without increasing complexity. Extensive experiments on multiple crack segmentation benchmarks show that MixerCSeg achieves state-of-the-art performance with only 2.05 GFLOPs and 2.54 M parameters, demonstrating both efficiency and strong representational capability. The code is available at https://github.com/spiderforest/MixerCSeg.
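The dedicated-pathway idea behind TransMixer can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a fixed channel split (`n_global`, a hypothetical parameter), uses attention with identity projections in place of learned weights, and substitutes a neighborhood average for the local refinement step. It only shows how one group of channels can mix information across all tokens while the other stays confined to a small neighborhood.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head dot-product attention with identity projections (no learned weights)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

def local_refine(tokens, radius=1):
    """Average each token with neighbors within `radius` (stand-in for a local convolution)."""
    n, d = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        window = tokens[max(0, i - radius): min(n, i + radius + 1)]
        out.append([sum(t[c] for t in window) / len(window) for c in range(d)])
    return out

def mixer_sketch(tokens, n_global):
    """Split channels into two groups, route each through its pathway, then concatenate."""
    global_part = [t[:n_global] for t in tokens]   # assumed "global" channels
    local_part = [t[n_global:] for t in tokens]    # assumed "local" channels
    return [g + l for g, l in zip(self_attention(global_part), local_refine(local_part))]

# Three tokens with four channels each; the first two channels are treated as global.
tokens = [[1.0, 0.0, 0.0, 4.0], [0.0, 1.0, 2.0, 2.0], [1.0, 1.0, 4.0, 0.0]]
mixed = mixer_sketch(tokens, n_global=2)
```

In this toy setting the global channels of every output token are a weighted combination of all input tokens, while the local channels depend only on the immediate neighbors. How MixerCSeg actually selects and processes the two groups is described in the paper, not here.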

Paper Structure (24 sections, 12 equations, 6 figures, 11 tables)

This paper contains 24 sections, 12 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Heatmaps and attention maps in VMamba [vmamba]. Thanks to the multi-directional scanning mechanism, tokens in global channels attain a receptive field that covers the entire image rather than only the previously scanned (historical) context. In contrast, local channels attend mainly to neighboring regions.
  • Figure 2: Overview of the proposed method. (a) illustrates the overall architecture of MixerCSeg. TransMixer blocks and downsampling operators are connected in series to extract multi-scale features. Before decoding, the DEGConv module is employed to enhance crack texture details. (b) demonstrates the design of the DEGConv, which integrates a spatial block strategy and directional prior knowledge.
  • Figure 3: The difference between TransMixer and existing methods. (a) illustrates two common patterns in existing methods that hybridize Mamba blocks and Transformer blocks. (b) shows the design details of the TransMixer module, which decomposes features into global and local representations, enhancing them with Self-Attention and Local Refinement, respectively.
  • Figure 4: Visualization results of MixerCSeg versus state-of-the-art methods in various environments.
  • Figure 5: Representative crack images from the Crack500 dataset.
  • ...and 1 more figure
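The receptive-field claim in the Figure 1 caption can be checked on a toy grid. The sketch below is a simplified illustration, not VMamba's code: it assumes four scan orders (row-major, column-major, and their reverses) and treats each scan as strictly causal, so a token sees only the positions visited before it in that order. The helper names `scan_orders` and `visible` are hypothetical.

```python
def scan_orders(h, w):
    """Four traversal orders over an h x w grid: row-major, column-major, and their reverses."""
    row_major = [(r, c) for r in range(h) for c in range(w)]
    col_major = [(r, c) for c in range(w) for r in range(h)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def visible(order, target):
    """Positions a causal scan has visited by the time it reaches `target` (inclusive)."""
    return set(order[: order.index(target) + 1])

h, w = 3, 4
target = (1, 1)
orders = scan_orders(h, w)

# One direction only: the token sees just the "historical" part of the grid.
single = visible(orders[0], target)

# All directions merged: the union of the four causal views.
merged = set().union(*(visible(order, target) for order in orders))

print(len(single), len(merged))  # prints "6 12"
```

A single row-major scan reaches only 6 of the 12 positions by the time it arrives at `(1, 1)`, whereas the merged view from all four directions covers the full grid, which is the behavior the caption attributes to global channels.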