Light-weight Retinal Layer Segmentation with Global Reasoning
Xiang He, Weiye Song, Yiming Wang, Fabio Poiesi, Ji Yi, Manishi Desai, Quanqing Xu, Kongzheng Yang, Yi Wan
TL;DR
LightReSeg introduces a lightweight encoder-decoder architecture for retinal layer segmentation that integrates a Transformer-based global reasoning block at the deepest encoder scale with a multi-scale asymmetric attention (MAA) module for robust skip-feature fusion. The design employs depthwise separable and asymmetric convolutions to maintain a small parameter footprint while preserving segmentation accuracy. Across Vis-105H, Glaucoma, and DME datasets, LightReSeg achieves state-of-the-art performance in mIoU and mPA with only 3.3M parameters, and ablation studies confirm substantial gains from the MAA and Transformer components. The work emphasizes practical deployment in clinical OCT devices, highlighting both high accuracy and real-time inference potential, with plans to broaden datasets and improve domain generalization.
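To make the parameter savings from depthwise separable and asymmetric convolutions concrete, the following is a minimal sketch that counts weights for each layer type. The helper names, the 64-channel width, and the 3x3 kernel size are illustrative assumptions, not values taken from LightReSeg, and the asymmetric variant shown (a k x 1 convolution followed by a 1 x k convolution) is one common form rather than necessarily the paper's exact layer.

```python
# Hypothetical helpers for illustration; names and layer sizes are assumptions.

def conv_params(c_in, c_out, kh, kw, bias=False):
    """Weights in a standard 2D convolution with a kh x kw kernel."""
    return c_in * c_out * kh * kw + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k, bias=False):
    """Depthwise k x k conv (one filter per input channel) plus a 1x1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def asymmetric_params(c_in, c_out, k, bias=False):
    """A k x 1 conv followed by a 1 x k conv, used in place of one k x k conv."""
    first = conv_params(c_in, c_out, k, 1, bias)
    second = conv_params(c_out, c_out, 1, k, bias)
    return first + second


if __name__ == "__main__":
    c_in, c_out, k = 64, 64, 3  # assumed sizes, chosen only for the comparison
    print("standard 3x3:        ", conv_params(c_in, c_out, k, k))                # 36864
    print("depthwise separable: ", depthwise_separable_params(c_in, c_out, k))   # 4672
    print("asymmetric 3x1 + 1x3:", asymmetric_params(c_in, c_out, k))            # 24576
```

For these assumed sizes, the depthwise separable layer needs roughly one eighth of the weights of the standard convolution, and the asymmetric pair needs two thirds.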
Abstract
Automatic retinal layer segmentation in medical images, such as optical coherence tomography (OCT) images, is an important tool for diagnosing ophthalmic diseases. However, accurate segmentation is challenging because of the low contrast and blood-flow noise present in the images. In addition, the algorithm should be light-weight so that it can be deployed in practical clinical applications. It is therefore desirable to design a light-weight network with high performance for retinal layer segmentation. In this paper, we propose LightReSeg, a retinal layer segmentation network that can be applied to OCT images. Specifically, our approach follows an encoder-decoder structure. The encoder employs multi-scale feature extraction and a Transformer block to fully exploit the semantic information of feature maps at all scales and to give the features better global reasoning capabilities. In the decoder, we design a multi-scale asymmetric attention (MAA) module to preserve the semantic information at each encoder scale. Experiments show that, with only 3.3M parameters, our approach achieves better segmentation performance than the current state-of-the-art method TransUnet (105.7M parameters) on both our collected dataset and two other public datasets.
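Segmentation performance of this kind is commonly summarized with mean intersection-over-union (mIoU) and mean pixel accuracy (mPA). The sketch below computes both from a confusion matrix over flattened label maps. It assumes the usual per-class definitions averaged over classes present in the ground truth; the exact averaging used for LightReSeg (for example, whether background is included) is not stated here, and the function names are hypothetical.

```python
# Illustrative sketch; function names and the averaging convention are assumptions.

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with true label t that were predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou_and_mpa(cm):
    """Return (mIoU, mPA), averaged over classes that occur in the ground truth."""
    n = len(cm)
    ious, accs = [], []
    for c in range(n):
        tp = cm[c][c]
        gt = sum(cm[c])                          # pixels whose true label is c
        pred = sum(cm[r][c] for r in range(n))   # pixels predicted as c
        if gt == 0:
            continue                             # class absent from ground truth
        ious.append(tp / (gt + pred - tp))       # intersection over union
        accs.append(tp / gt)                     # per-class pixel accuracy
    return sum(ious) / len(ious), sum(accs) / len(accs)


if __name__ == "__main__":
    # Toy flattened label maps with three classes (made-up values).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    miou, mpa = mean_iou_and_mpa(confusion_matrix(y_true, y_pred, 3))
    print(f"mIoU={miou:.3f}  mPA={mpa:.3f}")
```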
