DualStreamFoveaNet: A Dual Stream Fusion Architecture with Anatomical Awareness for Robust Fovea Localization

Sifan Song, Jinfeng Wang, Zilong Wang, Hongxing Wang, Jionglong Su, Xiaowei Ding, Kang Dang

TL;DR

DualStreamFoveaNet (DSFN) is a transformer-based architecture for multi-cue fusion. It explicitly incorporates long-range connections and global features from retina and vessel distributions for robust fovea localization, and it reduces computational cost by decreasing the number of tokens.

Abstract

Accurate fovea localization is essential for analyzing retinal diseases to prevent irreversible vision loss. While current deep learning-based methods outperform traditional ones, they still face challenges such as the lack of local anatomical landmarks around the fovea, the inability to robustly handle diseased retinal images, and the variations in image conditions. In this paper, we propose a novel transformer-based architecture called DualStreamFoveaNet (DSFN) for multi-cue fusion. This architecture explicitly incorporates long-range connections and global features using retina and vessel distributions for robust fovea localization. We introduce a spatial attention mechanism in the dual-stream encoder to extract and fuse self-learned anatomical information, focusing more on features distributed along blood vessels and significantly reducing computational costs by decreasing token numbers. Our extensive experiments show that the proposed architecture achieves state-of-the-art performance on two public datasets and one large-scale private dataset. Furthermore, we demonstrate that the DSFN is more robust on both normal and diseased retina images and has better generalization capacity in cross-dataset experiments.

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Illustration of the shared global anatomical relationships among the fovea, optic disc, and main blood vessels in normal (a), diseased (b, c), and poorly conditioned (d) retina images.
  • Figure 2: The overall architecture of our proposed DualStreamFoveaNet (DSFN) network. It consists of a two-stream encoder that incorporates long-range features from both fundus and vessel distributions, and a decoder that generates accurate segmentation results by incorporating multi-scale features.
  • Figure 3: The structure of the BTI module used in our DualStreamFoveaNet (DSFN). It contains a TokenLearner, $t\times$ MHSA layers, and a TokenFuser. $h$, $w$, and $c$ denote the height, width, and number of channels of the corresponding input features, and $n$ is the number of learned tokens.
  • Figure 4: Mean errors ($Y$-axis) of different multi-cue fusion models plotted against computational cost in GFLOPs ($X$-axis). Red and blue markers show results on the PALM and Tisu datasets, respectively. The number below each marker is the corresponding mean error.
  • Figure 5: Visual comparison of fovea localization predictions by various methods.
  • ...and 2 more figures
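
The Figure 3 caption describes a three-stage pipeline: a TokenLearner reduces an $h \times w \times c$ feature map to $n$ tokens, $t$ self-attention layers operate on those tokens, and a TokenFuser maps them back to the spatial grid. The NumPy sketch below illustrates that general pattern only; it is not the authors' implementation. The function names, the random weights, the single-head attention (a simplification of MHSA), the sigmoid remapping in the fuser, and the residual connections are all assumptions made for illustration, and the dual-stream fusion of fundus and vessel features is not modeled.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_learner(feat, w_attn):
    # feat: (h*w, c), w_attn: (c, n). Each of the n columns is a spatial
    # attention map over all h*w positions; tokens are attention-weighted sums.
    attn = softmax(feat @ w_attn, axis=0)      # (h*w, n)
    return attn.T @ feat                       # (n, c)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head self-attention over n tokens (assumed stand-in for MHSA).
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n, n)
    return tokens + scores @ v                 # residual connection (assumed)

def token_fuser(tokens, feat, w_fuse):
    # Redistribute the n tokens to every spatial position via sigmoid weights.
    weights = 1.0 / (1.0 + np.exp(-(feat @ w_fuse)))  # (h*w, n)
    return feat + weights @ tokens             # (h*w, c)

def token_bottleneck_block(feat_map, params):
    # Hypothetical block: TokenLearner -> t attention layers -> TokenFuser.
    h, w, c = feat_map.shape
    feat = feat_map.reshape(h * w, c)
    tokens = token_learner(feat, params["attn"])
    for w_q, w_k, w_v in params["layers"]:     # t = len(params["layers"])
        tokens = self_attention(tokens, w_q, w_k, w_v)
    fused = token_fuser(tokens, feat, params["fuse"])
    return fused.reshape(h, w, c), tokens

# Illustrative sizes only; none of these values come from the paper.
h, w, c, n, t = 16, 16, 32, 8, 2
rng = np.random.default_rng(0)
params = {
    "attn": 0.1 * rng.standard_normal((c, n)),
    "layers": [tuple(0.1 * rng.standard_normal((c, c)) for _ in range(3))
               for _ in range(t)],
    "fuse": 0.1 * rng.standard_normal((c, n)),
}
out, tokens = token_bottleneck_block(rng.standard_normal((h, w, c)), params)

# Attention over n tokens builds an n x n score matrix instead of (h*w) x (h*w).
full_scores = (h * w) ** 2
reduced_scores = n ** 2
```

With these illustrative sizes, each attention layer scores $n^2 = 64$ token pairs rather than $(hw)^2 = 65536$ position pairs, which is the sense in which decreasing the number of tokens lowers the attention cost.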