A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading

Junlai Qiu, Yunzhu Chen, Hao Zheng, Yawen Huang, Yuexiang Li

TL;DR

The paper addresses the performance bottleneck of single-backbone DR grading by introducing a CNN–ViT hybrid framework guided by evidence theory. It converts backbone features into evidence and opinions using Dirichlet-based representations and fuses them with trusted evidence rules to produce robust, uncertainty-aware predictions. The approach achieves state-of-the-art results on the APTOS and DRTiD DR datasets and generalizes to other medical imaging tasks, such as histopathology, with improved interpretability. Overall, the framework offers a principled method for multi-backbone fusion that leverages the complementary strengths of the backbones while explicitly modeling uncertainty to aid clinical decision-making.

Abstract

Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, significantly affecting their daily lives and mental health. To improve the efficiency of clinical screening and enable early detection of DR, a variety of automated DR diagnosis systems have recently been built on convolutional neural networks (CNNs) or vision Transformers (ViTs). However, owing to the inherent limitations of each architecture, the performance of existing methods that use a single type of backbone has reached a bottleneck. One potential route to further improvement is to integrate different kinds of backbones so as to leverage their respective strengths (i.e., the local feature extraction capability of CNNs and the global feature capturing ability of ViTs). To this end, we propose a novel paradigm that fuses the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidence via a set of deep evidential networks. From this supporting evidence, an aggregated opinion is formed, which is used to adaptively tune the fusion pattern between backbones and thereby boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading compared with state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
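
The abstract does not spell out how evidence becomes an opinion or how two opinions are merged. The sketch below illustrates one common formulation from subjective logic, which may differ from the authors' exact rules: non-negative evidence parameterizes a Dirichlet distribution, belief masses and an uncertainty mass are read off from it, and two opinions are merged with a reduced Dempster combination rule. The function names, the five-class setup, and the evidence values are hypothetical and chosen only for illustration; in the paper, evidence comes from deep evidential networks rather than hand-written numbers.

```python
from typing import List, Tuple

Opinion = Tuple[List[float], float]  # (belief mass per class, uncertainty mass)


def opinion_from_evidence(evidence: List[float]) -> Opinion:
    """Map non-negative per-class evidence to a subjective-logic opinion.

    Assumed formulation: alpha_k = e_k + 1, S = sum(alpha),
    b_k = e_k / S, u = K / S, so that sum(b) + u == 1.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]      # Dirichlet parameters
    strength = sum(alpha)                    # Dirichlet strength S
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def combine_opinions(first: Opinion, second: Opinion) -> Opinion:
    """Merge two opinions with a reduced Dempster combination rule."""
    b1, u1 = first
    b2, u2 = second
    agreement = sum(x * y for x, y in zip(b1, b2))
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(b1) * sum(b2) - agreement
    scale = 1.0 - conflict                   # > 0 because u1 and u2 are > 0
    belief = [(x * y + x * u2 + y * u1) / scale for x, y in zip(b1, b2)]
    uncertainty = (u1 * u2) / scale
    return belief, uncertainty


# Hypothetical evidence for five DR grades from two backbones.
cnn_opinion = opinion_from_evidence([0.0, 1.0, 8.0, 1.0, 0.0])  # confident
vit_opinion = opinion_from_evidence([0.0, 0.0, 0.5, 0.5, 0.0])  # uncertain
fused_belief, fused_uncertainty = combine_opinions(cnn_opinion, vit_opinion)
predicted_grade = max(range(len(fused_belief)), key=fused_belief.__getitem__)
```

Under this rule the more confident source dominates the fused belief, and the fused uncertainty is never larger than that of either input, which is one way an aggregated opinion could steer how much each backbone contributes.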

Paper Structure

This paper contains 12 sections, 8 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The overall flowchart of the proposed multi-backbone fusion framework based on evidence theory. Evidence and opinions are constructed from the features extracted at different stages of the CNN and ViT and are then used for feature fusion. The fusion of the last two stages of the CNN and ViT is shown as an example.
  • Figure 2: Density of the uncertainty of features yielded by different stages on APTOS.