A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Junlai Qiu, Yunzhu Chen, Hao Zheng, Yawen Huang, Yuexiang Li
TL;DR
The paper addresses the performance bottleneck in diabetic retinopathy (DR) grading by introducing a CNN–ViT hybrid framework guided by the theory of evidence. It converts backbone features into evidence and opinions using Dirichlet-based representations, then fuses them with evidential combination rules to produce robust, uncertainty-aware predictions. The approach achieves state-of-the-art results on the APTOS and DRTiD datasets and generalizes to other medical imaging tasks, such as histopathology, with improved interpretability. Overall, the framework offers a principled method for multi-backbone fusion that exploits the complementary strengths of the backbones while explicitly modeling uncertainty to aid clinical decision-making.
Abstract
Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, and it significantly affects their daily lives and mental health. To improve the efficiency of clinical screening and enable early detection of DR, a variety of automated DR diagnosis systems have recently been built on convolutional neural networks (CNNs) or vision Transformers (ViTs). However, because each architecture has its own inherent limitations, the performance of existing methods that rely on a single type of backbone has reached a bottleneck. One potential route to further improvement is to integrate different kinds of backbones so that their respective strengths can be fully exploited (i.e., the local feature extraction capability of the CNN and the global feature capturing ability of the ViT). To this end, we propose a novel paradigm that fuses the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from each backbone into supporting evidence via a set of deep evidential networks. From this supporting evidence an aggregated opinion is formed, which is used to adaptively tune the fusion pattern between the backbones and thereby boost the performance of the hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading compared with state-of-the-art frameworks, but also provides strong interpretability for feature fusion and decision-making.
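To make the evidence-to-opinion step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard subjective-logic mapping from non-negative class evidence to a Dirichlet-based opinion (belief masses plus an uncertainty mass) and the reduced Dempster combination rule commonly used in evidential multi-view fusion. The abstract does not state the exact fusion rule, so the rule, the function names `opinion_from_evidence` and `fuse_opinions`, and the evidence values are all illustrative assumptions.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief masses, uncertainty).

    Assumed mapping: Dirichlet parameters alpha_k = e_k + 1 and
    strength S = sum(alpha), giving b_k = e_k / S and u = K / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty


def fuse_opinions(opinion_a, opinion_b):
    """Combine two opinions with the reduced Dempster rule (an assumption)."""
    belief_a, u_a = opinion_a
    belief_b, u_b = opinion_b
    # Conflict: belief mass the two sources assign to different classes.
    agreement = sum(a * b for a, b in zip(belief_a, belief_b))
    conflict = sum(belief_a) * sum(belief_b) - agreement
    scale = 1.0 - conflict  # strictly positive because u_a and u_b are > 0
    fused_belief = [
        (a * b + a * u_b + b * u_a) / scale
        for a, b in zip(belief_a, belief_b)
    ]
    fused_uncertainty = (u_a * u_b) / scale
    return fused_belief, fused_uncertainty


# Hypothetical evidence over five DR grades from two backbones.
cnn_opinion = opinion_from_evidence([8.0, 1.0, 0.0, 0.0, 1.0])  # confident
vit_opinion = opinion_from_evidence([0.5, 0.5, 0.0, 0.0, 0.0])  # uncertain
fused_belief, fused_uncertainty = fuse_opinions(cnn_opinion, vit_opinion)
print(fused_belief, fused_uncertainty)
```

Under this rule, a backbone with high uncertainty contributes little to the fused belief, which is one way the fusion can adapt per sample, and the fused uncertainty never exceeds that of either source.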
