CeViT: Copula-Enhanced Vision Transformer in multi-task learning and bi-group image covariates with an application to myopia screening

Chong Zhong; Yang Li; Jinfeng Xu; Xiang Fu; Yunhao Liu; Qiuyi Huang; Danjuan Yang; Meiyan Li; Aiyi Liu; Alan H. Welsh; Xingtao Zhou; Bo Fu; Catherine C. Liu

TL;DR

CeViT introduces a copula-enhanced Vision Transformer to jointly predict high-myopia status and axial length from paired ultrawide-field fundus images. The model uses a bi-channel ViT with a shared encoder and eye-specific MLP heads, paired with a 4-dimensional mixed copula loss to capture conditional dependence among discrete-continuous responses across eyes. The authors provide a latent-representation-based theoretical framework, proving consistency of copula parameter estimates and asymptotic equivalence to MLE, along with a relative-efficiency gain over empirical loss. Empirical results on a real UWF fundus dataset show improved AL prediction and competitive high-myopia classification, while simulations corroborate the method’s robustness and generalization. Overall, CeViT offers a scalable, interpretable approach to multi-task myopia screening that leverages interocular information and latent dependence structures, with practical implications for transfer learning and efficient fine-tuning of large Vision Transformers.

Abstract

We aim to assist image-based myopia screening by resolving two longstanding problems: "how to integrate the information of ocular images of a pair of eyes" and "how to incorporate the inherent dependence among high-myopia status and axial length for both eyes." The classification-regression task is modeled as a novel 4-dimensional multi-response regression, in which discrete responses are allowed, on two dependent 3rd-order tensors (3D ultrawide-field fundus images). We present a Vision Transformer-based bi-channel architecture, named CeViT, in which the common features of a pair of eyes are extracted via a shared Transformer encoder and the interocular asymmetries are modeled through separate multilayer perceptron heads. Statistically, we model the conditional dependence among the mixed discrete and continuous responses, given the image covariates, by a so-called copula loss. We establish a new theoretical framework for fine-tuning on CeViT based on latent representations, which makes the black-box fine-tuning procedure interpretable and guarantees higher relative efficiency of fine-tuning weight estimation in the asymptotic setting. We apply CeViT to an annotated ultrawide-field fundus image dataset collected by Shanghai Eye & ENT Hospital, demonstrating that CeViT improves on the baseline model in both the accuracy of classifying high myopia and the prediction of axial length (AL) for both eyes.
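The bi-channel idea described in the abstract (one shared encoder for both eyes, one separate MLP head per eye, four outputs in total) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shared Transformer encoder is replaced by a single linear map with a `tanh` as a stand-in, and every name, shape, and initialization below (`shared_encoder`, `mlp_head`, `bi_channel_forward`, the 8x8x3 toy images) is a hypothetical choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(x, W_enc):
    """Stand-in for the shared Transformer encoder (assumption: a linear
    map plus tanh on the flattened image, NOT an actual ViT)."""
    flat = x.reshape(x.shape[0], -1)
    return np.tanh(flat @ W_enc)

def init_head(d_in, d_hidden, rng):
    """Hypothetical one-hidden-layer MLP head with two outputs."""
    return (rng.normal(scale=0.1, size=(d_in, d_hidden)), np.zeros(d_hidden),
            rng.normal(scale=0.1, size=(d_hidden, 2)), np.zeros(2))

def mlp_head(h, W1, b1, W2, b2):
    """Eye-specific head: returns [AL prediction, high-myopia logit]."""
    z = np.maximum(h @ W1 + b1, 0.0)  # ReLU hidden layer
    return z @ W2 + b2

def bi_channel_forward(x_left, x_right, W_enc, head_left, head_right):
    # The SAME encoder weights process both eyes (common features).
    h_left = shared_encoder(x_left, W_enc)
    h_right = shared_encoder(x_right, W_enc)
    # SEPARATE heads model interocular asymmetry.
    out_left = mlp_head(h_left, *head_left)
    out_right = mlp_head(h_right, *head_right)
    # Four responses per subject: (AL_left, logit_left, AL_right, logit_right).
    return np.concatenate([out_left, out_right], axis=1)

# Toy data: a batch of 3 subjects, each with a pair of 8x8x3 "images".
batch, H, W, C, d_feat = 3, 8, 8, 3, 16
x_left = rng.normal(size=(batch, H, W, C))
x_right = rng.normal(size=(batch, H, W, C))
W_enc = rng.normal(scale=0.05, size=(H * W * C, d_feat))
head_left = init_head(d_feat, 8, rng)
head_right = init_head(d_feat, 8, rng)

outputs = bi_channel_forward(x_left, x_right, W_enc, head_left, head_right)
print(outputs.shape)
```

In this sketch the four output columns play the role of the 4-dimensional response; a dependence-aware loss such as the copula loss would then be applied jointly to these four outputs rather than to each column separately.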

Paper Structure (11 sections, 5 theorems, 23 equations, 4 figures, 1 table)

Introduction
CeViT: architecture, loss, and algorithm
Bi-channel ViT architecture
Four-dim mixed copula loss
Feasible CeViT algorithm
Theory
Illustrative data generation model
Statistical inference of fine-tuning on CeViT
Application on the UWF fundus image dataset
Simulation
Discussion

Key Result

Theorem 1

Under the mean regression model and the marginal distribution model, the joint log density of $\bm{y} \mid (\mathcal{X}_1, \mathcal{X}_2)$ is expressed in terms of $\widetilde{\bm{\mu}} = \Gamma_{21}\Gamma_{11}^{-1} (\sigma_1^{-1}(y_1 - \mu_1), \sigma_2^{-1}(y_3 - \mu_2))^{\mathsf{T}} := (\widetilde{\mu}_1, \widetilde{\mu}_2)^{\mathsf{T}}$ and $\widetilde{V} = \Gamma_{22} - \Gamma_{21} \Gamma_{11}^{-1} \Gamma_{\cdots}$ ...
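The quantities $\widetilde{\bm{\mu}}$ and $\widetilde{V}$ in Theorem 1 have the shape of the conditional mean and covariance of a partitioned Gaussian vector. The sketch below computes them numerically under explicit assumptions: the 4x4 correlation matrix `Gamma` is ordered with the two continuous components first, `z_cont` holds the standardized residuals $(\sigma_1^{-1}(y_1-\mu_1), \sigma_2^{-1}(y_3-\mu_2))$, and, because the last factor of $\widetilde{V}$ is truncated in the source, the code assumes the standard Schur complement $\Gamma_{22} - \Gamma_{21}\Gamma_{11}^{-1}\Gamma_{12}$. The function name and the example matrix are hypothetical.

```python
import numpy as np

def conditional_moments(Gamma, z_cont):
    """Conditional mean and covariance of the last two coordinates of a
    zero-mean Gaussian with correlation matrix Gamma, given the first two.

    Assumption: V_tilde is the standard Schur complement; the source
    formula is truncated, so this is not confirmed by the paper text.
    """
    G11, G12 = Gamma[:2, :2], Gamma[:2, 2:]
    G21, G22 = Gamma[2:, :2], Gamma[2:, 2:]
    # A = G21 @ inv(G11), computed with a linear solve for stability.
    A = np.linalg.solve(G11.T, G21.T).T
    mu_tilde = A @ z_cont
    V_tilde = G22 - A @ G12
    return mu_tilde, V_tilde

# Hypothetical correlation matrix (strictly diagonally dominant, hence
# positive definite) and hypothetical standardized residuals.
Gamma = np.array([
    [1.0, 0.4, 0.3, 0.2],
    [0.4, 1.0, 0.2, 0.3],
    [0.3, 0.2, 1.0, 0.4],
    [0.2, 0.3, 0.4, 1.0],
])
z_cont = np.array([0.5, -1.0])

mu_tilde, V_tilde = conditional_moments(Gamma, z_cont)
print(mu_tilde)
print(V_tilde)
```

Conditioning on the continuous responses shrinks the variance of the remaining components, which is why the diagonal of `V_tilde` never exceeds the corresponding diagonal of `Gamma`.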

Figures (4)

Figure 1: The bi-channel architecture of proposed CeViT.
Figure 2: Boxplots of evaluation metrics among 4 runs of 5-fold CV on the UWF fundus image dataset. The average metrics are labeled as polylines.
Figure 3: Win count of CeViT and CeViT-A compared with the baseline ViT on different evaluation metrics.
Figure 4: Boxplots of evaluation metrics on synthetic datasets. The average metrics are labeled as polylines.

Theorems & Definitions (5)

Theorem 1
Lemma 1: Latent sufficient representation
Theorem 2: Copula consistency
Theorem 3: Asymptotic MLE equivalence
Theorem 4: Relative efficiency
