Human Knowledge Integrated Multi-modal Learning for Single Source Domain Generalization

Ayan Banerjee, Kuntal Thakur, Sandeep Gupta

Abstract

Generalizing image classification across domains remains challenging in critical tasks such as fundus image-based diabetic retinopathy (DR) grading and resting-state fMRI seizure onset zone (SOZ) detection. Cross-domain generalization is difficult when domains differ in unknown causal factors, and no established methodology objectively assesses such differences without metadata or protocol-level information from the data collectors, which is typically inaccessible. We first introduce domain conformal bounds (DCB), a theoretical framework for evaluating whether domains diverge in unknown causal factors. Building on this, we propose GenEval, a multimodal vision-language model (VLM) approach that combines foundational models (e.g., MedGemma-4B) with human knowledge via Low-Rank Adaptation (LoRA) to bridge causal gaps and enhance single-source domain generalization (SDG). Across eight DR and two SOZ datasets, GenEval achieves superior SDG performance, with average accuracy of 69.2% (DR) and 81% (SOZ), outperforming the strongest baselines by 9.4% and 1.8%, respectively.

Paper Structure

This paper contains 34 sections, 3 theorems, 13 equations, 4 figures, 17 tables, 2 algorithms.

Key Result

Theorem 1

For any data point $(X, Y) \in D^s$, $\Pr\big(\rho(\mathcal{K}(X), D^s) \in C\big) \geq 1 - \alpha$ with $C = [\sigma_C - d, \sigma_C + d]$ and $\alpha > 0$, if and only if $\Pr(\mathcal{K}(X) \mid \{X\} \in D^s) = \Pr(\mathcal{K}(X) \mid \{X\} \in D^t)$, where $d$ is given by Algorithm \ref{alg:Prop}. (Proof in supplement.)

Figures (4)

  • Figure 1: SDG attempts for DR have traditionally failed to consistently outperform ERM, primarily because of gaps in causal factors among datasets. Knowledge from human experts in the field can fill the causal gap but is usually qualitative and ambiguous. This paper presents GenEval, which quantifies and refines human expert knowledge and integrates it with a specialized foundational model such as MedGemma-4B through LoRA-based fine-tuning, bridging the causal gap between domains while removing ambiguity in the knowledge.
  • Figure 2: Contributions: a) Algorithms \ref{alg:Prop} and \ref{alg:U2}, which obtain the domain conformal bounds and the source domain conformance degree (SDCD). Using SDCD, ablation studies refine the knowledge set into an optimal set that maximizes SDCD. b) GenEval, a method based on a specialized foundational vision-language model that integrates images and expert knowledge in textual form for multimodal, prompt-driven DR grading and SOZ detection.
  • Figure 3: Variation of the SDCD metric with pixel signal-to-noise ratio due to instability in the Mahalanobis distance, and its effect on the correlation of SDCD with accuracy given by Lemma \ref{lem:SDCD}.
  • Figure 4: MedGemma-4B LoRA fine-tuning overview.
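
The LoRA mechanism named in Figure 4 can be sketched in a few lines. This is a generic, framework-free illustration of the low-rank update $W + \frac{\alpha}{r} B A$ on a single linear layer, not the paper's MedGemma-4B training setup; the class name `LoRALinear`, the helper `matmul`, and the toy dimensions are hypothetical.

```python
import random


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen base weight, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B with zeros, so the adapted
        # layer initially behaves exactly like the base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * dw for w, dw in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy layer: 4 outputs, 6 inputs, rank-2 adapter.
base = [[float(i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(base, rank=2, alpha=4.0)
print("full parameters:", 4 * 6)
print("trainable LoRA parameters:", layer.trainable_params())
```

Only `A` and `B` would receive gradient updates during fine-tuning, which is why the trainable parameter count scales with `rank * (in_dim + out_dim)` instead of `in_dim * out_dim`. In practice a library such as Hugging Face PEFT would attach these adapters to selected layers of the model; that wiring is omitted here.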

Theorems & Definitions (3)

  • Theorem 1
  • Lemma 1
  • Lemma 2