A Novel Hierarchical Integration Method for Efficient Model Merging in Medical LLMs
Prakrit Timilsina, Anuj Nepal, Rajan Kadel, Robin Doss
TL;DR
This work addresses the challenge of efficiently consolidating medical expertise across distributed edge settings by evaluating six parameter-space merging techniques on architecturally compatible medical LLMs. It introduces a novel Hierarchical Cosine-OT-LERP method that combines task-vector similarity with selective attention-head alignment to mitigate permutation variance while preserving edge-deployment efficiency. Across five medical benchmarks, simple merging methods, especially Task Arithmetic and Linear Averaging, consistently outperform more complex approaches, achieving up to 45.80% accuracy on MedQA and often surpassing the base model on QA tasks. The findings suggest a practical path for privacy-preserving, scalable medical AI in IoT-enabled environments, highlighting compatibility-aware design and favoring lightweight merging baselines over retraining in resource-constrained settings.
Abstract
Large Language Models (LLMs) face significant challenges in distributed healthcare, including consolidating specialized domain knowledge across institutions while maintaining privacy, reducing computational overhead, and preventing catastrophic forgetting during model updates. This paper presents a systematic evaluation of six parameter-space merging techniques applied to two architecturally compatible medical LLMs derived from the Mistral-7B base model. We introduce a novel hierarchical method that combines selective Optimal Transport (OT) alignment for attention layers with cosine similarity-weighted interpolation, designed to address permutation variance while minimizing computational overhead for edge deployment scenarios. Our study evaluates Task Arithmetic, Linear Averaging, DARE-TIES, DELLA, Breadcrumbs, and our Hierarchical approach across five medical benchmarks. Results demonstrate that architecturally compatible models benefit significantly from simple averaging methods, with Task Arithmetic achieving 45.80% accuracy on MedQA, outperforming complex pruning-based approaches. These findings offer critical insights for the deployment of distributed medical AI in resource-constrained IoT environments, where computational efficiency and model compatibility are paramount. Our work establishes that for architecturally compatible models, simple averaging provides a robust and computationally efficient baseline for knowledge consolidation, offering a pragmatic path forward for scalable medical AI systems.
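To make the simple baselines named in the abstract concrete, the following is a minimal sketch of Linear Averaging and Task Arithmetic over toy parameter dictionaries, plus one possible form of cosine similarity-weighted interpolation. It is not the authors' implementation: the function names, the `scale` parameter, and in particular the mapping from cosine similarity to the interpolation factor are assumptions made for illustration, and the selective OT alignment step is not shown.

```python
import numpy as np

def linear_average(a, b):
    """Element-wise mean of two parameter dicts with identical keys and shapes."""
    return {k: (a[k] + b[k]) / 2.0 for k in a}

def task_arithmetic(base, models, scale=1.0):
    """Add the scaled sum of task vectors (model - base) back onto the base.

    `scale` is an illustrative hyperparameter, not a value from the paper.
    """
    merged = {}
    for k in base:
        delta = sum(m[k] - base[k] for m in models)
        merged[k] = base[k] + scale * delta
    return merged

def cosine_weighted_lerp(base, a, b, eps=1e-12):
    """Per-tensor LERP between a and b, weighted by task-vector agreement.

    ASSUMPTION: the factor t = 0.25 * (1 + cos) is a hypothetical choice.
    It blends equally (t = 0.5) when the task vectors agree and falls back
    to model `a` (t = 0) when they point in opposite directions.
    """
    merged = {}
    for k in base:
        ta = (a[k] - base[k]).ravel()
        tb = (b[k] - base[k]).ravel()
        cos = float(ta @ tb / (np.linalg.norm(ta) * np.linalg.norm(tb) + eps))
        t = 0.25 * (1.0 + cos)
        merged[k] = (1.0 - t) * a[k] + t * b[k]
    return merged

# Toy example: one 2-element "layer" per model.
base = {"w": np.array([0.0, 0.0])}
model_a = {"w": np.array([1.0, 0.0])}
model_b = {"w": np.array([0.0, 1.0])}

avg = linear_average(model_a, model_b)
ta_merged = task_arithmetic(base, [model_a, model_b], scale=0.5)
lerp = cosine_weighted_lerp(base, model_a, model_b)
```

With two fine-tuned models and `scale=0.5`, Task Arithmetic reduces to the same result as Linear Averaging in this toy setting. The two differ once the scale changes or more task vectors are added.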
