HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs
Majid Jaberi-Douraki, Hossein Sholehrasa, Xuan Xu, Remya Ampadi Ramachandran
TL;DR
HySim-LLM addresses domain shift in pharmacokinetic data by formalizing embedding-based similarity weighting and manifold-aware denoising for LLM fine-tuning. It derives two results. The first is a similarity-weighted generalization bound that controls $L_T(\theta_\omega) - L_T(\theta_0)$ in terms of the divergence $D_{\chi}(p_T \| p_S)$, the embedding error $\epsilon_{embed}$, and the optimization gap. The second is a manifold-denoising bound that limits the contribution of noisy samples to $O(\sigma \sqrt{d})$, expressed via $d_{\mathcal{M}}(\tilde{\mu}(x))$. Both are integrated into HySim-LLM through a hybrid weight $\omega_i^{hybrid} = \omega_i \omega_i^{clean}$, and the framework is validated conceptually on the AutoPK dataset. This work provides a principled, interpretable path toward reliable, data-driven pharmacokinetic data extraction and broader structured biomedical applications.
Abstract
The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.
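To make the hybrid weighting $\omega_i^{hybrid} = \omega_i \omega_i^{clean}$ concrete, the following is a minimal sketch and not the authors' implementation. The summary above does not specify how $\omega_i$ or $\omega_i^{clean}$ are computed, so every functional form here is an assumption: the similarity weight is taken as a softmax over cosine similarity to the target centroid, and the clean weight as a Gaussian kernel on the distance to a linear (PCA) approximation of the target manifold. The function names, `temperature`, `n_components`, and `sigma` are all hypothetical.

```python
import numpy as np


def similarity_weights(source_emb, target_emb, temperature=1.0):
    """Assumed form of omega_i: softmax over cosine similarity
    between each source embedding and the target centroid."""
    centroid = target_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    normed = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    logits = (normed @ centroid) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def clean_weights(source_emb, target_emb, n_components=2, sigma=1.0):
    """Assumed form of omega_i^clean: Gaussian kernel on the residual
    distance to a PCA subspace fitted on the target embeddings, used
    here as a crude linear stand-in for the manifold distance."""
    mean = target_emb.mean(axis=0)
    # Rows of vt are the principal directions of the centered target data.
    _, _, vt = np.linalg.svd(target_emb - mean, full_matrices=False)
    basis = vt[:n_components]                  # shape (k, d)
    centered = source_emb - mean
    projected = centered @ basis.T @ basis     # projection onto the subspace
    dist = np.linalg.norm(centered - projected, axis=1)
    return np.exp(-dist ** 2 / (2.0 * sigma ** 2))


def hybrid_weights(source_emb, target_emb):
    """omega_i^hybrid = omega_i * omega_i^clean (the product from the TL;DR)."""
    return similarity_weights(source_emb, target_emb) * clean_weights(
        source_emb, target_emb
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy target data lying on a 2-D plane inside a 5-D embedding space.
    target = np.zeros((50, 5))
    target[:, :2] = rng.normal(size=(50, 2)) + np.array([2.0, 1.0])
    # Two source samples: one on the plane, one displaced off it.
    source = np.array([[1.0, 0.5, 0.0, 0.0, 0.0],
                       [1.0, 0.5, 3.0, 0.0, 0.0]])
    print(hybrid_weights(source, target))
```

In a weighted fine-tuning loop, these values would multiply the per-sample losses, so that source samples that are both close to the target distribution and close to the estimated manifold dominate the objective. The PCA subspace is only a convenient placeholder; any estimator of the manifold distance could be substituted.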
