HySim-LLM: Embedding-Weighted Fine-Tuning Bounds and Manifold Denoising for Domain-Adapted LLMs

Majid Jaberi-Douraki, Hossein Sholehrasa, Xuan Xu, Remya Ampadi Ramachandran

TL;DR

HySim-LLM addresses domain shift in pharmacokinetic data by formalizing embedding-based similarity weighting and manifold-aware denoising for LLM fine-tuning. It derives two results: a similarity-weighted generalization bound on $L_T(\theta_\omega) - L_T(\theta_0)$ in terms of the divergence $D_{\chi}(p_T \| p_S)$, the embedding error $\epsilon_{embed}$, and the optimization gap; and a manifold-denoising bound that limits the contribution of noisy samples to $O(\sigma \sqrt{d})$ using $d_{\mathcal{M}}(\tilde{\mu}(x))$. Both results are combined in HySim-LLM through a hybrid weight $\omega_i^{hybrid} = \omega_i \omega_i^{clean}$, and the framework is validated conceptually on the AutoPK dataset. This work provides a principled, interpretable path for reliable, data-driven pharmacokinetic data extraction and broader structured biomedical applications.

Abstract

The extraction and standardization of pharmacokinetic (PK) information from scientific literature remain significant challenges in computational pharmacology, which limits the reliability of data-driven models in drug development. Large language models (LLMs) have achieved remarkable progress in text understanding and reasoning, yet their adaptation to structured biomedical data, such as PK tables, remains constrained by heterogeneity, noise, and domain shift. To address these limitations, we propose HySim-LLM, a unified mathematical and computational framework that integrates embedding-weighted fine-tuning and manifold-aware denoising to enhance the robustness and interpretability of LLMs. We establish two theoretical results: (1) a similarity-weighted generalization bound that quantifies adaptation performance under embedding divergence, and (2) a manifold-based denoising guarantee that bounds loss contributions from noisy or off-manifold samples. These theorems provide a principled foundation for fine-tuning LLMs in structured biomedical settings. The framework offers a mathematically grounded pathway toward reliable and interpretable LLM adaptation for biomedical and data-intensive scientific domains.
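
To make the hybrid weight $\omega_i^{hybrid} = \omega_i \omega_i^{clean}$ from the TL;DR more concrete, the following is a minimal NumPy sketch of how the two factors might be computed and combined. Only the product form comes from the text above. The functional forms are illustrative assumptions, not the authors' definitions: `similarity_weights` uses a softmax over cosine similarity to the target centroid, `manifold_distance` approximates $d_{\mathcal{M}}$ by the residual to a PCA subspace fitted on reference embeddings, `clean_weights` applies a Gaussian kernel with scale `sigma`, and the final renormalization is a convenience. All function and parameter names are hypothetical.

```python
import numpy as np

def similarity_weights(source_emb, target_emb, temperature=0.1):
    """Assumed form of omega_i: softmax over cosine similarity to the target centroid."""
    centroid = target_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    unit = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    logits = (unit @ centroid) / temperature
    logits = logits - logits.max()          # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

def manifold_distance(emb, reference_emb, n_components=2):
    """Assumed stand-in for d_M: residual norm to a PCA subspace of the reference set."""
    mean = reference_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(reference_emb - mean, full_matrices=False)
    basis = vt[:n_components]               # rows span the fitted subspace
    centered = emb - mean
    projected = centered @ basis.T @ basis
    return np.linalg.norm(centered - projected, axis=1)

def clean_weights(distances, sigma=1.0):
    """Assumed form of omega_i^clean: Gaussian kernel that down-weights off-manifold samples."""
    return np.exp(-(distances ** 2) / (2.0 * sigma ** 2))

def hybrid_weights(omega, omega_clean):
    """Product form from the text; renormalizing to sum to 1 is an added assumption."""
    w = omega * omega_clean
    return w / w.sum()

def weighted_loss(per_sample_loss, weights):
    """Weighted fine-tuning objective: sum_i w_i * loss_i."""
    return float(np.sum(weights * per_sample_loss))
```

In this sketch, a source sample contributes strongly to the fine-tuning loss only when it is both similar to the target domain and close to the estimated embedding manifold; either factor alone can suppress it.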

Paper Structure

This paper contains 15 sections, 7 equations, 1 figure, 1 table, 2 algorithms.

Figures (1)

  • Figure 1: Comparison between the raw published PK table and its automatically extracted structured representation from the AutoPK dataset. This illustrates the transformation from unstructured scientific table formats into standardized, analysis-ready tabular data used for dataset curation and model evaluation.