Unveiling LLM Mechanisms Through Neural ODEs and Control Theory
Yukun Zhang, Qi Dong
TL;DR
The paper addresses the interpretability gap in large language models (LLMs) by coupling Neural Ordinary Differential Equations (Neural ODEs) with robust control theory. Inputs and outputs are mapped into a latent space, a Neural ODE models the continuous-time dynamics between them, and an added control term steers those dynamics toward output quality criteria. The framework is designed to yield more stable, interpretable predictions and a principled way to control LLM behavior. Experiments on diverse QA datasets show that Neural ODEs capture the dynamic input-output transformation and that multiple control strategies improve consistency and reliability, especially in high-stakes tasks. The authors present the combined approach as a step toward explainable AI for LLMs, with transparent reasoning pathways and controllable generation in dynamic, real-world environments.
Abstract
This paper proposes a framework combining Neural Ordinary Differential Equations (Neural ODEs) and robust control theory to enhance the interpretability and control of large language models (LLMs). By utilizing Neural ODEs to model the dynamic evolution of input-output relationships and introducing control mechanisms to optimize output quality, we demonstrate the effectiveness of this approach across multiple question-answer datasets. Experimental results show that the integration of Neural ODEs and control theory significantly improves output consistency and model interpretability, advancing the development of explainable AI technologies.
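To make the idea of a controlled latent ODE concrete, the following is a minimal sketch, not the authors' implementation. It assumes latent dynamics of the form dh/dt = f(h) + u(h), where f is a stand-in vector field (a single tanh layer with random weights) and u is a simple proportional feedback term pulling the state toward a target latent vector. The abstract does not specify the control law, the network, or the solver, so the names `dynamics`, `control`, `integrate`, `gain`, and `h_target` are all hypothetical, and fixed-step Euler integration is used only for brevity.

```python
import numpy as np

def dynamics(h, W, b):
    """Hypothetical learned vector field f(h) = tanh(W h + b)."""
    return np.tanh(W @ h + b)

def control(h, h_target, gain):
    """Assumed proportional feedback u(h) = -gain * (h - h_target)."""
    return -gain * (h - h_target)

def integrate(h0, W, b, h_target, gain, t_end=1.0, steps=100):
    """Fixed-step Euler integration of dh/dt = f(h) + u(h) on [0, t_end]."""
    h = np.asarray(h0, dtype=float).copy()
    dt = t_end / steps
    for _ in range(steps):
        h = h + dt * (dynamics(h, W, b) + control(h, h_target, gain))
    return h

rng = np.random.default_rng(0)
dim = 4
W = 0.5 * rng.standard_normal((dim, dim))  # random weights, not trained
b = 0.1 * rng.standard_normal(dim)

h0 = np.zeros(dim)            # stand-in for an encoded question
h_target = np.full(dim, 3.0)  # stand-in for an encoded reference answer

h_free = integrate(h0, W, b, h_target, gain=0.0)  # no control term
h_ctrl = integrate(h0, W, b, h_target, gain=5.0)  # with feedback control

err_free = np.linalg.norm(h_free - h_target)
err_ctrl = np.linalg.norm(h_ctrl - h_target)
print(f"distance to target without control: {err_free:.3f}")
print(f"distance to target with control:    {err_ctrl:.3f}")
```

Because tanh bounds the uncontrolled vector field, the free trajectory can move only a limited distance in unit time, while the feedback term contracts the gap to the target at a rate set by `gain`. In a real system f would be trained, and an adaptive ODE solver would normally replace the Euler loop.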
