Fine-Tune Language Models as Multi-Modal Differential Equation Solvers
Liu Yang, Siting Liu, Stanley J. Osher
TL;DR
This work reframes in-context operator learning as a multi-modal task by introducing captions that encode human knowledge about operators, and presents ICON-LM, a language-model-like transformer trained with a next-function prediction objective to map any condition $C$ to a quantity of interest (QoI) $Q$ across multiple prompts. Compared to the baseline encoder–decoder ICON and classic operator learners such as FNO and DeepONet, ICON-LM achieves superior data efficiency and generalization, especially in few-shot settings, and benefits further from the caption modality, with precise captions yielding the strongest gains. The approach enables end-to-end fine-tuning of language models as scientific differential equation solvers, reduces the need for bespoke losses or architectures, and extends the language-model ecosystem to heavy numerical computation tasks. The results demonstrate that human-guided, multi-modal inputs can significantly improve operator learning under limited data, suggesting a scalable path for integrating domain knowledge into scientific ML systems.
Abstract
In the growing domain of scientific machine learning, in-context operator learning has shown notable potential for building foundation models: in this framework, the model is trained to learn operators and solve differential equations from prompted data at inference time, without weight updates. However, the current model's overdependence on function data overlooks invaluable human insight into the operator. To address this, we transform in-context operator learning into a multi-modal paradigm. In particular, taking inspiration from the recent success of large language models, we propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. We also introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks and demonstrate the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly advances the in-context operator learning paradigm, but also opens a new path for the application of language models.
