Fine-Tune Language Models as Multi-Modal Differential Equation Solvers

Liu Yang, Siting Liu, Stanley J. Osher

TL;DR

This work reframes in-context operator learning as a multi-modal task by introducing captions that encode human knowledge about operators, and presents ICON-LM, a language-model–like transformer trained with a next-function prediction objective to map any condition $C$ to a quantity of interest (QoI) $Q$ across multiple prompts. Compared to the baseline encoder–decoder ICON and classic operator learners like FNO and DeepONet, ICON-LM achieves better data efficiency and generalization, especially in few-shot settings, and benefits further from the caption modality, with precise captions yielding the strongest gains. The approach enables end-to-end fine-tuning of language models as solvers for scientific differential equations, reduces the need for bespoke losses or architectures, and extends the language-model ecosystem to heavy numerical computation tasks. The results indicate that human-guided, multi-modal inputs can significantly improve operator learning under limited data, suggesting a scalable path for integrating domain knowledge into scientific ML systems.

Abstract

In the growing domain of scientific machine learning, in-context operator learning has shown notable potential for building foundation models: in this framework, the model is trained to learn operators and solve differential equations from prompted data at inference time, without weight updates. However, the current model's overdependence on function data overlooks valuable human insight into the operator. To address this, we transform in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. We also introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks and demonstrate the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly advances the in-context operator learning paradigm but also opens a new path for the application of language models.

Paper Structure

This paper contains 22 sections, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Diagram for multi-modal in-context operator learning.
  • Figure 2: Depiction of the input/output sequence and model architecture of ICON-LM. The connections in the transformer block are a simplified illustration of the attention mask.
  • Figure 3: The transformer mask for ICON-LM with three condition-QoI pairs. White cells represent ones, and grey cells represent zeros.
  • Figure 4: Comparison of ICON-LM (ours) and encoder-decoder ICON for single-modal in-context operator learning. We calculate the relative testing error averaged over all 19 types of problems, and take the mean and standard deviation over three runs, shown as the solid line and the shaded area, respectively.
  • Figure 5: Comparison of ICON-LM against FNO and DeepONet. (a) Relative error during fine-tuning of FNO and DeepONet. (b) Prediction for a testing operator close to the mean operator. (c) Prediction for a testing operator far from the mean operator. (d) Five examples and the question condition for the testing operator in (c).
  • ...and 3 more figures
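
To make the idea of a 0/1 transformer mask over condition-QoI pairs (Figure 3) more concrete, here is a minimal sketch of a generic block-causal mask. It is an illustration only and is not the exact ICON-LM mask, which is defined in the paper and may include additional token groups. The function name `block_causal_mask` and the parameters `num_pairs`, `cond_len`, and `qoi_len` are hypothetical, and the rule used here (each token attends to its own block and to all earlier blocks) is an assumption chosen for simplicity.

```python
import numpy as np


def block_causal_mask(num_pairs, cond_len, qoi_len):
    """Build a 0/1 attention mask over alternating condition/QoI blocks.

    Assumed rule (illustrative, not the paper's exact mask): a token may
    attend to every token in its own block and in all earlier blocks.
    """
    # Sequence layout: [cond_1, qoi_1, cond_2, qoi_2, ...]
    block_lens = [cond_len, qoi_len] * num_pairs
    # Assign each token the index of the block it belongs to.
    block_ids = np.repeat(np.arange(len(block_lens)), block_lens)
    # mask[i, j] == 1 means query token i may attend to key token j.
    return (block_ids[:, None] >= block_ids[None, :]).astype(np.int32)


# Three condition-QoI pairs, two tokens per condition and per QoI.
mask = block_causal_mask(num_pairs=3, cond_len=2, qoi_len=2)
print(mask)
```

With ones read as "attention allowed" and zeros as "attention blocked", the printed matrix has the same kind of block structure that Figure 3 describes with white and grey cells. Under this assumed rule, later pairs can use earlier pairs as in-context examples, while earlier pairs cannot see later ones.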