Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam

TL;DR

The paper investigates how decoder-only and encoder-only transformer models perform in cross-modal adaptation for time-dependent PDE simulations, revealing that decoder-only models underperform with standard approaches and scaling alone does not close the gap. It introduces two bidirectionality-mimicking strategies, Parallel Flipping and Sequence Doubling, which improve decoder-only performance across PDEBench tasks and, in many cases, close the gap to encoder-only models. The findings suggest that autoregressive limitations hinder cross-modal adaptation, and that bidirectionality-inspired techniques can expand the set of viable models for scientific ML. The work also discusses tradeoffs in compute and memory, and highlights directions for future research on stability and more fundamental bidirectional mechanisms.

Abstract

Large language models have shown great success on natural language tasks in recent years, but they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Even though decoder-only models are more popular within NLP and scale exceedingly well at generating natural language, most proposed approaches for cross-modal adaptation focus on encoder-only models, raising the question of how model architecture affects these approaches. In this paper, we therefore perform a series of ablation studies to answer this question, systematically comparing encoder-only and decoder-only models on cross-modal adaptation for time-dependent simulation tasks based on partial differential equations (PDEs). We find that decoder-only models are far worse than encoder-only models when existing approaches are applied unmodified. In contrast to several other domains, scaling decoder-only models also does not help. To harness the potential of decoder-only models in this context, we introduce two novel approaches, Parallel Flipping and Sequence Doubling, which attempt to mimic bidirectionality in autoregressive models. Both our methods improve overall performance using decoder-only models for all tasks and all cross-modal adaptation methods, closing the gap to encoder-only model performance. We hope that our findings broaden the spectrum of models used on cross-modal adaptation tasks to further scientific ML.

Paper Structure

This paper contains 27 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Cross-modal adaptation of GPT-2, a decoder-only model, with ORCA-based adaptation on the Advection dataset of time-dependent PDE simulation. Although the original setup shows high error, our proposed methods (Parallel Flipping and Sequence Doubling) close the gap to encoder-only model performance.
  • Figure 2: Comparison of model performance with ORCA- (above) and FPT-based (below) cross-modal adaptation, using both pre-trained and randomly-initialized versions of encoder-only models (RoBERTa, BERT) and decoder-only models (GPT-2 and Pythia). Performance is measured using nRMSE, where lower is better; the plots show average performance over 5 random seeds, and the error bars represent the best and worst runs.
  • Figure 3: Performance of different model sizes from the GPT-2 and Pythia families using both ORCA (Shen et al., 2023) and FPT (Lu et al., 2022). The plots depict the average performance over 5 random seeds. Once again, performance is measured using nRMSE, where lower is better. If scaling the models improved performance, the curves for each model family would trend downward.
  • Figure 4: Pipeline comparison of the original setup and the two methods we introduce, Parallel Flipping and Sequence Doubling. For Parallel Flipping, the pipeline is run twice, once with the original data and once with the inverted sequences. For Sequence Doubling, each sequence is concatenated with itself before being fed to the model, and only the second half of the last hidden layer is passed to the predictor.
  • Figure 5: Performance comparison of the original setup versus our two methods, Parallel Flipping and Sequence Doubling, using both ORCA (Shen et al., 2023) and FPT (Lu et al., 2022). We set RoBERTa with the original setup as a baseline for all the configurations. The plots depict the average performance over 5 random seeds. Performance is measured using nRMSE, where lower is better.
  • ...and 2 more figures
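
The two methods described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: `causal_prefix_mean` is a hypothetical toy stand-in for a decoder-only model (its output at position i depends only on positions 0..i), tokens are plain floats rather than embeddings, and the averaging step in `parallel_flipping` is an assumption, since the caption says only that the pipeline is run twice and does not state how the two runs are combined.

```python
from typing import Callable, List

Tokens = List[float]
CausalModel = Callable[[Tokens], Tokens]


def causal_prefix_mean(tokens: Tokens) -> Tokens:
    """Hypothetical toy 'decoder-only model': output i sees only tokens[0..i]."""
    outputs: Tokens = []
    running_sum = 0.0
    for i, tok in enumerate(tokens):
        running_sum += tok
        outputs.append(running_sum / (i + 1))
    return outputs


def sequence_doubling(model: CausalModel, tokens: Tokens) -> Tokens:
    """Feed the sequence concatenated with itself; keep only the second half.

    Every position in the second copy is preceded by a full copy of the
    sequence, so a causal model can use the whole input at each kept position.
    """
    hidden = model(tokens + tokens)
    return hidden[len(tokens):]


def parallel_flipping(model: CausalModel, tokens: Tokens) -> Tokens:
    """Run the model on the original and on the reversed sequence.

    ASSUMPTION: the two runs are combined by averaging after re-aligning the
    reversed run; the source does not specify the combination rule.
    """
    forward = model(tokens)
    backward = model(tokens[::-1])[::-1]  # flip outputs back to original order
    return [(f + b) / 2 for f, b in zip(forward, backward)]


if __name__ == "__main__":
    sequence = [1.0, 2.0, 3.0]
    print("original setup   :", causal_prefix_mean(sequence))
    print("sequence doubling:", sequence_doubling(causal_prefix_mean, sequence))
    print("parallel flipping:", parallel_flipping(causal_prefix_mean, sequence))
```

In the original setup, the first output is unaffected by later tokens. With either wrapper, the first kept output changes when later tokens change, which is the bidirectionality-mimicking effect both methods aim for. The cost is also visible in the sketch: Sequence Doubling doubles the input length, and Parallel Flipping needs two model passes.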