CaLMFlow: Volterra Flow Matching using Causal Language Models

Sizhuang He; Daniel Levine; Ivan Vrkic; Marco Francesco Bressana; David Zhang; Syed Asad Rizvi; Yangtian Zhang; Emanuele Zappala; David van Dijk

TL;DR

This work introduces CaLMFlow, a framework that casts flow matching as a Volterra integral equation (VIE) so that large language models (LLMs) can be applied to continuous data generation. The results highlight LLM-driven flow matching as a promising paradigm in generative modeling.

Abstract

We introduce CaLMFlow (Causal Language Models for Flow Matching), a novel framework that casts flow matching as a Volterra integral equation (VIE), leveraging the power of large language models (LLMs) for continuous data generation. CaLMFlow enables the direct application of LLMs to learn complex flows by formulating flow matching as a sequence modeling task, bridging discrete language modeling and continuous generative modeling. Our method implements tokenization across space and time, thereby solving a VIE over these domains. This approach enables efficient handling of high-dimensional data and outperforms ODE solver-dependent methods like conditional flow matching (CFM). We demonstrate CaLMFlow's effectiveness on synthetic and real-world data, including single-cell perturbation response prediction, showcasing its ability to incorporate textual context and generalize to unseen conditions. Our results highlight LLM-driven flow matching as a promising paradigm in generative modeling, offering improved scalability, flexibility, and context-awareness.
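To make the integral-equation view concrete, here is a minimal sketch; it is not the paper's implementation. A flow ODE $\dot{x}(t) = v(x(t), t)$ can be written in the integral form $x(t) = x(0) + \int_0^t v(x(s), s)\,ds$, which is a Volterra integral equation of the second kind. The sketch discretizes that integral with a left Riemann sum, so each new point is computed from the whole history of earlier points, much as a causal language model predicts the next token from its context. The names `velocity` and `rollout` and the straight-line velocity field are assumptions chosen for illustration; CaLMFlow learns this dependence with an LLM instead.

```python
import numpy as np


def velocity(x, t, target):
    # Hypothetical velocity field (an illustrative assumption, not a learned model):
    # the straight-line field that carries x toward `target`, reaching it at t = 1.
    return (target - x) / (1.0 - t)


def rollout(x0, target, num_steps):
    """Roll out x(t_k) = x(0) + sum_{j<k} v(x(t_j), t_j) * dt on a uniform time grid."""
    dt = 1.0 / num_steps
    times = np.arange(num_steps) * dt  # t_0 = 0, ..., t_{N-1} = 1 - dt
    history = [np.asarray(x0, dtype=float)]
    velocities = []
    for k in range(num_steps):
        # Evaluate the integrand at the most recent point of the trajectory.
        velocities.append(velocity(history[k], times[k], target))
        # Integral form: the next point depends on the full history of integrand values.
        next_point = history[0] + dt * np.sum(velocities, axis=0)
        history.append(next_point)
    return np.stack(history)


# Usage: transport a 2-D point from the origin to (1, -2) in 8 steps.
trajectory = rollout(np.array([0.0, 0.0]), np.array([1.0, -2.0]), num_steps=8)
```

With this particular field the discretized path is a straight line, so `trajectory` has one row per time point and its last row coincides with the target up to floating-point error.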

Paper Structure (42 sections, 1 theorem, 23 equations, 9 figures, 7 tables)

Introduction
Related Work
Volterra Flow Matching
Flow Matching as Volterra Integral Equations
Solving Volterra Integral Equations with Causal Language Models
Continuous Space Tokens via Variational Decoding
Spatiotemporal and Multi-trajectory Tokenization
Spatiotemporal Tokenization
Multi-trajectory Tokenization
Experiments
Synthetic Datasets
High Dimensional Data
Multi-trajectory Context
Single-cell Generation
Unconditional Generation of Single-cell Data
...and 27 more sections

Key Result

Theorem 1

Assume that $p_t(x) > 0$ for all $x \in \mathbb{R}^d$ and $t \in [0, 1]$. Then, up to a constant independent of $\theta$, $\mathcal{L}_{\text{CFM}}$ and $\mathcal{L}_{\text{FM}}$ are equal. Hence, $\nabla_\theta \mathcal{L}_{\text{FM}}(\theta) = \nabla_\theta \mathcal{L}_{\text{CFM}}(\theta)$.
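The two objectives are not defined on this page. For reference, the block below states them as they are usually written in the flow matching literature. The notation is an assumption here and may differ from the paper's: $v_\theta$ is the learned vector field, $u_t(x)$ the marginal target field, $u_t(x \mid x_1)$ the conditional target field, $q$ the data distribution, and $p_t(x \mid x_1)$ the conditional probability path.

```latex
\begin{align}
\mathcal{L}_{\text{FM}}(\theta)
  &= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x \sim p_t(x)}
     \left\| v_\theta(t, x) - u_t(x) \right\|^2, \\
\mathcal{L}_{\text{CFM}}(\theta)
  &= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q(x_1),\; x \sim p_t(x \mid x_1)}
     \left\| v_\theta(t, x) - u_t(x \mid x_1) \right\|^2 .
\end{align}
```

Under these definitions the theorem says that the tractable conditional objective can be minimized in place of the intractable marginal one, because the two have the same gradient with respect to $\theta$.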

Figures (9)

Figure 1: Overview of the CaLMFlow framework. CaLMFlow takes as input textual conditions and flows and generates the next time point for the flows. The textual condition is tokenized and embedded using the LLM tokenizer and embedding layer while the conditional flows are transformed into spatial-temporal tokens using a learned projection. If multiple conditional flows are input simultaneously, the tokens are ordered by flow, space, and then time. The LLM applies causal language modeling and generates the next time point for each flow.
Figure 2: Heatmaps of the ground truth 4 Gaussians dataset and that generated by CFM, CaLMFlow (1 traj.), and CaLMFlow (8 traj.). Both variants of CaLMFlow generate a distribution that closely matches the ground truth, with the 8-trajectory version further enhancing performance by distributing the data more evenly and accurately.
Figure 3: Comparison of conditional generation quality across different models for single-cell perturbation data. CaLMFlow (both the R.I. and N.L. variants) exhibits strong overlap between generated data distribution (blue) and the ground-truth distribution (orange), highlighting its superior capability to model data with unseen combinatorial perturbations. In contrast, other models struggle to produce realistic samples. For CaLMFlow, R.I. refers to randomly initialized CLM, and N.L. refers to natural language pretrained CLM.
Figure 4: Ablation results on temperature. Left: CaLMFlow generated data from 8gaussians to 2moons, using different temperature values. Right: 2-Wasserstein and MMD performance as a function of temperature. The plots show that a low, non-zero temperature value ($\tau = 0.2$) produces the best performance and that the VAE is necessary.
Figure 5: Comparison of the ground truth data and model-generated data, colored by Leiden labels. For the generated data, the Leiden labels are predicted by an MLP classifier trained on the ground truth. Both variants of CaLMFlow successfully generate data spanning all clusters and closely align with the ground truth distribution. In contrast, while CFM, CFM-OT, and CFM-SB generate data across all classes, they fail to differentiate between them, indicating a mismatch in the underlying community structures. Models such as scVI, scGPT, and CPA are unable to generate data for some classes altogether.
...and 4 more figures

Theorems & Definitions (1)

Theorem 1 (lipman2022flow)
