INRFlow: Flow Matching for INRs in Ambient Space

Yuyang Wang, Anurag Ranjan, Josh Susskind, Miguel Angel Bautista

TL;DR

INRFlow tackles cross-domain generative modeling by performing flow matching directly in ambient space on continuous coordinate-value maps $f: \mathbb{R}^d \to \mathbb{R}^d$. It replaces the usual two-stage pipeline (a data compressor followed by a latent-space generative model) with a single-stage, transformer-based architecture that predicts a velocity field conditioned on local context through a latent ${\bm{z}}_{f_t}$, trained with a point-wise CICFM objective. The method uses a forward process $f_t = a_t f + s_t \epsilon$, where $\epsilon$ is noise, and a rectified flow relation $u_t(x, y \mid \epsilon) = (\epsilon - y)/(1 - t)$ to enable continuous, resolution-agnostic generation across images, 3D data, and proteins. Experiments show performance competitive with domain-specific baselines and demonstrate that the same model applies across domains, which supports the case for a unified ambient-space generative framework. The work also points toward more efficient single-stage training and multi-domain co-training as future research.
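
The forward process and velocity relation above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the rectified-flow schedule $a_t = 1 - t$, $s_t = t$, reads $y$ in the velocity relation as the noised value at time $t$, and uses hypothetical function names. Under those assumptions the point-wise target reduces to $\epsilon - y_0$, where $y_0$ is the clean value.

```python
import random

def forward_process(y0, eps, t):
    """Noise each value independently: y_t = a_t * y0 + s_t * eps.

    Assumption: rectified-flow schedule a_t = 1 - t, s_t = t.
    """
    a_t, s_t = 1.0 - t, t
    return [a_t * yi + s_t * ei for yi, ei in zip(y0, eps)]

def velocity_target(y_t, eps, t):
    """Point-wise target u_t = (eps - y_t) / (1 - t), valid for t < 1."""
    return [(ei - yi) / (1.0 - t) for yi, ei in zip(y_t, eps)]

def pointwise_loss(pred, target):
    """Squared error averaged over points, each treated independently."""
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

rng = random.Random(0)
y0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # clean values at 8 coordinates
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]    # one noise draw per coordinate
t = 0.3

y_t = forward_process(y0, eps, t)
u = velocity_target(y_t, eps, t)

# Under the assumed schedule, the target equals eps - y0 at every point.
max_gap = max(abs(ui - (ei - yi)) for ui, ei, yi in zip(u, eps, y0))
```

Because the loss is a plain average over points, any subset of coordinates can be used in a training step, which is what makes the objective point-wise.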

Abstract

Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on irregular or unstructured data like 3D point clouds or protein structures. These models are commonly trained in two stages: first, a data compressor is trained, and in a subsequent training stage a flow matching generative model is trained in the latent space of the data compressor. This two-stage paradigm sets obstacles for unifying models across data domains, as hand-crafted compressor architectures are used for different data modalities. To this end, we introduce INRFlow, a domain-agnostic approach to learn flow matching transformers directly in ambient space. Drawing inspiration from INRs, we introduce a conditionally independent point-wise training objective that enables INRFlow to make predictions continuously in coordinate space. Our empirical results demonstrate that INRFlow effectively handles different data modalities such as images, 3D point clouds and protein structure data, achieving strong performance in different domains and outperforming comparable approaches. INRFlow is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

Paper Structure

This paper contains 29 sections, 4 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: (a) High-level overview of INRFlow using the image domain as an example. Our model can be interpreted as an encoder-decoder model where the decoder makes predictions independently for each coordinate-value pair given ${\bm{z}}_{f_t}$. For different data domains, the coordinate and value dimensionality changes, but the model is kept the same. (b) Samples generated by INRFlow trained on ImageNet 256$\times$256. (c) Image-to-3D point clouds generated by training INRFlow on Objaverse (Deitke et al., 2023). (d) Protein structures generated by INRFlow trained on SwissProt (Boeckmann et al., 2003). GT protein structures are depicted in green while the structures generated by INRFlow are shown in orange.
  • Figure 2: Architecture of our proposed INRFlow for different data domains including images and 3D point clouds. Note that models are trained for each data domain separately. Each spatially aware latent takes in a subset of neighboring context coordinate-value sets in coordinate space. The latents are then updated through self-attention. Decoded coordinate-value pairs cross-attend to the updated latents ${\bm{z}}_{f_t}$ to decode the corresponding velocity.
  • Figure 3: Examples of protein structures predicted by INRFlow on SwissProt, together with their LDDT and TM scores. The GT structures are depicted in green while the generated structures are shown in orange. INRFlow accurately captures the global spatial distribution of protein backbones, generating reasonable 3D structures for different protein sequences.
  • Figure 4: Examples of resolution-agnostic generation for INRFlow models trained on ImageNet-256 (a) and Objaverse-16k (b). To generate samples at higher resolutions than the one used in training, we fix the initial noise seed and increase the number of coordinate-value pairs evaluated by the model. Even though INRFlow was only trained on samples at a fixed resolution (256 for ImageNet and 16k for Objaverse), it can still generate realistic samples at higher resolutions. These results show that INRFlow is learning a continuous probability density field.
  • Figure 5: FID-50K over training iterations with different model sizes, where we see clear benefits of scaling up model sizes.
  • ...and 9 more figures
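
The encode-once, decode-per-coordinate pattern described in the captions of Figures 2 and 4 can be illustrated with a toy sketch. It is not the paper's architecture: it has no learned weights and no self-attention, it uses 1D coordinates, and it replaces learned cross-attention with a distance-based softmax. All names (`encode`, `decode`, `anchors`, `temperature`) are hypothetical. The only point it shows is that each query coordinate is decoded independently from the same latents, so the number of queries can change at sampling time.

```python
import math

def encode(context, anchors, k=2):
    """Each latent pools the values of its k nearest context points.

    Stand-in for spatially aware latents; returns (anchor, value) pairs.
    """
    latents = []
    for a in anchors:
        nearest = sorted(context, key=lambda p: abs(p[0] - a))[:k]
        latents.append((a, sum(v for _, v in nearest) / k))
    return latents

def decode(query_x, latents, temperature=0.05):
    """Decode one query by softmax-weighting all latents.

    No other query is involved, so queries are decoded independently.
    """
    scores = [-(query_x - a) ** 2 / temperature for a, _ in latents]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * z for w, (_, z) in zip(weights, latents)) / total

# Eight context coordinate-value pairs on [0, 1].
context = [(i / 7, math.sin(2 * math.pi * i / 7)) for i in range(8)]
anchors = [0.0, 0.25, 0.5, 0.75, 1.0]
latents = encode(context, anchors)

# Same latents, two query resolutions.
coarse = [decode(i / 3, latents) for i in range(4)]
fine = [decode(i / 15, latents) for i in range(16)]
```

In this sketch the coarse grid is a subset of the fine grid, so `coarse[i]` and `fine[5 * i]` agree: adding more queries does not change the prediction at existing coordinates.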