The Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis

Zhinan Xiong, Shunqi Yuan

TL;DR

This work observes that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK), and proposes Semantic Granularity Alignment, which advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.

Abstract

In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model's effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA), using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.
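The quadratic-form view can be illustrated on a toy problem. The sketch below is not the paper's construction of the interaction matrix. It assumes a linear velocity model $v_\theta(x) = Wx$ with no time conditioning, a linear interpolation path with target velocity $x_1 - x_0$, and a loss of $\tfrac{1}{2}\sum_i \|r_i\|^2$. Under those assumptions the NTK between two samples is $\langle x_i, x_j \rangle$ times the identity, and the gradient-flow loss decrease equals $\sum_{ij} \Omega_{ij}$ with $\Omega_{ij} = \langle x_i, x_j \rangle \langle r_i, r_j \rangle$. All function names here are hypothetical.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def residuals(W, xs, us):
    # r_i = v_theta(x_i) - u_i for the toy linear velocity model v_theta(x) = W x
    return [[dot(row, x) - t for row, t in zip(W, u)] for x, u in zip(xs, us)]

def loss(W, xs, us):
    # L = 1/2 * sum_i ||r_i||^2 (assumed normalization)
    return 0.5 * sum(dot(r, r) for r in residuals(W, xs, us))

def interaction_matrix(W, xs, us):
    # Omega[i][j] = K(x_i, x_j) * <r_i, r_j>; for this linear model K = <x_i, x_j>.
    # Diagonal entries are per-sample terms, off-diagonal entries are cross terms.
    rs = residuals(W, xs, us)
    n = len(xs)
    return [[dot(xs[i], xs[j]) * dot(rs[i], rs[j]) for j in range(n)]
            for i in range(n)]

def gradient(W, xs, us):
    # dL/dW = sum_i r_i x_i^T
    rs = residuals(W, xs, us)
    return [[sum(r[a] * x[b] for r, x in zip(rs, xs)) for b in range(len(W[0]))]
            for a in range(len(W))]

def make_batch(n=4, dim=2, seed=0):
    # Toy flow-matching batch: x_t on the straight path from noise x0 to data x1.
    rng = random.Random(seed)
    xs, us = [], []
    for _ in range(n):
        x0 = [rng.gauss(0, 1) for _ in range(dim)]
        x1 = [rng.gauss(0, 1) for _ in range(dim)]
        t = rng.random()
        xs.append([(1 - t) * a + t * b for a, b in zip(x0, x1)])
        us.append([b - a for a, b in zip(x0, x1)])  # target velocity x1 - x0
    return xs, us

xs, us = make_batch()
W = [[0.0, 0.0], [0.0, 0.0]]

omega = interaction_matrix(W, xs, us)
predicted_drop = sum(sum(row) for row in omega)  # quadratic form 1^T Omega 1

# Compare with the measured loss decrease per unit step for a small step size.
eta = 1e-4
G = gradient(W, xs, us)
W_new = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W, G)]
actual_drop = (loss(W, xs, us) - loss(W_new, xs, us)) / eta

print("predicted:", predicted_drop, "measured:", actual_drop)
```

In this toy setting a negative off-diagonal entry means that two samples push the shared parameters in opposing directions, which is one way to read the gradient conflicts mentioned in the abstract.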

Paper Structure (70 sections, 45 equations, 21 figures, 7 tables, 2 algorithms)

Figures (21)

  • Figure 1: Qualitative comparison between the Baseline and our SGA method across three GDA domains. For each row, the reference images (left) define the target domain; the Baseline (middle) fails to preserve domain-specific attributes, while SGA (right) faithfully captures the target domain characteristics.
  • Figure 2: Structure of the Data Interference Matrix $\mathbf{\Omega}$. Diagonal entries (blue) represent independent learning within each sub-manifold. Off-diagonal entries encode cross-scale interactions: constructive (synergy, green) or destructive (conflict, red).
  • Figure 3: The optimization landscape $\mathcal{L}_{FM} = \boldsymbol{\alpha}^\top \mathbf{\Omega}\,\boldsymbol{\alpha}$ of CFM fine-tuning. Path 1 (red): Underfitting Region. Path 3 (blue): OOD Region. Only Path 2 (green) navigates the narrow Goal Region.
  • Figure 4: Failure analysis of standard fine-tuning. All outputs exhibit dominant prior characteristics regardless of target domain, a prior capture phenomenon that traps the model in the Underfitting Region of Figure 3.
  • Figure 5: Overview of the Hierarchical Semantic Decomposition (H-SD) pipeline. Raw images are parsed by a general-purpose detector (e.g., multimodal models, YOLO, or Grounding DINO) into three granularities ($Y_{\text{Macro}}$, $Y_{\text{Meso}}$, $Y_{\text{Micro}}$), filtered via IoU-based redundancy elimination, and post-processed into aspect-ratio-preserving resolution buckets.
  • ...and 16 more figures