What Does Flow Matching Bring To TD Learning?

Bhavya Agrawalla, Michal Nauman, Aviral Kumar

TL;DR

The authors argue that flow-matching critics improve TD learning through two mechanisms: reading out values by integration enables robust value prediction via test-time recovery, and dense velocity supervision at multiple interpolant values induces more plastic feature learning.

Abstract

Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why or how this approach differs from standard critics. Contrary to conventional belief, we show that the success of flow-matching critics is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that the use of integration for reading out values and dense velocity supervision at each step of this integration process for training improves TD learning via two mechanisms. First, it enables robust value prediction through test-time recovery, whereby iterative computation through integration dampens errors in early value estimates as more integration steps are performed. This recovery mechanism is absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more plastic feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2$\times$ in final performance and around 5$\times$ in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., in high-UTD online RL problems, while remaining stable during learning.
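The two ingredients named in the abstract, reading out a value by integration and supervising the velocity at interpolant values, can be illustrated with a minimal sketch. It assumes a linear interpolant between an initial sample z0 and a scalar target, and plain Euler integration; the names euler_readout and flow_matching_loss, the step count, and the closed-form toy velocity field are hypothetical and not taken from the paper. In an actual critic, the velocity would be a network conditioned on state and action, and the target would be a TD target.

```python
import numpy as np

def euler_readout(velocity, z0, num_steps=8):
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with Euler steps.

    The final z plays the role of the value estimate (assumed readout rule).
    """
    z = np.asarray(z0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity(z, k * dt)
    return z

def flow_matching_loss(velocity, z0, target, t):
    """Squared error between the predicted velocity and the straight-line
    velocity, evaluated at interpolant value(s) t in [0, 1)."""
    z_t = (1.0 - t) * z0 + t * target   # linear interpolant (assumption)
    v_star = target - z0                # velocity of the straight-line path
    return np.mean((velocity(z_t, t) - v_star) ** 2)

# Toy check: for a single fixed target y, the field (y - z) / (1 - t)
# matches the straight-line velocity everywhere along the path.
y = 3.0
toy_velocity = lambda z, t: (y - z) / (1.0 - t)
z0 = np.array([0.5, -1.0])
q = euler_readout(toy_velocity, z0)
loss = flow_matching_loss(toy_velocity, z0, y, t=np.array([0.25, 0.75]))
```

With this toy field, the readout lands on the target from either starting point and the loss is zero up to floating-point error. A learned field would only approximate this behavior.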

Paper Structure

This paper contains 31 sections, 8 theorems, 70 equations, 12 figures, and 3 tables.

Key Result

Theorem 6.1

Consider training monolithic and flow-matching models by minimizing squared error against a non-stationary target $y(m)$. Fix an interval of training steps $m \in [m_0,m_1]$ and suppose that feature directions are frozen during this training interval. Then the following hold:

Figures (12)

  • Figure 1: Flow-matching critics with non-stationary TD targets. Values are computed by integrating a learned velocity field over multiple steps. This iterative process enables test-time recovery (left), where errors in early integration steps are dampened by later steps. When TD targets change ($\theta \rightarrow \theta'$), the integration dynamics can absorb part of the shift, allowing earlier features to remain largely unchanged, improving plasticity.
  • Figure 2: Performance of flow-matching (floq) and monolithic (FQL) critics when trained with target noise. Observe that flow-matching critics are much more robust to noise in TD targets, while performance of FQL (monolithic critic) degrades substantially faster, even when they start at a similar point (antmaze/antsoccer).
  • Figure 3: Feature norms. (a) Learned feature norms and average Q-values for monolithic critics (FQL) and flow-matching critics (floq) in the penultimate and last hidden layers. While the last hidden layer adapts to the scale of Q-values for both methods, the penultimate hidden layer in floq exhibits a much more rapid decrease in feature norms compared to FQL. This indicates that floq learns more flexible and adaptive features in the penultimate hidden layer that are largely decoupled from the magnitude of Q-values. (b) Penultimate hidden layer feature norms for floq trained with TD, SARSA, and MC targets. floq with TD shows the fastest decrease in feature norms, whereas SARSA and MC trends resemble those of the monolithic FQL critic. This suggests that flow-matching critics, particularly under TD learning, develop more robust representations under non-stationary targets.
  • Figure 4: Measuring feature plasticity on four tasks, by freezing all layers except the final two at $T = 0.5M$ steps (gray shaded region denotes the pre-freeze phase). Solid curves correspond to the default (fully trained) runs, while dashed curves show performance after freezing the penultimate hidden features. Across all environments, FQL with a monolithic critic exhibits a sharp performance collapse once features are frozen (brown vs orange), indicating an inability to represent future Q-functions. In contrast, flow-matching critics remain stable and continue to improve after freezing, showing substantially greater plasticity.
  • Figure 5: (a) Frozen features with a monolithic ResNet critic. Although a monolithic ResNet admits a computation graph similar to a flow-matching critic, freezing its features leads to a collapse in performance during subsequent offline RL training. (b) Frozen features with a monolithic transformer-based critic. Despite performing well with the standard FQL algorithm, monolithic transformer critics still suffer a performance collapse when their layers are frozen during subsequent training, indicating that the plasticity issue persists across model architectures. (c) Frozen features with a single integration step. With only one integration step, a flow-matching critic is more stable than a monolithic network, but less stable than full flow matching with multiple integration steps (performance drop indicated by the red arrow), highlighting the essential role of integration in preserving feature plasticity.
  • ...and 7 more figures
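
The test-time recovery idea in the Figure 1 caption, where errors in early integration steps are dampened by later steps, can be illustrated with a toy numerical experiment. This is only a sketch under assumed dynamics: the two velocity fields, the gain of 3.0, and the helper names integrate and residual_error are hypothetical and are not the paper's Definition 5.1 or its conic condition. It contrasts a field that depends on the current value with one that ignores it.

```python
def integrate(velocity, z0, num_steps, perturb_at=None, eps=0.0):
    """Euler-integrate dz/dt = velocity(z, t) on [0, 1].

    If perturb_at is given, eps is added to z just before that step to
    simulate an error in an early value estimate.
    """
    z, dt = float(z0), 1.0 / num_steps
    for k in range(num_steps):
        if k == perturb_at:
            z += eps
        z += dt * velocity(z, k * dt)
    return z

def residual_error(velocity, z0=0.0, num_steps=10, perturb_at=2, eps=1.0):
    """Gap between the perturbed and the clean readout at t = 1."""
    clean = integrate(velocity, z0, num_steps)
    noisy = integrate(velocity, z0, num_steps, perturb_at, eps)
    return abs(noisy - clean)

target = 5.0
state_dependent = lambda z, t: 3.0 * (target - z)  # pulls z toward the target
state_independent = lambda z, t: target            # ignores the current z

recovered = residual_error(state_dependent)
unrecovered = residual_error(state_independent)
```

In this toy setup, each of the eight Euler steps after the perturbation multiplies the injected error by (1 - 3/10), so the state-dependent field leaves only a small fraction of it at the readout. The state-independent field carries the full error through to the end.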

Theorems & Definitions (17)

  • Definition 5.1: Test-Time Recovery
  • Definition 5.2: $c$-conic condition on a given velocity field; simplified
  • Theorem 6.1: Flow critics can learn by reweighting existing features; monolithic critics must modify features.
  • Lemma D.3: Trajectory containment
  • proof
  • Theorem D.4: Polynomial Test-Time Recovery
  • proof
  • Lemma E.1: Closed-form predictor
  • proof
  • Lemma E.2: Gradient flow for effective weight vector
  • ...and 7 more