AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

TL;DR

This work introduces Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching that consistently outperforms REPA baselines. It finds that alignment is most effective when applied to the causally dominant layers that drive the velocity field.

Abstract

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching depends critically on which layers are supervised, a choice typically made heuristically by depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. First, we find that the layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation; we call this phenomenon Store-Contribute Dissociation (SCD). To turn this insight into actionable training guidance, we propose forward-only gate ablation (FoG-A), which quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
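
To make "sparse layer selection and adaptive weighting for alignment" concrete, the following is a minimal sketch of a layer-weighted alignment loss. The abstract does not give the loss itself, so the cosine-similarity form, the function names `cosine` and `alignment_loss`, and the toy features are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def alignment_loss(hidden, teacher, weights):
    """Weighted sum of (1 - cosine similarity) over the selected layers only.

    hidden:  dict mapping layer index -> projected student feature (assumed form)
    teacher: frozen teacher feature
    weights: dict mapping layer index -> non-negative weight; layers that are
             absent from this dict receive no alignment supervision
    """
    return sum(w * (1.0 - cosine(hidden[k], teacher)) for k, w in weights.items())

# Toy example (hypothetical values): three layers, only two are supervised.
teacher = [1.0, 0.0]
hidden = {0: [2.0, 0.0], 5: [0.0, 3.0], 9: [-1.0, 0.0]}
weights = {0: 0.25, 5: 0.75}  # layer 9 is left out, i.e. sparse selection

loss = alignment_loss(hidden, teacher, weights)
print(loss)
```

In this sketch, layer 0 is already aligned with the teacher and contributes nothing, layer 5 is orthogonal and contributes its full weight, and layer 9 is ignored because it was not selected.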

Paper Structure (48 sections, 23 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: The Spatiotemporal Anatomy of SCD. We visualize the layer-wise dynamics across diffusion time steps ($t=0 \to 1$). (a) Representation: The LASP score indicates that semantic information is consistently stored in the deep layers (L20-24), acting as a static knowledge reservoir independent of the generation phase. (b) Contribution: In contrast, the FoG-A score reveals a dynamic functional landscape. While early layers (L1-3) act as the primary Causal Driver due to Jacobian sensitivity, a critical Dynamic Transition occurs in the middle layers (L6-12) during the intermediate denoising phase ($t \approx 0.5$). Key Insight: This spatiotemporal mismatch explains why static alignment heuristics (e.g., fixing Layer 1 or Layer 8) are suboptimal: they fail to account for the shift in functional importance from early to middle layers during the intermediate denoising phase.
  • Figure 2: Diagnosing Representation Storage. (a) BiT-C establishes a dual-modality supervision baseline using frozen Whisper (semantic) and BEATs (acoustic) teachers to anchor the conditioning interface. (b) LASP probes "what the network knows" by projecting layer-wise representations into a shared teacher space using a frozen head, enabling cross-layer comparison of information storage.
  • Figure 3: From Causal Attribution to Optimization. (a) FoG-A determines "what the network uses" by measuring the velocity field perturbation ($\|v_\theta^{\setminus k} - v_\theta\|$) caused by ablating individual layers, creating a functional attribution map. (b) AG-REPA utilizes the causal insights from FoG-A to selectively apply alignment supervision only to critical layers, bridging the gap between representation storage and causal contribution identified as the SCD.
  • Figure 4: The Unified Audio Generation Framework. The system utilizes a two-stage cascade architecture decoupling semantic planning from acoustic rendering: (a) Tokenization: Domain-specific pathways process inputs into a unified discrete sequence, utilizing $S^3$ tokens for speech and AudioSet (AS) tokens for audio, optionally interleaved with BEATs features to inject dense acoustic priors. (b) Stage 1 (Autoregressive LLM): A causal language model predicts target acoustic tokens, incorporating reference style via a learnable projection and injection module ($P_{\text{style}}$) while utilizing an auxiliary coarse prediction head. (c) Stage 2 (Flow Matching): A Diffusion Transformer (DiT) backbone predicts the velocity field $v_\theta$ to transform noise into target mel-spectrograms via Flow Matching, conditioned on projected tokens and reference audio, before final waveform synthesis by a neural vocoder.
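
The FoG-A measurement described in the Figure 3 caption, $\|v_\theta^{\setminus k} - v_\theta\|$, can be sketched on a toy residual stack. This is an illustration under stated assumptions rather than the paper's implementation: the tanh residual branches, the 0/1 gate on each branch, and the names `make_layer`, `forward`, `fog_a_scores`, and `select_layers` are hypothetical, and the final hidden state simply stands in for the predicted velocity. Normalising the kept scores into weights is likewise one plausible reading of "adaptive weighting".

```python
import math

def make_layer(scale):
    """Hypothetical residual branch: an elementwise tanh scaled by `scale`."""
    def branch(h):
        return [scale * math.tanh(x) for x in h]
    return branch

def forward(layers, x, gates):
    """Residual stack h <- h + g_k * f_k(h); the output stands in for v_theta."""
    h = list(x)
    for f, g in zip(layers, gates):
        h = [hi + g * bi for hi, bi in zip(h, f(h))]
    return h

def fog_a_scores(layers, x):
    """Score layer k by the L2 change in the output when only its gate is closed.

    Uses forward passes only; no gradients are computed.
    """
    n = len(layers)
    v_full = forward(layers, x, [1.0] * n)
    scores = []
    for k in range(n):
        gates = [1.0] * n
        gates[k] = 0.0  # ablate layer k, keep every other layer active
        v_ablated = forward(layers, x, gates)
        scores.append(math.dist(v_ablated, v_full))
    return scores

def select_layers(scores, top_k):
    """Keep the top_k layers and normalise their scores into alignment weights."""
    keep = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)[:top_k]
    total = sum(scores[k] for k in keep)
    return {k: scores[k] / total for k in sorted(keep)}

# Toy stack (hypothetical): the last branch has zero scale, so it is
# functionally passive and ablating it cannot change the output.
layers = [make_layer(s) for s in (1.0, 0.1, 0.5, 0.0)]
x = [0.5, -1.0, 2.0]

scores = fog_a_scores(layers, x)
weights = select_layers(scores, top_k=2)
print(scores)
print(weights)
```

The passive fourth layer receives a score of exactly zero and is never selected, which mirrors the idea of steering alignment away from layers that do not drive the output. In a real model the scores would presumably be averaged over inputs and time steps, since Figure 1 reports that layer contributions shift with $t$.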