DGPO: RL-Steered Graph Diffusion for Neural Architecture Generation

Aleksei Liuliakov; Luca Hermes; Barbara Hammer

TL;DR

The paper proposes Directed Graph Policy Optimization (DGPO), which extends reinforcement learning fine-tuning of discrete graph diffusion models to DAGs via topological node ordering and positional encoding. It demonstrates that reinforcement learning-steered discrete diffusion, once extended to handle directionality, provides a controllable generative framework for directed combinatorial structures.

Abstract

Reinforcement learning fine-tuning has proven effective for steering generative diffusion models toward desired properties in image and molecular domains. Graph diffusion models have similarly been applied to combinatorial structure generation, including neural architecture search (NAS). However, neural architectures are directed acyclic graphs (DAGs) where edge direction encodes functional semantics such as data flow, information that existing graph diffusion methods, designed for undirected structures, discard. We propose Directed Graph Policy Optimization (DGPO), which extends reinforcement learning fine-tuning of discrete graph diffusion models to DAGs via topological node ordering and positional encoding. Validated on NAS-Bench-101 and NAS-Bench-201, DGPO matches the benchmark optimum on all three NAS-Bench-201 tasks (91.61%, 73.49%, 46.77%). The central finding is that the model learns transferable structural priors: pretrained on only 7% of the search space, it generates near-oracle architectures after fine-tuning, within 0.32 percentage points of the full-data model and extrapolating 7.3 percentage points beyond its training ceiling. Bidirectional control experiments confirm genuine reward-driven steering, with inverse optimization reaching near random-chance accuracy (9.5%). These results demonstrate that reinforcement learning-steered discrete diffusion, once extended to handle directionality, provides a controllable generative framework for directed combinatorial structures.
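
The abstract names two ingredients for handling directionality: a topological node ordering and a positional encoding. The sketch below illustrates one way these could fit together. It is not the authors' implementation: the function names, the tie-breaking rule (smallest node index first), and the sinusoidal form of the encoding are all assumptions made for illustration.

```python
import heapq
import math


def topological_order(num_nodes, edges):
    """Kahn's algorithm. Ties are broken by smallest node index (an assumption)."""
    indegree = [0] * num_nodes
    successors = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Nodes with no incoming edges are ready to be placed.
    ready = [n for n in range(num_nodes) if indegree[n] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != num_nodes:
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal encoding of an integer position (assumed form)."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def encode_dag(num_nodes, edges, dim=4):
    """Map each node to an encoding of its rank in the topological order."""
    order = topological_order(num_nodes, edges)
    rank = {node: pos for pos, node in enumerate(order)}
    return {node: sinusoidal_encoding(rank[node], dim) for node in range(num_nodes)}


# Hypothetical 4-node cell: node 2 is the input, node 1 is the output.
example_edges = [(2, 0), (0, 1), (2, 3), (3, 1)]
example_order = topological_order(4, example_edges)
example_encoding = encode_dag(4, example_edges)
```

Because every edge points from a lower rank to a higher rank under such an ordering, attaching the rank to each node gives an otherwise direction-agnostic model a way to tell an edge's source from its target.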

Paper Structure (13 sections, 3 equations, 4 figures, 2 tables)

Introduction
Related Work
Method
Preliminaries
DGPO: Extending GDPO to Directed Acyclic Graphs
Training Objective
Two-Phase Training
Experiments
Experimental Setup
Baseline Results
Transferable Structural Priors
Steering Versatility
Conclusion

Figures (4)

Figure 1: Training dynamics of DGPO on (a) NB101 and (b) NB201 (CIFAR-10): validation accuracy (solid, left axis) and mean reward (dashed, right axis) over RL-FT epochs, with ${\pm}1\sigma$ bands across 3 seeds. Horizontal lines: random search and pretrained-only baselines. Both metrics converge reliably, confirming that RL fine-tuning steers the generation distribution toward higher-quality architectures.
Figure 2: Distribution of generated architectures over RL-FT epochs on (a) NB101 and (b) NB201 (CIFAR-10). Each strip shows 300 sampled architectures (dots) at a given epoch, with mean accuracy (diamond) and top-5 architectures (stars). As training progresses (bottom to top), the mean and overall sample density shift markedly toward higher accuracy, demonstrating that RL fine-tuning reshapes the generative distribution rather than merely selecting isolated high performers. Single seed (42); $n{=}300$ samples per epoch.
Figure 3: OOD architecture discovery on (a, c) NB101 ($\mathcal{T}{=}0.87$) and (b, d) NB201 ($\mathcal{T}{=}0.85$). Top: distribution comparison of full pretrain (reference), filtered pretrain (epoch 0), and RL-FT (final epoch). Shaded region marks OOD architectures above $\mathcal{T}$. Bottom: threshold crossing rate over RL-FT epochs. After pretraining on only sub-threshold architectures, RL-FT recovers above-threshold generation, demonstrating transferable structural priors. Single seed (42) for distributions; crossing rates are 3-seed aggregates.
Figure 4: Bidirectional steering on (a) NB101 and (b) NB201 (CIFAR-10): forward (maximize, $\uparrow$) and inverse (minimize, $\downarrow$) DGPO trajectories over RL-FT epochs, with ${\pm}1\sigma$ bands (3 seeds). Dashed/dotted lines: expected max/min of random batches ($N{=}15$, bootstrap $K{=}10{,}000$). The inverse trajectory converges to near random-chance accuracy (${\sim}$9.5%), supporting reward-driven distribution steering.
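
The random-batch reference lines in Figure 4 (expected max/min with $N{=}15$ and bootstrap $K{=}10{,}000$) can be estimated along the following lines. This is a minimal sketch, not the authors' procedure: the function name, the seed, and the choice to draw each batch with replacement are assumptions, and the accuracy pool is made-up illustrative data rather than benchmark values.

```python
import random


def expected_batch_extremes(accuracies, batch_size=15, num_bootstrap=10_000, seed=0):
    """Estimate the expected max and min accuracy of a random batch.

    Draws `num_bootstrap` batches of `batch_size` accuracies (with replacement,
    an assumption) and averages the per-batch maximum and minimum.
    """
    rng = random.Random(seed)
    max_sum = 0.0
    min_sum = 0.0
    for _ in range(num_bootstrap):
        batch = rng.choices(accuracies, k=batch_size)
        max_sum += max(batch)
        min_sum += min(batch)
    return max_sum / num_bootstrap, min_sum / num_bootstrap


# Hypothetical pool of validation accuracies, for illustration only.
pool = [0.10, 0.55, 0.80, 0.85, 0.88, 0.90, 0.91, 0.92, 0.93, 0.94]
expected_max, expected_min = expected_batch_extremes(pool)
```

A steered trajectory that rises above the expected maximum, or falls below the expected minimum, is doing better than picking the best (or worst) of 15 random architectures, which is the comparison these reference lines support.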
