Discrete Diffusion Models Exploit Asymmetry to Solve Lookahead Planning Tasks

Itamar Trainin, Shauli Ravfogel, Omri Abend, Amir Feder

TL;DR

This work investigates the distinct mechanisms that emerge when training Autoregressive (AR) versus Non-Autoregressive (NAR) models, such as Discrete Diffusion Models (dLLMs), on lookahead tasks. It demonstrates that NAR models learn to solve planning tasks by using future tokens to decode backwards, avoiding the need to learn complex traversal mechanisms entirely.

Abstract

While Autoregressive (AR) Transformer-based Generative Language Models are frequently employed for lookahead tasks, recent research suggests a potential discrepancy in their ability to perform planning tasks that require multi-step lookahead. In this work, we investigate the distinct emergent mechanisms that arise when training AR versus Non-Autoregressive (NAR) models, such as Discrete Diffusion Models (dLLMs), on lookahead tasks. By requiring the models to plan ahead to reach the correct conclusion, we analyze how these two paradigms fundamentally differ in their approach to the problem. We identify a critical asymmetry in planning problems: while forward generation requires complex lookahead at branching junctions, reverse generation is often deterministic. This asymmetry creates an opportunity for NAR models. Through mechanistic analysis of training and inference dynamics, we demonstrate that NAR models learn to solve planning tasks by utilizing future tokens to decode backwards, avoiding the need to learn complex traversal mechanisms entirely. Consequently, we report that both AR and NAR models are able to achieve perfect accuracy on the lookahead task. However, NAR models require exponentially fewer training examples and shallower architectures compared to AR models, which often fail to converge without specific curriculum adjustments.
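
The asymmetry described in the abstract can be illustrated on a toy star graph. The sketch below is not the authors' code or model; it only shows the structure of the task under stated assumptions: a graph with one center vertex and `num_arms` disjoint arms of `arm_len` vertices each, with the target placed at the end of one arm. The helper names (`make_star_graph`, `decode_backward`, `forward_branching`) are hypothetical. Walking forward from the center requires choosing among all arms, whereas walking backward from the target follows a unique parent at every step.

```python
import random


def make_star_graph(num_arms, arm_len, seed=0):
    """Build a toy star graph (assumed layout): a center vertex with
    `num_arms` disjoint arms, each containing `arm_len` vertices.

    Returns (center, children, parent), where `children` maps a vertex to
    its outgoing neighbours and `parent` maps a non-center vertex to its
    unique predecessor.
    """
    rng = random.Random(seed)
    labels = list(range(1 + num_arms * arm_len))
    rng.shuffle(labels)  # random vertex ids, so labels carry no hint
    center = labels[0]
    children = {v: [] for v in labels}
    parent = {}
    idx = 1
    for _ in range(num_arms):
        prev = center
        for _ in range(arm_len):
            v = labels[idx]
            idx += 1
            children[prev].append(v)
            parent[v] = prev
            prev = v
    return center, children, parent


def decode_backward(target, center, parent):
    """Recover the center-to-target path by following parent pointers.
    Every step has exactly one option, so no lookahead is needed."""
    path = [target]
    while path[-1] != center:
        path.append(parent[path[-1]])
    return path[::-1]


def forward_branching(path, children):
    """Number of candidate next vertices at each step of a forward walk."""
    return [len(children[v]) for v in path[:-1]]


center, children, parent = make_star_graph(num_arms=3, arm_len=3, seed=1)
target = next(v for v in children if not children[v])  # any arm end
path = decode_backward(target, center, parent)
print("path:", path)
print("forward choices per step:", forward_branching(path, children))
```

In this toy setting the only ambiguous forward decision is the first one at the center, where all arms are candidates; the backward walk never branches. This mirrors the abstract's claim only at the level of the task structure and says nothing about what a trained model actually computes.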

Paper Structure

This paper contains 16 sections, 6 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Depiction of a lookahead Star-Graph with 3 arms and 3 vertices in each arm, and a visualization of its tokenized sequence format for the Original lookahead task, the $1^{\text{st}}$-Order task, where the path is decoded in reverse, the $\ell^{\text{th}}$-Order task, where only the first and second vertices are predicted, and the "Hinted" task, where the true first and second vertices are provided in context. Purple indicates tokens predicted at test time.
  • Figure 2: Comparison of AR models trained with (orange) and without (pink) gradients on the graph prefix across graph settings. All models were trained on up to 50M distinct training examples.
  • Figure 3: A visualization of the dLLM's NAR decoding process across different graph settings. The x-axis represents the vertex position in the path, and the y-axis represents the decoding step. Color indicates the percentage of examples where a specific token was unmasked (predicted) at a given step.
  • Figure 4: A comparison of the training dynamics of AR and NAR models, depicting the per-position (vertex) average accuracy over the test set. Note that "G(2,10) - AR" did not reach convergence. For visualization purposes, the x-axis scale varies across panels.
  • Figure 5: Training convergence comparison between AR and NAR models across varying graph complexities ($G(d, l)$). The y-axis denotes the exact-match accuracy on a held-out test set, while the x-axis indicates the number of unique training examples observed.
  • ...and 4 more figures