Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules

Binxu Wang, Jiaqi Shang, Haim Sompolinsky

TL;DR

This work compares two generative model families, diffusion models (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba), and evaluates their ability to generate structurally consistent samples and to perform panel completion via unconditional and conditional sampling.

Abstract

Humans excel at discovering regular structures from limited samples and applying inferred rules to novel settings. We investigate whether modern generative models can similarly learn underlying rules from finite samples and perform reasoning through conditional sampling. Inspired by the Raven's Progressive Matrices task, we designed the GenRAVEN dataset, in which each sample consists of three rows and one of 40 relational rules governing object position, number, or attributes applies to all rows. We trained generative models to learn the data distribution, with samples encoded as integer arrays to focus on rule learning. We compared two generative model families: diffusion models (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba). We evaluated their ability to generate structurally consistent samples and to perform panel completion via unconditional and conditional sampling. We found that diffusion models excel at unconditional generation, producing more novel and consistent samples from scratch and memorizing less, but perform less well at panel completion, even with advanced conditional sampling methods. Conversely, autoregressive models excel at completing missing panels in a rule-consistent manner but generate less consistent samples unconditionally. We observe diverse data scaling behaviors: for both model families, rule learning emerges at a certain dataset size, on the order of thousands of examples per rule. With more training data, diffusion models improve in both unconditional and conditional generation. For autoregressive models, however, panel completion improves with more training data while unconditional generation consistency declines. Our findings highlight complementary capabilities and limitations of diffusion and autoregressive models in rule learning and reasoning tasks, suggesting avenues for further research into their mechanisms and potential for human-like reasoning.

Paper Structure

This paper contains 38 sections and 5 figures.

Figures (5)

  • Figure 1: Design of the study. A. Example Raven's progressive matrix and its encoding as a $3\times9\times9$ integer array. The underlying rule is constantshape. B, C. Two families of generative models, diffusion and autoregressive, and their training methods: denoising and next-token prediction. D. The 40 relational rules, with 5 rules held out during training.
  • Figure 2: Diffusion models lead in generating structurally consistent samples ab initio. A, B. Dynamics of sample consistency, measured by valid row fraction and C2, C3 fractions, during training, for a diffusion model (A. DiT-S/1) and an autoregressive model (B. GPT2). C. Comparison of ab initio generation consistency across model families. D, E. Frequency of generating C3 samples of each rule (values are multiplied by 40 to normalize) for DiT-S and GPT2-M. Magenta frames mark the 5 rules held out from generative model training.
  • Figure 3: Diffusion and autoregressive models show diverse data memorization properties. A, B. Memorization of the training and control sets at multiple levels for samples generated throughout training, for A. DiT-S and B. GPT2-M. Solid lines show the fraction of samples, rows, and panels that have copies in the training set; dashed lines show the control, i.e. the fraction that have copies in a control set of samples unseen during training.
  • Figure 4: Autoregressive models lead in rule-consistent panel completion. A, B. Learning dynamics of panel completion accuracy for trained and held-out rules, for a diffusion model (A. DiT-S/1) and an autoregressive model (B. GPT2). C. Comparison of panel completion accuracy across model classes and samplers. D, E. Panel completion accuracy per rule for DiT-S and GPT2-M after training. F, G. Correlation between panel completion accuracy and C3 sample generation frequency (Fig. 2D, E) per rule for DiT-S and GPT2-M after training.
  • Figure 5: Diverse scaling behavior of diffusion and autoregressive models. A, B. Data scaling curves of diffusion models (EDM, DiT, SiT): A. ab initio generation consistency (C3 fraction) and B. panel completion accuracy. C, D. Analogous data scaling curves for the autoregressive model (GPT2).
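
The memorization measure described in the Figure 3 caption (the fraction of generated samples or panels that have exact copies in the training set, compared against a control set unseen during training) can be illustrated with a minimal sketch. This is not the authors' code. The function names `copied_fraction` and `split_panels` are hypothetical, the data here is random rather than GenRAVEN, and the panel layout (axis 0 holds attributes, and the $9\times9$ grid is tiled as $3\times3$ panels of $3\times3$ cells) is an assumption not stated in the text above.

```python
import numpy as np

def _key(arr):
    # Hashable fingerprint of an integer array (shape plus raw bytes).
    a = np.ascontiguousarray(arr, dtype=np.int64)
    return a.shape, a.tobytes()

def copied_fraction(generated, reference):
    """Fraction of arrays in `generated` with an exact copy in `reference`."""
    ref_keys = {_key(x) for x in reference}
    generated = list(generated)
    if not generated:
        return 0.0
    return sum(_key(x) in ref_keys for x in generated) / len(generated)

def split_panels(sample):
    # ASSUMPTION: sample has shape (3, 9, 9); axis 0 indexes attributes and
    # the 9x9 grid is tiled as 3x3 panels, each with 3x3 cells.
    sample = np.asarray(sample)
    return [sample[:, 3 * r:3 * r + 3, 3 * c:3 * c + 3]
            for r in range(3) for c in range(3)]

# Toy stand-in data (random integers, not real GenRAVEN samples).
rng = np.random.default_rng(0)
train = rng.integers(0, 10, size=(100, 3, 9, 9))
control = rng.integers(0, 10, size=(100, 3, 9, 9))
# 5 samples copied from the training set, 15 freshly drawn.
generated = np.concatenate([train[:5], rng.integers(0, 10, size=(15, 3, 9, 9))])

# Sample-level memorization: training set (solid line) vs. control (dashed line).
sample_train = copied_fraction(generated, train)
sample_control = copied_fraction(generated, control)

# Panel-level memorization under the assumed layout.
gen_panels = [p for s in generated for p in split_panels(s)]
train_panels = [p for s in train for p in split_panels(s)]
panel_train = copied_fraction(gen_panels, train_panels)

print(sample_train, sample_control, panel_train)
```

Comparing against a control set of the same size helps separate genuine copying of training data from coincidental matches, which become more likely at finer levels such as single panels.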