A Neural Affinity Framework for Abstract Reasoning: Diagnosing the Compositional Gap in Transformer Architectures via Procedural Task Taxonomy

Miguel Ingram, Arthur Joseph Merritt

TL;DR

This work argues that transformer architectures are fundamentally misaligned with the primitives required for abstract reasoning in ARC. It introduces a Neural Affinity Framework and a 9-category taxonomy of re-arc tasks, validated both through code analysis and visually, and demonstrates its transfer to ARC-AGI-2 tasks. The key findings reveal a pervasive Compositional Gap (high local pattern mastery but poor global synthesis) and a Neural Affinity Ceiling that caps performance in low-affinity task categories. External validation on specialist ViTARC models supports the architectural nature of these ceilings and favors hybrid, affinity-aligned architectures over sheer scaling. The study provides a reproducible diagnostic toolkit to steer architectural innovation toward modular, prior-driven components tailored to task primitives.

Abstract

Responding to Hodel et al.'s (2024) call for a formal definition of task relatedness in re-arc, we present the first 9-category taxonomy of all 400 tasks, validated at 97.5% accuracy via rule-based code analysis. We prove the taxonomy's visual coherence by training a CNN on raw grid pixels (95.24% accuracy on S3, 36.25% overall, 3.3x chance), then apply the taxonomy diagnostically to the original ARC-AGI-2 test set. Our curriculum analysis reveals 35.3% of tasks exhibit low neural affinity for Transformers, a distributional bias mirroring ARC-AGI-2. To probe this misalignment, we fine-tuned a 1.7M-parameter Transformer across 302 tasks, revealing a profound Compositional Gap: 210 of 302 tasks (69.5%) achieve >80% cell accuracy (local patterns) but <10% grid accuracy (global synthesis). This provides direct evidence for a Neural Affinity Ceiling Effect, where performance is bounded by architectural suitability, not curriculum. Applying our framework to Li et al.'s independent ViTARC study (400 specialists, 1M examples each) confirms its predictive power: Very Low affinity tasks achieve 51.9% versus 77.7% for High affinity (p<0.001), with a task at 0% despite massive data. The taxonomy enables precise diagnosis: low-affinity tasks (A2) hit hard ceilings, while high-affinity tasks (C1) reach 99.8%. These findings indicate that progress requires hybrid architectures with affinity-aligned modules. We release our validated taxonomy,

Paper Structure

This paper contains 96 sections, 1 equation, 10 figures, 9 tables.

Figures (10)

  • Figure A1: Bar chart showing the distribution of 400 re-arc tasks across 9 taxonomy categories, color-coded by Neural Affinity level (High: green, Medium: blue, Low: orange, Very Low: red). The horizontal dashed line at 35.3% marks the total proportion of Low and Very Low affinity tasks in the curriculum, demonstrating significant bias toward architecturally challenging primitives. S3 (Topological) is the dominant category at 27.0%, followed by C1 (Color Transform) at 24.8%. Categories are sorted by count (descending), with affinity levels indicated by color to reveal the curriculum's architectural challenge profile. See the reproducibility appendix for generation details.
  • Figure A2: Grid-Cell Dissociation During Training. Left panel: Grid accuracy peaks at epoch 23 (3.25%) before plateauing. Right panel: Cell accuracy continues improving to epoch 36 (81.80%). The 13-epoch gap reveals the model learns local patterns but struggles with global composition.
  • Figure A3: Ablation Study Development Insights. Left panel: Decoder-only (Exp-1) exhibits pathological training dynamics, peaking at epoch 1 before degrading, while encoder-decoder (Exp0) maintains stable improvement. Right panel: Increasing max_grid_size from 30 to 35 causes 31% performance degradation, demonstrating critical sensitivity to architectural hyperparameters.
  • Figure A4: LoRA Fine-Tuning Efficiency by Affinity Level. Left panel: High-affinity tasks (base cell accuracy >85%) achieve substantially higher final grid accuracy (58.0%) than low-affinity tasks (<70%: 8.0%). Right panel: Convergence speed correlates with affinity; high-affinity tasks reach 90% of final performance in 15 epochs vs. 45 epochs for low-affinity tasks. This demonstrates that architectural mismatch creates efficiency ceilings independent of data quantity.
  • Figure A5: Compositional Gap Robustness Across Threshold Ranges. Heatmap showing the percentage of tasks (out of 302 fine-tuned models) exhibiting the compositional gap pattern for varying cell accuracy (70–95%) and grid accuracy (0–20%) thresholds. The blue box marks our operating point (>80% cell, <10% grid, 69.5% of tasks). The gap remains substantial (>50%) across 65.2% of the threshold space, demonstrating this is a robust architectural phenomenon, not a threshold-selection artifact. See the reproducibility appendix for the generation script.
  • ...and 5 more figures
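
The Compositional Gap described in the abstract and in Figures A2 and A5 rests on two metrics: cell accuracy (fraction of grid cells predicted correctly) and grid accuracy (fraction of grids predicted exactly). The following is a minimal sketch of how such a check could be computed, not the authors' released code. All function names are hypothetical, and scoring a shape-mismatched prediction as 0 cell accuracy is an assumption. The default thresholds are the operating point stated above (>80% cell, <10% grid).

```python
import numpy as np


def cell_accuracy(pred, target):
    """Fraction of matching cells. Assumption: a shape mismatch scores 0.0."""
    pred, target = np.asarray(pred), np.asarray(target)
    if pred.shape != target.shape:
        return 0.0
    return float((pred == target).mean())


def grid_accuracy(pred, target):
    """1.0 only if the predicted grid matches the target exactly, else 0.0."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(pred.shape == target.shape and np.array_equal(pred, target))


def task_scores(pairs):
    """Mean cell and grid accuracy over (prediction, target) pairs of one task."""
    cells = [cell_accuracy(p, t) for p, t in pairs]
    grids = [grid_accuracy(p, t) for p, t in pairs]
    return float(np.mean(cells)), float(np.mean(grids))


def has_compositional_gap(cell_acc, grid_acc, cell_min=0.80, grid_max=0.10):
    """True when local accuracy is high but exact-grid accuracy is low."""
    return cell_acc > cell_min and grid_acc < grid_max


def gap_fraction(scores, cell_min=0.80, grid_max=0.10):
    """Share of tasks whose (cell_acc, grid_acc) pair shows the gap."""
    flags = [has_compositional_gap(c, g, cell_min, grid_max) for c, g in scores]
    return sum(flags) / len(flags)
```

Calling `gap_fraction` over a range of `cell_min` and `grid_max` values would give a threshold-robustness sweep of the kind Figure A5 summarizes. A single wrong cell in an otherwise correct grid leaves cell accuracy high while dropping grid accuracy to zero, which is the dissociation the two metrics are meant to expose.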