Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More

Arvid Frydenlund

TL;DR

The Path-Star Task (PST) tests the graph-search ability of decoder-only language models trained with next-token prediction and exposes a Clever Hans cheat (CHC), a learned shortcut that leaves models at the $1/D$ chance baseline. The authors show that the PST becomes learnable when supervision is structured to induce subtask decomposition, using methods such as token masking, Ranking-into-the-Future (RITF), scratchpads, a topology shift to tree-star graphs, generalized queries, and length variation. Key contributions include introducing RITF, demonstrating that decomposition is essential for learnability, and showing that graph topology and online data generation significantly influence outcomes. The work has practical implications for planning and graph reasoning in LMs, and it acknowledges two limitations: scaling to larger graphs remains difficult, and decomposition-guided supervision is needed to avoid spurious shortcuts.

Abstract

This work concerns the path-star task, a minimal example of searching over a graph. The graph, $G$, is star-shaped with $D$ arms radiating from a start node, $s$. A language model (LM) is given $G$, $s$, and a target node, $t$, which ends one of the arms, and is tasked with generating the arm containing $t$. The minimal nature of this task means only a single choice needs to be made: which of the $D$ arms contains $t$? Decoder-only LMs fail to solve this elementary task above $1/D$ chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision, and we present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task's minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.
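To make the task setup concrete, the following is a minimal sketch of how a path-star instance could be generated and solved by a plain reverse search. It is not the authors' data pipeline: the function names `make_path_star` and `arm_by_reverse_search`, the `vocab_size` parameter, and the convention that $M$ counts the nodes of an arm including $s$ (inferred from the example in Figure 1) are all assumptions made for illustration. The tokenization of the edge list is left out because its exact format is not given here.

```python
import random


def make_path_star(D, M, vocab_size, seed=0):
    """Build a hypothetical path-star instance.

    Assumption: each arm has M nodes including the shared start node s,
    so the graph has D * (M - 1) + 1 distinct nodes.
    """
    rng = random.Random(seed)
    n_nodes = D * (M - 1) + 1
    nodes = rng.sample(range(vocab_size), n_nodes)  # distinct node labels
    s = nodes[0]
    arms = []
    for d in range(D):
        rest = nodes[1 + d * (M - 1): 1 + (d + 1) * (M - 1)]
        arms.append([s] + rest)
    # Directed edges pointing away from s, presented in shuffled order.
    edges = [(arm[i], arm[i + 1]) for arm in arms for i in range(M - 1)]
    rng.shuffle(edges)
    target_arm = rng.choice(arms)
    return {
        "edges": edges,
        "s": s,
        "t": target_arm[-1],      # target node ends one arm
        "R_t": target_arm,        # the arm the model must generate
        "l_t": target_arm[1],     # leading node: the single real choice
    }


def arm_by_reverse_search(edges, s, t):
    """Recover the target arm by walking parent pointers from t back to s."""
    parent = {v: u for u, v in edges}  # every non-start node has one parent
    path = [t]
    while path[-1] != s:
        path.append(parent[path[-1]])
    return path[::-1]


instance = make_path_star(D=12, M=5, vocab_size=100, seed=0)
recovered = arm_by_reverse_search(instance["edges"], instance["s"], instance["t"])
```

The reverse search never has to choose between arms, which is one way to see why the task is elementary for a classical algorithm: the only decision when generating the arm left to right is the leading node `l_t`, and a uniform guess over the $D$ arms succeeds with probability $1/D$.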

Paper Structure

This paper contains 51 sections, 3 equations, 22 figures, 9 tables.

Figures (22)

  • Figure 1: An example path-star graph. $D=12$, $M=5$, $s$ is '29', $t$ is '2', $R_t$ is '29 12 6 59 2', and $l_t$ is '12'. We omit eight incorrect arms for space. The task is to generate $R_t$ given a query, $Q=(s,\,t)$, and the graph, $G$, as a tokenized shuffled edge list (see Fig. 2).
  • Figure 2: A tokenization corresponding to Fig. 1. We omit any edges belonging to the omitted incorrect arms.
  • Figure 3: Baseline results. We report the Success Rate (SR) where the model predicts $> 95$% sequential accuracy over $n=5$ seeded trials and Above-Baseline (ABB) where the model predicts $> (100/D +10)$% sequential accuracy. This happens when the model can predict $l_t$ above $1/D$ chance. As such, when ABB $>$ SR ($\uparrow$), it implies that the model has overcome the main challenge of the PST and would have learnt the task had it been provided with more training time in these cases. An 'x' further indicates no trials learnt the task.
  • Figure 4: Masking results (the full table of masking results is given in the appendix).
  • Figure 5: Algorithmic steps performed in the CHC and arm reconstruction, also with masking (blacked-out).
  • ...and 17 more figures
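
The two metrics defined in the Figure 3 caption can be sketched as follows. This is an illustrative reading rather than the authors' evaluation code: the function name `sr_and_abb` is hypothetical, and reporting each metric as the fraction of trials that clear its threshold is an assumption. Only the thresholds themselves, $95$% for SR and $(100/D + 10)$% for ABB, come from the caption.

```python
def sr_and_abb(trial_accuracies, D):
    """Return (SR, ABB) for one experimental setting.

    trial_accuracies: sequential accuracy in percent, one value per seeded trial.
    D: number of arms, so 100 / D is the chance baseline in percent.
    Assumption: both metrics are fractions of trials above their threshold.
    """
    abb_threshold = 100.0 / D + 10.0  # chance baseline plus a 10-point margin
    n = len(trial_accuracies)
    sr = sum(acc > 95.0 for acc in trial_accuracies) / n
    abb = sum(acc > abb_threshold for acc in trial_accuracies) / n
    return sr, abb


# Five hypothetical trials with D = 12 (chance is about 8.3%).
sr, abb = sr_and_abb([99.0, 96.5, 40.0, 8.0, 9.1], D=12)
```

In this made-up example ABB exceeds SR because one trial (40%) predicts $l_t$ above chance without reaching 95%, which is the situation the caption marks with $\uparrow$.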