THREAD: Thinking Deeper with Recursive Spawning

Philip Schroeder, Nathaniel Morgan, Hongyin Luo, James Glass

TL;DR

This work tackles long-context and complex reasoning in large language models by introducing THREAD, a recursive spawning framework that models generation as interacting threads. A parent thread can offload intermediate thinking or data gathering to child threads, which return only the necessary tokens, so the model can adjust its computational effort dynamically. Across five benchmarks (ALFWorld, TextCraft, WebShop, DataCommons QA, and MIMIC-III ICU QA) and multiple model scales, THREAD achieves state-of-the-art results, with pronounced gains for smaller models, and demonstrates robust task decomposition and feedback-driven adaptation. The approach relies on a unified few-shot prompting strategy and on join synchronization to coordinate tasks, which makes it suited to data-grounded, multi-step tasks.

Abstract

Large language models (LLMs) have shown impressive capabilities across diverse settings, but still struggle as the length and complexity of the context increase. To address this challenge, we propose Thinking Recursively and Dynamically (ThReaD). THREAD frames model generation as a thread of execution that, based on the context, can run to completion or dynamically spawn new threads. By spawning, threads can offload work (e.g., thinking, retrieving information) to child threads, which return only the tokens needed for the parent thread to do its work. In effect, this enables the model to adapt, as needed, the amount of intermediate work used to produce tokens. We apply THREAD in the settings of LLM task solving and question answering, where the dynamic threading allows the model to recursively decompose the given task or question into progressively simpler sub-problems that can be solved by separate child threads. We test THREAD, implemented using a few-shot learning approach, on diverse benchmarks for agent tasks and data-grounded question answering. THREAD achieves state-of-the-art performance with GPT-4 and GPT-3.5 on these benchmarks, including ALFWorld, TextCraft, and WebShop, along with two new benchmarks, DataCommons QA and MIMIC-III ICU QA. In addition, THREAD outperforms existing frameworks by 10 to 50 absolute percentage points with smaller models, including Llama-3-8b and CodeLlama-7b.
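
The spawn-and-return control flow described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the `SPAWN_PREFIX` marker, the `SCRIPT` table, and the `toy_step` function are hypothetical stand-ins for a few-shot-prompted LLM, and the `phi` and `psi` helpers only borrow their names from the Figure 1 caption. The sketch assumes join synchronization, so a parent pauses until its child returns.

```python
from typing import Callable, Dict, List, Optional

# Assumed marker for a spawn request; the paper's actual syntax is not shown here.
SPAWN_PREFIX = "spawn: "

# A step function returns the thread's next line, or None when the thread is done.
StepFn = Callable[[str, List[str]], Optional[str]]


def phi(parent_lines: List[str]) -> str:
    """Parent-to-child flow: the child's context is the parent's last line."""
    return parent_lines[-1][len(SPAWN_PREFIX):]


def psi(child_lines: List[str]) -> str:
    """Child-to-parent flow: only the child's final line is returned."""
    return child_lines[-1] if child_lines else ""


def run_thread(step: StepFn, context: str, depth: int = 0, max_depth: int = 4) -> List[str]:
    """Run one thread to completion, spawning children with join synchronization."""
    lines: List[str] = []
    while True:
        line = step(context, lines)
        if line is None:
            return lines
        lines.append(line)
        if line.startswith(SPAWN_PREFIX) and depth < max_depth:
            # The parent pauses here until the child thread finishes.
            child_lines = run_thread(step, phi(lines), depth + 1, max_depth)
            lines.append(psi(child_lines))


# Hypothetical scripted "model": each context maps to the lines the thread emits.
SCRIPT: Dict[str, List[str]] = {
    "make tea": ["spawn: boil water", "spawn: steep leaves", "tea is ready"],
    "boil water": ["water boiled"],
    "steep leaves": ["leaves steeped"],
}


def toy_step(context: str, lines: List[str]) -> Optional[str]:
    """Return the first scripted line for this context that is not yet emitted."""
    for scripted in SCRIPT.get(context, []):
        if scripted not in lines:
            return scripted
    return None
```

Calling `run_thread(toy_step, "make tea")` yields the parent's full token sequence, in which each spawn line is followed by the single line its child returned. The children's intermediate work never enters the parent's context, which is the property the abstract attributes to spawning.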

Paper Structure (36 sections, 8 figures, 12 tables, 2 algorithms)


Figures (8)

  • Figure 1: Thread with join synchronization. Thread frames model generation as an execution thread that can dynamically spawn new threads. In the example with join synchronization, when a thread spawns a child, it pauses generation until feedback is returned. Child threads generate starting from context derived from their parent's token sequence. When the child completes, it returns output tokens (colored bars), which are added to the context of the parent before it continues generating. $\phi$ and $\psi$ are functions that control information flow from parent to child and from child to parent, respectively, by defining the tokens that are propagated based on the thread's token sequence.
  • Figure 2: When given a task (a) or question (b), Thread can be used to help the model, through recursive spawning, decompose the problem into progressively simpler sub-problems that are solved by child threads. In these examples, the context for a child thread is based on the last line of the parent's token sequence.
  • Figure 3: Thread allows the model to adapt the amount of supplemental work used to produce tokens.
  • Figure 4: Failure counts in ALFWorld and TextCraft for different methods (a and b) and for modified versions of Thread (c and d).
  • Figure 5: Number of times GPT-3.5 misinterprets environment feedback when executing action sequences in ALFWorld (a) and TextCraft (b). Values are shown as $\log_{10}(\text{count})$ to allow the total counts (shown in gray) to fit in the same figure as the error counts (shown in green).
  • ...and 3 more figures