Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability

Chenhui Xu, Dancheng Liu, Jiajie Li, Amir Nassereldine, Zhaohui Li, Jinjun Xiong

TL;DR

This work develops a theoretical foundation for multi-round reasoning in auto-regressive LLMs by analyzing approximation, learnability, and generalization under finite context. It shows that Transformers with a bounded context window can universally approximate a finite number of Turing machine steps, and that multi-round generation extends this to general Turing-computable sequence-to-sequence functions via iterative refinement. By extending PAC learning to sequence generation, it derives sample complexities for next-token and long-sequence generation, showing exponential growth with sequence length that multi-round decomposition can mitigate. The study also analyzes how generalization error propagates across rounds: without intervention the error can diverge, while strategies such as Chain-of-Thought or multi-agent debate can substantially constrain the cumulative error. Together, these results clarify the theoretical underpinnings of multi-round reasoning and offer practical guidance on training and prompting strategies for managing inference complexity in long-horizon tasks.

Abstract

Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reasoning, commonsense reasoning, and multi-hop question answering. Despite these successes, the theoretical basis for how multi-round reasoning enhances problem-solving abilities remains underexplored. In this work, we investigate the approximation, learnability, and generalization properties of multi-round auto-regressive models. We show that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. We extend PAC learning to sequence generation and demonstrate that multi-round generation is learnable even when the sequence length exceeds the model's context window. Finally, we examine how generalization error propagates across rounds, and show how the aforementioned approaches can help constrain this error, ensuring outputs stay within an expectation boundary. This work sheds light on the systemic theoretical foundations of multi-round sequence learning and reasoning, emphasizing its role in inference complexity.
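
To make the multi-round setting concrete, the following is a minimal sketch of auto-regressive generation in which the model never reads more than a fixed window of recent tokens, yet the total sequence grows past that window over several rounds. Everything in it is an illustrative assumption rather than the paper's construction: the names `generate_round`, `multi_round`, and `toy_next_token` are hypothetical, and the "model" is a hand-written rule (Fibonacci modulo 10) standing in for a trained Transformer.

```python
from typing import Callable, List

def generate_round(next_token: Callable[[List[int]], int],
                   context: List[int], window: int, round_len: int) -> List[int]:
    """Generate round_len tokens, reading at most `window` recent tokens each step."""
    out: List[int] = []
    for _ in range(round_len):
        visible = (context + out)[-window:]  # bounded context window
        out.append(next_token(visible))
    return out

def multi_round(next_token: Callable[[List[int]], int], prompt: List[int],
                window: int, round_len: int, rounds: int) -> List[int]:
    """Chain several rounds; each round starts from the truncated tail of the sequence."""
    sequence = list(prompt)
    for _ in range(rounds):
        carry = sequence[-window:]  # only this tail is passed to the next round
        sequence += generate_round(next_token, carry, window, round_len)
    return sequence

def toy_next_token(visible: List[int]) -> int:
    """Hypothetical stand-in for a model: Fibonacci mod 10 (needs the last two tokens)."""
    return (visible[-1] + visible[-2]) % 10

seq = multi_round(toy_next_token, prompt=[0, 1], window=4, round_len=3, rounds=4)
print(seq)  # 14 tokens produced with a window of only 4
```

The toy rule works here only because each token depends on information that fits inside the window; the sketch does not model approximation error or how such error might accumulate across rounds.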

Paper Structure

This paper contains 46 sections, 15 theorems, 153 equations.

Key Result

Lemma 4.1

Let $\mathcal{M}$ be any deterministic Turing Machine that operates in $S$ steps. For any $\epsilon > 0$, there exists a Transformer model $\mathcal{T}$ characterized by a finite number of layers $L$, layer dimension $d$, attention window size $k$, and quantization levels $Q$, such that for all computational steps $s \leq S$, $d(H_s, \phi(C_s)) \leq \epsilon.$
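
As a rough illustration of the objects the lemma quantifies over, the sketch below runs a deterministic Turing machine for a fixed number of steps and records its configurations $C_0, \dots, C_S$. The machine itself (a bit flipper), the transition table `DELTA`, and the function names are hypothetical choices made for this example. The encoding $\phi$, the Transformer states $H_s$, and the distance $d$ are not specified in this summary and are not modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy machine (not from the paper): flips each bit, halts at the first blank.
# Transition table: (state, symbol) -> (new_state, written_symbol, head_move)
DELTA: Dict[Tuple[str, str], Tuple[str, str, int]] = {
    ("flip", "0"): ("flip", "1", +1),
    ("flip", "1"): ("flip", "0", +1),
    ("flip", "_"): ("halt", "_", 0),
}

Config = Tuple[str, int, Tuple[str, ...]]  # (state, head position, tape)

def step(config: Config) -> Config:
    """Apply one transition; only the state and the symbol under the head are read."""
    state, head, tape = config
    if state == "halt":
        return config
    new_state, written, move = DELTA[(state, tape[head])]
    new_tape = tape[:head] + (written,) + tape[head + 1:]
    return (new_state, head + move, new_tape)

def run(tape: str, max_steps: int) -> List[Config]:
    """Return the configurations C_0, ..., C_S for S = max_steps."""
    config: Config = ("flip", 0, tuple(tape) + ("_",))
    trace = [config]
    for _ in range(max_steps):
        config = step(config)
        trace.append(config)
    return trace

trace = run("1011", max_steps=5)
final_state, _, final_tape = trace[-1]
print(final_state, "".join(final_tape))
```

Each call to `step` touches a single tape cell, so every configuration differs from its predecessor only locally. The lemma's claim is that a Transformer can track each such $C_s$ to within $\epsilon$ for all $s \leq S$.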

Theorems & Definitions (31)

  • Definition 3.1: Sequential PAC learnability
  • Definition 3.2
  • Lemma 4.1
  • Lemma 4.2
  • Theorem 4.3: Approximability
  • Lemma 5.5: Rademacher complexity boundary for next token prediction
  • Lemma 5.6
  • Theorem 5.7: Sample Complexity for Next-token Learning
  • Theorem 5.8: Sample Complexity for Sequence Learning
  • Theorem 5.9: Sample Complexity for Multi-Round Sequence Learning
  • ...and 21 more