Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective

Junnan Liu, Hongwei Liu, Linchen Xiao, Shudong Liu, Taolin Zhang, Zihan Ma, Songyang Zhang, Kai Chen

TL;DR

RaML reframes LLM reasoning as a meta-learning problem by treating reasoning trajectories as pseudo-gradient updates that adapt model parameters in an inner loop, while a second-order outer loop optimizes a meta-initialization for rapid adaptation. The framework connects common training paradigms (SFT, RL, PO) with meta-learning principles (MAML, L2O), and demonstrates with extensive experiments that longer, carefully structured reasoning trajectories improve both stability and performance. Empirical results show trajectory-based training enhances generalization within and across domains and reveal how token types within trajectories influence optimization dynamics. The work outlines practical directions to improve LLM reasoning through meta-learning concepts, including trajectory manipulation, efficiency improvements, and hybrid training strategies.

Abstract

We propose a novel framework for comprehending the reasoning capabilities of large language models (LLMs) through the perspective of meta-learning. By conceptualizing reasoning trajectories as pseudo-gradient descent updates to the LLM's parameters, we identify parallels between LLM reasoning and various meta-learning paradigms. We formalize the training process for reasoning tasks as a meta-learning setup, with each question treated as an individual task, and reasoning trajectories serving as the inner loop optimization for adapting model parameters. Once trained on a diverse set of questions, the LLM develops fundamental reasoning capabilities that can generalize to previously unseen questions. Extensive empirical evaluations substantiate the strong connection between LLM reasoning and meta-learning, exploring several issues of significant interest from a meta-learning standpoint. Our work not only enhances the understanding of LLM reasoning but also provides practical insights for improving these models through established meta-learning techniques.
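The inner-loop/outer-loop structure described in the abstract can be illustrated with a toy MAML-style sketch. This is not the paper's method: here each "question" is a hypothetical scalar task with loss $(\theta - c)^2$, and each inner step is an explicit gradient update standing in for the pseudo-gradient update that a reasoning trajectory is argued to induce. All names (`inner_adapt`, `meta_train`, the learning rates) are assumptions made for illustration.

```python
def task_loss(theta, target):
    """Toy per-question loss: squared distance to the task's target."""
    return (theta - target) ** 2


def inner_adapt(theta, target, steps, lr):
    """Inner loop: `steps` gradient-descent updates on one task.

    Stand-in for a reasoning trajectory; each step plays the role of
    one pseudo-gradient update to the parameters.
    """
    for _ in range(steps):
        grad = 2.0 * (theta - target)  # d/dtheta of (theta - target)^2
        theta = theta - lr * grad
    return theta


def meta_train(targets, theta0=0.0, inner_steps=3, inner_lr=0.1,
               outer_lr=0.05, epochs=500):
    """Outer loop: optimize the shared initialization across tasks.

    Differentiates through the inner loop (second-order, MAML-style).
    For this quadratic loss the derivative of the adapted parameter
    with respect to the initialization is (1 - 2*inner_lr)**inner_steps.
    """
    theta = theta0
    jac = (1.0 - 2.0 * inner_lr) ** inner_steps
    for _ in range(epochs):
        meta_grad = 0.0
        for target in targets:
            adapted = inner_adapt(theta, target, inner_steps, inner_lr)
            # Chain rule: d loss(adapted) / d theta
            meta_grad += 2.0 * (adapted - target) * jac
        theta -= outer_lr * meta_grad / len(targets)
    return theta


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0]  # three toy "questions"
    theta_star = meta_train(targets)
    print("meta-initialization:", theta_star)
    for steps in (1, 5):
        adapted = inner_adapt(theta_star, 3.0, steps, 0.1)
        print(steps, "inner steps -> loss", task_loss(adapted, 3.0))
```

In this toy setting, more inner steps bring the adapted parameter closer to the task optimum, which loosely mirrors the claim that longer trajectories correspond to more inner-loop optimization; the analogy is only illustrative and says nothing about real LLM behavior.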

Paper Structure

This paper contains 62 sections, 2 theorems, 23 equations, 21 figures, 7 tables, 4 algorithms.

Key Result

Proposition 2.1

There exists a set of parameters, denoted as $\theta_t^\prime$, which includes $\left\{\bm{W}_q^\prime, \bm{W}_k^\prime, \bm{W}_v^\prime, \bm{W}_1^\prime, \bm{W}_2^\prime, b_1^\prime, b_2^\prime \right\}$, allowing eq:activation-attend-first-traj-token to be expressed in the following form: where $\theta_t^\prime$ represents the one-step update of $\theta$ and the increment $\Delta \mathcal{M}_{\

Figures (21)

  • Figure 1: Illustration of the reasoning trajectory ($t$) as the optimization of the LLM parameters $\theta$.
  • Figure 2: Landscape of the plausibility of LLMs generating accurate answers. We apply the methodology proposed by Li et al. [Li0TSG18]. The questions $q_0, q_1, q_2, q_3$ are selected from AIME24. Additionally, we project the trajectory of the pseudo-gradient update onto the landscape (purple line). See \ref{sec:demonstrated-questions} for more details.
  • Figure 3: Visualization of the pseudo-gradient update. The $x$-axis represents the normalized indices of the corresponding trajectories. $q_0, q_1, q_2, q_3$ are questions selected from AIME24; see \ref{sec:demonstrated-questions} for more details.
  • Figure 4: Performance of base models, models trained on off-policy data, and models trained on on-policy data on the AIME24 dataset, with the $x$-axis representing the amount of training data. We generate $64$ responses for each question and report Pass@$32$ and mG-Pass@$32$. The evaluation includes prominent models such as Sky-T1-32B [sky_t1_2025], Bespoke-Stratos-32B [bespoke_stratos], LIMO [abs-2502-03387], s1.1-32B [abs-2501-19393], OpenThinker-32B [openthoughts], Light-R1-32B [abs-2503-10460], DeepSeek-R1-Distill-Qwen-32B [abs-2501-12948], DAPO-32B [abs-2503-14476], and VAPO-32B [abs-2504-05118]. These models are built on either Qwen2.5-32B or Qwen2.5-32B-Instruct and trained using only SFT or RL (Zero-RL). Since VAPO is not open source, we copy its results from the original paper.
  • Figure 5: Illustration of QwQ's pseudo-gradient update in both thinking and non-thinking modes; see \ref{app:qwen3_update} for more examples. We visualize four pairs of correct reasoning trajectories for one question in AIME24. Compared with thinking trajectories, non-thinking trajectories converge more quickly but are also more prone to falling into local optima.
  • ...and 16 more figures

Theorems & Definitions (4)

  • Proposition 2.1: One-Step Pseudo Gradient Update
  • proof
  • Theorem B.1
  • proof