KLong: Training LLM Agent for Extremely Long-horizon Tasks

Yue Liu, Zhiyuan Hu, Flood Sung, Jiaheng Zhang, Bryan Hooi

TL;DR

KLong targets extremely long-horizon tasks, where the task horizon exceeds the model's context window. It introduces Research-Factory, a pipeline that generates high-quality long-horizon training data from research papers, and uses it to distill thousands of long trajectories from Claude 4.5 Sonnet (Thinking). Training proceeds in two phases. Trajectory-splitting supervised fine-tuning keeps the early context as a shared prefix and overlaps consecutive sub-trajectories. Progressive reinforcement learning then extends the rollout timeout stage by stage, which stabilizes learning and lengthens the horizon the agent can handle. KLong-106B achieves state-of-the-art performance among open-source models on PaperBench and generalizes to SWE-bench Verified, MLE-bench, and other benchmarks, outperforming several baselines and even some closed-source models on specific tasks. These findings suggest a practical path to general-purpose LLM agents that can sustain long-duration reasoning, planning, and experimentation in complex coding and research-reproduction tasks.

Abstract

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
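
The trajectory-splitting idea in the abstract (preserve early context, truncate later context to fit the window, keep overlap between sub-trajectories) can be illustrated with a minimal sketch. It is not the authors' implementation: it counts in turns rather than tokens, and the function name `split_trajectory` and its parameters `prefix_len`, `window`, and `overlap` are hypothetical names chosen for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_trajectory(
    turns: Sequence[T], prefix_len: int, window: int, overlap: int
) -> List[List[T]]:
    """Split a long trajectory into sub-trajectories that fit a window.

    Assumptions (not from the paper): lengths are measured in turns,
    the first `prefix_len` turns are the pinned early context, and
    consecutive sub-trajectories share `overlap` turns.
    """
    if not 0 <= prefix_len < window:
        raise ValueError("prefix must be shorter than the window")
    body_len = window - prefix_len  # room left after the pinned prefix
    if not 0 <= overlap < body_len:
        raise ValueError("overlap must be smaller than the free window space")

    prefix = list(turns[:prefix_len])
    body = list(turns[prefix_len:])
    step = body_len - overlap  # how far the window advances each time

    chunks: List[List[T]] = []
    start = 0
    while True:
        # Every chunk starts with the pinned prefix, then a slice of the body.
        chunks.append(prefix + body[start:start + body_len])
        if start + body_len >= len(body):
            break  # this chunk reached the end of the trajectory
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in split_trajectory(list(range(10)), prefix_len=2, window=5, overlap=1):
        print(chunk)
```

With ten turns, a two-turn prefix, a window of five, and an overlap of one, every sub-trajectory begins with turns 0 and 1, and each one repeats the last turn of its predecessor.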

Paper Structure (19 sections, 9 equations, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Performance of KLong on 5 Agentic Benchmarks. The training is tailored to PaperBench, and generalizes long-horizon ability to the other 4 benchmarks.
  • Figure 2: Long-horizon Task vs. Extremely Long-horizon Task. We demonstrate the challenges of the extremely long-horizon tasks by comparing 2 extremely long-horizon tasks, MLE-bench & PaperBench, with 2 long-horizon tasks, SWE-bench Verified & Terminal-Bench 2.0, in terms of time and turns.
  • Figure 3: Research-Factory: Pipeline of Scaling Training Data for Research Reproducing Task. First, the search agent collects basic data of accepted papers from ICML, NeurIPS, and ICLR conferences. Then, the filter selects the data based on the quality and impact of the papers. The PDF is converted to Markdown. The official GitHub URL is added to the blacklist.txt file to avoid cheating. Last, the evaluation agent designs the addendum and the rubric tree by analyzing the paper and the official code implementation.
  • Figure 4: Trajectory-splitting Supervised Fine-tuning. To train with extremely long trajectories, we split them by 1) pinning the paper-reading segment at the beginning of the context, 2) progressively truncating the context to fit the context window, and 3) overlapping sub-trajectories to preserve contextual continuity.
  • Figure 5: Pipeline Imbalance in Extremely Long-horizon RL. Because full tasks are prohibitively long, a fixed timeout causes rollouts to end synchronously, triggering a congested synchronous judge and leaving rollout nodes idle. We mitigate this issue via partial rollouts and a priority-based judge queue.
  • ...and 4 more figures
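
The progressive RL schedule described in the abstract, where training is divided into stages with progressively extended timeouts, can be sketched as a lookup from training step to rollout timeout. This is only an illustration under stated assumptions: the stage boundaries and timeout values below are hypothetical placeholders, not numbers from the paper, and `timeout_for_step` is a made-up helper name.

```python
from bisect import bisect_right
from typing import Sequence


def timeout_for_step(
    step: int, stage_starts: Sequence[int], timeouts_s: Sequence[float]
) -> float:
    """Return the rollout timeout (seconds) in effect at a training step.

    `stage_starts[i]` is the first step of stage i and `timeouts_s[i]`
    is that stage's timeout. Both sequences are assumptions made for
    this sketch.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if not stage_starts or len(stage_starts) != len(timeouts_s):
        raise ValueError("need one timeout per stage")
    if stage_starts[0] != 0 or list(stage_starts) != sorted(stage_starts):
        raise ValueError("stage starts must be sorted and begin at step 0")
    if list(timeouts_s) != sorted(timeouts_s):
        raise ValueError("timeouts must not shrink from stage to stage")

    # Index of the last stage whose start is <= step.
    stage = bisect_right(stage_starts, step) - 1
    return timeouts_s[stage]


# Hypothetical schedule: three stages with growing timeouts.
STAGE_STARTS = [0, 100, 200]
TIMEOUTS_S = [2 * 3600, 4 * 3600, 6 * 3600]

if __name__ == "__main__":
    for s in (0, 99, 100, 250):
        print(s, timeout_for_step(s, STAGE_STARTS, TIMEOUTS_S))
```

Keeping the schedule as data rather than hard-coding it makes it easy to add stages or change timeouts without touching the training loop.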