LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Yukang Feng, Jianwen Sun, Zelai Yang, Jiaxin Ai, Chuanhao Li, Zizhen Li, Fanrui Zhang, Kang He, Rui Ma, Jifan Lin, Jie Sun, Yang Xiao, Sizhuo Zhou, Wenxiao Wu, Yiming Liu, Pengfei Liu, Yu Qiao, Shenglin Zhang, Kaipeng Zhang

TL;DR

LongCLI-Bench introduces a long-horizon, environment-grounded benchmark for agentic CLI programming by curating 20 complex tasks from CS assignments and real-world workflows. It employs dual-set F2P/P2P evaluation with step-level scoring to diagnose progress and failure points, revealing that current agents struggle to surpass 20% pass rates and often fail early in workflows. The study demonstrates that self-correction helps but is outpaced by strategic human planning and interactive guidance, with plan injection and dynamic human input yielding the strongest gains. The results underscore the need for synergistic human–agent workflows and advances in long-horizon planning, environmental grounding, and regression-aware execution to tackle realistic software-engineering tasks.

Abstract

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
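
The dual-set protocol with step-level scoring can be illustrated with a minimal Python sketch. This is not the benchmark's actual harness: the `StepResult` structure, the `evaluate` function, the pass criterion (every test in both sets must pass), and the step-score definition (share of steps whose fail-to-pass tests all pass) are assumptions made here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Test outcomes attached to one workflow step (hypothetical structure)."""
    name: str
    f2p: list[bool]  # fail-to-pass: failed before the change, should pass after
    p2p: list[bool]  # pass-to-pass: passed before the change, must still pass


def rate(outcomes: list[bool]) -> float:
    """Fraction of passing tests; an empty set counts as fully passing."""
    return sum(outcomes) / len(outcomes) if outcomes else 1.0


def evaluate(steps: list[StepResult]) -> dict:
    """Aggregate per-step test outcomes into task-level metrics."""
    f2p = [ok for s in steps for ok in s.f2p]
    p2p = [ok for s in steps for ok in s.p2p]
    # Assumed step-level score: share of steps whose F2P tests all pass.
    completed = sum(1 for s in steps if all(s.f2p))
    return {
        "f2p_rate": rate(f2p),  # requirement fulfillment
        "p2p_rate": rate(p2p),  # regression avoidance
        "step_score": completed / len(steps) if steps else 0.0,
        # Assumed pass criterion: every test in both sets passes.
        "passed": all(f2p) and all(p2p),
    }


if __name__ == "__main__":
    demo = [
        StepResult("setup", [True, True], [True]),
        StepResult("feature", [True, False], [True, True]),
        StepResult("cleanup", [False], [True, False]),
    ]
    print(evaluate(demo))
```

Keeping the two rates separate shows why a single pass/fail bit is uninformative for long tasks: an agent can satisfy most new requirements while still breaking existing behavior, and the step score indicates how far into the workflow it got before stalling.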

Paper Structure (26 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Task sample in LongCLI-Bench.
  • Figure 2: The LongCLI-Bench construction pipeline. We curate tasks from diverse sources and employ a parallel construction method for solutions and tests. The pipeline features a strict Dual-Set Verification mechanism with iterative refinement loops to ensure high-quality, contamination-free benchmarks across various engineering and domain categories.
  • Figure 3: Multi-Turn Self-Correction Performance.