Lita: Light Agent Uncovers the Agentic Coding Capabilities of LLMs

Hankun Dai, Maoquan Wang, Mengnan Qi, Yikai Zhang, Zijian Jin, Yongqiang Yao, Yufan Huang, Shengyu Fu, Elsie Nallipogu

TL;DR

This work addresses fair and truthful evaluation of LLM-based coding agents by reducing scaffolding and avoiding task-specific optimizations. It introduces Lita, a Lite Agent that decouples the LLM, tools, and environment and converts standard benchmarks into unified, multi-turn agentic tasks, enabling more faithful assessments. Across Aider Polyglot and SWE-Bench, Lita achieves competitive or superior performance with substantially lower token costs and design effort, supporting the Agent Complexity Law that the performance gap between simple and complex agent designs shrinks as model capability improves ($P_{\text{rel}}$ decreases with stronger models). The approach offers a practical path toward fairer benchmarking and clearer insight into the intrinsic coding competence of modern LLMs, with implications for future agent design and evaluation environments.

Abstract

Large language models (LLMs) are increasingly being applied to programming tasks, ranging from single-turn code completion to autonomous agents. Current code agent designs frequently depend on complex, hand-crafted workflows and tool sets. However, this reliance on elaborate scaffolding presents several challenges: agent performance becomes overly dependent on prompt tuning and custom design choices, heavy human intervention obscures a model's true underlying capabilities, and intricate pipelines are costly to build and maintain. Furthermore, optimizing complex task prompts increases the risk of data leakage. Currently, when introducing new models, LLM providers like OpenAI and Anthropic often publish benchmark scores to demonstrate their models' coding proficiency, but keep their proprietary evaluation frameworks confidential. To address these limitations, we introduce Lita (Lite Agent), which operationalizes liteness, a principle of minimizing manual design while retaining the essential elements of a fully autonomous agent. Lita enables a more faithful and unified evaluation without elaborate scaffolding. Experiments on the Aider Polyglot and SWE-Bench with frontier models demonstrate that Lita achieves competitive or superior performance compared to workflow-based and agentic baselines. Crucially, Lita also consumes fewer tokens and requires significantly less design effort. Our results suggest that Lita is sufficient to reveal the underlying coding competence of modern LLMs. Finally, we propose the Agent Complexity Law: the performance gap between agents of varying complexity, from simple to sophisticated designs, will shrink as the core model improves, ultimately converging to a negligible difference.

Paper Structure

This paper contains 22 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The upper-left sub-diagram shows the workflow agent for the Aider Polyglot benchmark. The upper-right sub-diagram shows the Agentless workflow for SWE-Bench. The lower sub-diagram shows our Lita autonomous agent framework, whose key modules are LLM, Memory, Tools, Reasoning, and Environment.
  • Figure 2: This figure presents an agent's prompt design. The left diagram shows the general components of an agent system prompt, while the right provides a specific example of Lita on SWE-Bench. Specifically, the task template requires four essential components: Initial State, Task Description, Output State, and Validation Steps.
  • Figure 3: Agent Intrinsic Complexity and performance gaps between simple and complex agents. Models are sorted in ascending order of their scores on each task.
  • Figure 4: Diff block vs string replace
  • Figure 5: Tool call proportions
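
The Figure 2 caption names four components that a task template requires: Initial State, Task Description, Output State, and Validation Steps. As a rough illustration of how such a template might be represented and rendered into a prompt, here is a minimal Python sketch. The class name `TaskTemplate`, the `render` method, the field names, and the example text are all hypothetical and are not taken from the paper or from Lita's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTemplate:
    """Hypothetical container for the four components named in Figure 2."""

    initial_state: str
    task_description: str
    output_state: str
    validation_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the four components into one plain-text task prompt."""
        # Number the validation steps starting from 1.
        steps = "\n".join(
            f"{i}. {step}" for i, step in enumerate(self.validation_steps, 1)
        )
        return (
            f"Initial State:\n{self.initial_state}\n\n"
            f"Task Description:\n{self.task_description}\n\n"
            f"Output State:\n{self.output_state}\n\n"
            f"Validation Steps:\n{steps}"
        )


# Illustrative values only; the wording is an assumption, not the paper's prompt.
example = TaskTemplate(
    initial_state="A repository is checked out in the working directory.",
    task_description="Resolve the issue described below.",
    output_state="The source files are edited so the issue no longer reproduces.",
    validation_steps=["Run the existing test suite.", "Confirm the tests pass."],
)
prompt = example.render()
```

Keeping the four components as separate fields would let the same structure be filled in for different benchmarks without changing the surrounding prompt text.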